raymondhs / fairseq-laser

My implementation of the LASER architecture in Fairseq
MIT License

Exception: Size of sample #3996 is invalid (={'fr-en': (1619, 0)}) since max_positions={'fr-en': (1024, 1024)}, #3

Closed · ever4244 closed this issue 4 years ago

ever4244 commented 4 years ago

Hi, I am testing the embeddings on the MLDoc task, and I have been trying to connect this embedding code with the default MLDoc task in the LASER folder.

I inserted a new function into the LASER mldoc.py:

```python
from fairseq import options

import embed_liwei  # my wrapper around this repo's embed.py


def encode_file_lw(input_fn, output_fn, lang, buffer_size):
    print('enter encode_file_lw')

    parser = options.get_generation_parser(interactive=False)
    parser.add_argument('--buffer_size', type=int, required=True,
                        help='buffer_size')
    parser.add_argument('--input', required=True,
                        help='input sentence file')
    parser.add_argument('--output-file', required=True,
                        help='Output sentence embeddings')
    parser.add_argument('--spm-model',
                        help='(optional) Path to SentencePiece model')

    data = '/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/data-bin/iwslt17.de_fr.en.bpe16k'
    path = '/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/checkpoints//laser_lstm5_newcodetest/checkpoint_best.pt'
    spm_model = '/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/examples/translation/iwslt17.de_fr.en.bpe16k/sentencepiece.bpe.model'
    batch_size = '128'

    if lang == 'fr':
        tar_lang = 'en'
    if lang == 'de':
        tar_lang = 'en'

    para_ls = [data,
               '--input', input_fn,
               '--task', 'translation_laser',
               '--lang-pairs', 'de-en,fr-en',
               '--path', path,
               '--source-lang', lang,
               '--target-lang', tar_lang,
               '--buffer_size', str(buffer_size),
               '--batch-size', batch_size,
               '--output-file', output_fn,
               '--spm-model', spm_model]

    args = options.parse_args_and_arch(parser, input_args=para_ls)
    embed_liwei.main(args)

    print('exit encode_file_lw')
```

I also used it to replace the BPE and embedding steps in the original LASER processing pipeline:


```python
print('\nProcessing:')

for part in ('train.1000', 'dev', 'test'):
    # for lang in "en" if part == 'train1000' else args.lang:
    for lang in args.lang:
        cfname = os.path.join(args.data_dir, 'mldoc.' + part)
        Token(cfname + '.txt.' + lang,
              cfname + '.tok.' + lang,
              lang=lang,
              romanize=(True if lang == 'el' else False),
              lower_case=True, gzip=False,
              verbose=args.verbose, over_write=False)
        SplitLines(cfname + '.tok.' + lang,
                   cfname + '.split.' + lang,
                   cfname + '.sid.' + lang)

        # Original LASER BPE step that I replaced:
        # BPEfastApply(cfname + '.split.' + lang,
        #              cfname + '.split.bpe.' + lang,
        #              args.bpe_codes,
        #              verbose=args.verbose, over_write=False)

        # Original LASER encoding step that I replaced:
        # EncodeFile(enc,
        #            cfname + '.split.bpe.' + lang,
        #            cfname + '.split.enc.' + lang,
        #            verbose=args.verbose, over_write=False,
        #            buffer_size=args.buffer_size)

        # I use this function to replace BPE and encoding; it is the wrapper
        # around your embedding function (the previous version) shown above.
        encode_file_lw(input_fn=cfname + '.split.' + lang,
                       output_fn=cfname + '.split.enc.' + lang,
                       lang=lang, buffer_size=args.buffer_size)

        JoinEmbed(cfname + '.split.enc.' + lang,
                  cfname + '.sid.' + lang,
                  cfname + '.enc.' + lang)
```

It seems to work fine when producing the English and German files, but I encountered this error when producing the French files. Do you have an idea what could cause it?

BTW: I saw you have updated the code and README. In the previous version I used, your embedding function would take in a normal text file and convert it into embeddings.

However, from what I read in your BUCC task, you apply BPE first in the current version of the code. Does that mean that if I use that version, I should also apply BPE first?

```
Extracting MLDoc data
LASER: calculate embeddings for MLDoc
 - loading encoder /home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/LASER_lw/models/bilstm.93langs.2018-12-26.pt

Processing:
 - SplitLines: embed/mldoc.train.1000.split.de already exists
enter encode_file_lw
Namespace(beam=5, bpe=None, buffer_size=4000, cpu=False, criterion='cross_entropy', data='/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/data-bin/iwslt17.de_fr.en.bpe16k', dataset_impl=None, decoder_langtok=False, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, encoder_langtok=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', input='embed/mldoc.train.1000.split.de', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, iter_decode_with_beam=1, iter_decode_with_external_reranker=False, lang_pairs='de-en,fr-en', lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1, log_format=None, log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=1, optimizer='nag', output_file='embed/mldoc.train.1000.split.enc.de', path='/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/checkpoints//laser_lstm5_newcodetest/checkpoint_best.pt', prefix_size=0, print_alignment=False, print_step=False, quiet=False, raw_text=False, remove_bpe=None, replace_unk=None, required_batch_size_multiple=8, results_path=None, retain_iter_history=False, sacrebleu=False, sampling=False, sampling_topk=-1, sampling_topp=-1.0, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=True, source_lang='de', spm_model='/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/examples/translation/iwslt17.de_fr.en.bpe16k/sentencepiece.bpe.model', target_lang='en', task='translation_laser', temperature=1.0, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, unkpen=0, unnormalized=False, upsample_primary=1, user_dir=None, warmup_updates=0, weight_decay=0.0)
| [de] dictionary: 13880 types
| [en] dictionary: 13880 types
| [fr] dictionary: 13880 types
| loading model(s) from /home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/checkpoints//laser_lstm5_newcodetest/checkpoint_best.pt
/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/models/fairseq_model.py:280: UserWarning: FairseqModel is deprecated, please use FairseqEncoderDecoderModel or BaseFairseqModel instead
  for key in self.keys
| Sentence buffer size: 4000
| Reading input sentence from stdin
exit encode_file_lw
 - JoinEmbed: embed/mldoc.train.1000.enc.de already exists
 - SplitLines: embed/mldoc.train.1000.split.fr already exists
enter encode_file_lw
Namespace(beam=5, bpe=None, buffer_size=4000, cpu=False, criterion='cross_entropy', data='/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/data-bin/iwslt17.de_fr.en.bpe16k', dataset_impl=None, decoder_langtok=False, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, encoder_langtok=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', input='embed/mldoc.train.1000.split.fr', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, iter_decode_with_beam=1, iter_decode_with_external_reranker=False, lang_pairs='de-en,fr-en', lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1, log_format=None, log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=1, optimizer='nag', output_file='embed/mldoc.train.1000.split.enc.fr', path='/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/checkpoints//laser_lstm5_newcodetest/checkpoint_best.pt', prefix_size=0, print_alignment=False, print_step=False, quiet=False, raw_text=False, remove_bpe=None, replace_unk=None, required_batch_size_multiple=8, results_path=None, retain_iter_history=False, sacrebleu=False, sampling=False, sampling_topk=-1, sampling_topp=-1.0, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=True, source_lang='fr', spm_model='/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/examples/translation/iwslt17.de_fr.en.bpe16k/sentencepiece.bpe.model', target_lang='en', task='translation_laser', temperature=1.0, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, unkpen=0, unnormalized=False, upsample_primary=1, user_dir=None, warmup_updates=0, weight_decay=0.0)
| [de] dictionary: 13880 types
| [en] dictionary: 13880 types
| [fr] dictionary: 13880 types
| loading model(s) from /home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/checkpoints//laser_lstm5_newcodetest/checkpoint_best.pt
| Sentence buffer size: 4000
| Reading input sentence from stdin
Traceback (most recent call last):
  File "mldoc.py", line 166, in <module>
    encode_file_lw(input_fn=cfname + '.split.' + lang,output_fn= cfname + '.split.enc.' + lang,lang=lang,buffer_size=args.buffer_size)
  File "mldoc.py", line 92, in encode_file_lw
    embed_liwei.main(args)
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/embed_liwei.py", line 112, in main
    for batch in make_batches(inputs, args, task, max_positions, encode_fn):
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/embed_liwei.py", line 43, in make_batches
    max_positions=max_positions,
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/tasks/fairseq_task.py", line 150, in get_batch_iterator
    indices, dataset, max_positions, raise_exception=(not ignore_invalid_inputs),
  File "/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/fairseq/data/data_utils.py", line 188, in filter_by_size
    ).format(ignored[0], dataset.size(ignored[0]), max_positions))
Exception: Size of sample #3996 is invalid (={'fr-en': (1619, 0)}) since max_positions={'fr-en': (1024, 1024)}, skip this example with --skip-invalid-size-inputs-valid-test
```
raymondhs commented 4 years ago

The input text should be preprocessed in the same way as when you train the LASER model. In my previous version, if you specify --spm-model, the code will do the SentencePiece tokenization internally.
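For reference, a minimal sketch of that tokenization step done externally (this assumes the standard `sentencepiece` Python package; the file names are hypothetical and the model path is the one from this thread):

```python
# Sketch of the SentencePiece step that --spm-model performs internally.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='sentencepiece.bpe.model')

with open('mldoc.train.1000.split.fr') as fin, \
     open('mldoc.train.1000.split.spm.fr', 'w') as fout:
    for line in fin:
        pieces = sp.encode(line.strip(), out_type=str)
        fout.write(' '.join(pieces) + '\n')
```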

```
Exception: Size of sample #3996 is invalid (={'fr-en': (1619, 0)}) since max_positions={'fr-en': (1024, 1024)}, skip this example with --skip-invalid-size-inputs-valid-test
```

This error suggests that this particular instance is too long (> 1024 tokens), so you can also check that line in your data.
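A quick way to locate such lines (a sketch; the file name comes from the log above, and counting whitespace-separated tokens only approximates the subword length the model actually sees):

```python
# Flag raw lines whose whitespace token count exceeds the position limit.
with open('embed/mldoc.train.1000.split.fr') as f:
    for i, line in enumerate(f):
        n = len(line.split())
        if n > 1024:
            print(f'line {i}: {n} tokens')
```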

ever4244 commented 4 years ago

> The input text should be preprocessed in the same way as when you train the LASER model. In my previous version, if you specify --spm-model, the code will do the SentencePiece tokenization internally.
>
> Exception: Size of sample #3996 is invalid (={'fr-en': (1619, 0)}) since max_positions={'fr-en': (1024, 1024)}, skip this example with --skip-invalid-size-inputs-valid-test
>
> This error suggests that this particular instance is too long (> 1024 tokens), so you can also check that line in your data.

Thank you. I did specify --spm-model, and I have set the buffer size to 4000; would it help to assign a larger buffer size? Where does the max token limit come from, and how can I change it with a parameter? Can I change it by setting --max-tokens?

raymondhs commented 4 years ago

--max-tokens and --buffer-size control the batch size. Could you try specifying --max-source-positions 1000000 and see if it still throws an error? (https://github.com/pytorch/fairseq/blob/master/fairseq/tasks/multilingual_translation.py#L81-L82)
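For example, in the wrapper above, one way to pass these flags would be to extend the argument list (a sketch against the `para_ls` list built in encode_file_lw; both flags are standard fairseq task options):

```python
# Raise the position limits so overlong samples are no longer rejected.
para_ls += ['--max-source-positions', '1000000',
            '--max-target-positions', '1000000']
```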

ever4244 commented 4 years ago

> --max-tokens and --buffer-size control the batch size. Could you try specifying --max-source-positions 1000000 and see if it still throws an error? (https://github.com/pytorch/fairseq/blob/master/fairseq/tasks/multilingual_translation.py#L81-L82)

Thank you. --max-source-positions worked, just very slowly.

I got fairly bad French and German MLDoc results. (Although I trained the model with a small dataset compared to Europarl, the result is no better than a random guess, so I wonder if I have not used your embed.py correctly.) I replaced the encoder part of the LASER code with your embed.py.

Here is my modification of mldoc.py in LASER. Do you think my way of using your embeddings is correct?


```python
print('\nProcessing:')
for part in ('train.1000', 'dev', 'test'):
    # for lang in "en" if part == 'train1000' else args.lang:
    for lang in args.lang:
        cfname = os.path.join(args.data_dir, 'mldoc.' + part)
        Token(cfname + '.txt.' + lang,
              cfname + '.tok.' + lang,
              lang=lang,
              romanize=(True if lang == 'el' else False),
              lower_case=True, gzip=False,
              verbose=args.verbose, over_write=False)
        SplitLines(cfname + '.tok.' + lang,
                   cfname + '.split.' + lang,
                   cfname + '.sid.' + lang)

        # Original LASER code, commented out:
        # BPEfastApply(cfname + '.split.' + lang,
        #              cfname + '.split.bpe.' + lang,
        #              args.bpe_codes,
        #              verbose=args.verbose, over_write=False)
        # EncodeFile(enc,
        #            cfname + '.split.bpe.' + lang,
        #            cfname + '.split.enc.' + lang,
        #            verbose=args.verbose, over_write=False,
        #            buffer_size=args.buffer_size)

        # I commented out the original LASER BPE and encoding steps and
        # replaced them with a wrapper around your embed.py:
        encode_file_lw(input_fn=cfname + '.split.' + lang,
                       output_fn=cfname + '.split.enc.' + lang,
                       lang=lang, buffer_size=args.buffer_size)
        # encode_file_lw(input_fn=cfname + '.split.bpe.' + lang,
        #                output_fn=cfname + '.split.enc.' + lang,
        #                lang=lang, buffer_size=args.buffer_size)

        JoinEmbed(cfname + '.split.enc.' + lang,
                  cfname + '.sid.' + lang,
                  cfname + '.enc.' + lang)


# Defined earlier in mldoc.py (pasted here for reference):
def encode_file_lw(input_fn, output_fn, lang, buffer_size):
    print('enter encode_file_lw')

    parser = options.get_generation_parser(interactive=False)
    parser.add_argument('--buffer_size', type=int, required=True,
                        help='buffer_size')
    parser.add_argument('--input', required=True,
                        help='input sentence file')
    parser.add_argument('--output-file', required=True,
                        help='Output sentence embeddings')
    parser.add_argument('--spm-model',
                        help='(optional) Path to SentencePiece model')

    data = '/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/data-bin/iwslt17.de_fr.en.bpe16k'
    path = '/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/checkpoints//laser_lstm5_newcodetest/checkpoint_best.pt'
    spm_model = '/home/wei/LIWEI_workspace/fairseq_liweimod/fairseq/examples/translation/iwslt17.de_fr.en.bpe16k/sentencepiece.bpe.model'

    if lang == 'fr':
        tar_lang = 'en'
    if lang == 'de':
        tar_lang = 'en'

    para_ls = [data,
               '--input', input_fn,
               '--task', 'translation_laser',
               '--lang-pairs', 'de-en,fr-en',
               '--path', path,
               '--source-lang', lang,
               '--target-lang', tar_lang,
               '--buffer_size', '12000',
               '--max-source-positions', '10000',
               '--output-file', output_fn,
               '--spm-model', spm_model]

    args = options.parse_args_and_arch(parser, input_args=para_ls)
    embed_liwei.main(args)

    print('exit encode_file_lw')
```

BTW, what's new in your new version of the code, apart from the BUCC experiment? Would I be better off with that version? (I am training on Europarl and testing on MLDoc.)

raymondhs commented 4 years ago

This is an example of how I made use of embed.py: https://github.com/raymondhs/fairseq-laser/blob/master/bucc.sh#L96-L118. Just make sure that the processing steps (the perl scripts + BPE) are identical to how your LASER training data was created.
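For context, a rough Python equivalent of that kind of preprocessing (a sketch, not the exact bucc.sh commands: `sacremoses` stands in for the Moses perl scripts, and the SentencePiece model path is an assumption from this thread):

```python
import sentencepiece as spm
from sacremoses import MosesPunctNormalizer, MosesTokenizer

lang = 'fr'
normalizer = MosesPunctNormalizer(lang=lang)  # stand-in for normalize-punctuation.perl
tokenizer = MosesTokenizer(lang=lang)         # stand-in for tokenizer.perl
sp = spm.SentencePieceProcessor(model_file='sentencepiece.bpe.model')

def preprocess(line):
    # Must mirror the training-data pipeline: normalize, lowercase,
    # tokenize, then apply the same subword model used at training time.
    line = normalizer.normalize(line.strip()).lower()
    line = tokenizer.tokenize(line, return_str=True)
    return ' '.join(sp.encode(line, out_type=str))
```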

I think you can try the similarity task to check the error rate for each language pair. Here are my previous scripts for testing. The error rate should be low enough (perhaps < 2%) in order to get a reasonable multilingual embedding for the other tasks. It is indeed possible that the IWSLT data is too small to get good enough embeddings (the original LASER training data was huge).
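A minimal sketch of such a similarity check (assumptions: embeddings stored as raw float32 binaries the way LASER writes them, line-aligned parallel files with hypothetical names, and `dim` matching the model's embedding size):

```python
import numpy as np

def load_emb(path, dim=1024):  # dim is an assumption; set it to your model's size
    x = np.fromfile(path, dtype=np.float32).reshape(-1, dim)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

src = load_emb('dev.enc.fr')  # embeddings of parallel sentences,
tgt = load_emb('dev.enc.en')  # line-aligned across the two files

nearest = (src @ tgt.T).argmax(axis=1)           # cosine nearest neighbor
err = (nearest != np.arange(len(src))).mean()    # fraction retrieved wrongly
print(f'error rate: {err:.2%}')
```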

In the new version, model updates during training are performed after each batch from one language pair (instead of after all batches from all language pairs, as in Fairseq's MultilingualTask). Yes, you can try training the Europarl model with the newer version. I have not tried it on the MLDoc task, but it gets reasonable performance on BUCC.
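Illustratively, the difference between the two update schedules on toy data (a sketch of the scheduling idea only, assuming PyTorch; not the repo's actual trainer):

```python
import torch

model = torch.nn.Linear(8, 8)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loaders = {'de-en': [torch.randn(4, 8) for _ in range(3)],
           'fr-en': [torch.randn(4, 8) for _ in range(3)]}

# Old behavior (Fairseq's MultilingualTask): sum losses over batches from
# all language pairs, then make a single optimizer update.
for de_b, fr_b in zip(loaders['de-en'], loaders['fr-en']):
    loss = model(de_b).pow(2).mean() + model(fr_b).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# New behavior: one update per batch from a single language pair,
# alternating between the pairs.
for de_b, fr_b in zip(loaders['de-en'], loaders['fr-en']):
    for batch in (de_b, fr_b):
        opt.zero_grad()
        model(batch).pow(2).mean().backward()
        opt.step()
```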