thompsonb / prism

MT Evaluation in Many Languages via Zero-Shot Paraphrasing

generate_paraphrases.py raises ValueError on blank line #4

Closed tuhinjubcse closed 3 years ago

tuhinjubcse commented 3 years ago
| [src] dictionary: 65400 types
| [tgt] dictionary: 65400 types
| loaded 54667 examples from: test_bin/test.src-tgt.src
| loaded 54667 examples from: test_bin/test.src-tgt.tgt
| test_bin test src-tgt 54667 examples
| loading model(s) from m39v1//checkpoint.pt
Traceback (most recent call last):                                                                                                                                                                          
  File "generate_paraphrases.py", line 478, in <module>
    cli_main()
  File "generate_paraphrases.py", line 474, in cli_main
    main(args)
  File "generate_paraphrases.py", line 374, in main
    hypos = task.inference_step(generator, models, sample, prefix_tokens)
  File "/home/tuhin.chakr/yes/envs/prismenv/lib/python3.7/site-packages/fairseq/tasks/fairseq_task.py", line 265, in inference_step
    return generator.generate(models, sample, prefix_tokens=prefix_tokens)
  File "/home/tuhin.chakr/yes/envs/prismenv/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/tuhin.chakr/yes/envs/prismenv/lib/python3.7/site-packages/fairseq/sequence_generator.py", line 113, in generate
    return self._generate(model, sample, **kwargs)
  File "/home/tuhin.chakr/yes/envs/prismenv/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/tuhin.chakr/yes/envs/prismenv/lib/python3.7/site-packages/fairseq/sequence_generator.py", line 295, in _generate
    tokens[:, :step + 1], encoder_outs, temperature=self.temperature,
  File "/home/tuhin.chakr/yes/envs/prismenv/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/tuhin.chakr/yes/envs/prismenv/lib/python3.7/site-packages/fairseq/sequence_generator.py", line 565, in forward_decoder
    temperature=temperature,
  File "/home/tuhin.chakr/yes/envs/prismenv/lib/python3.7/site-packages/fairseq/sequence_generator.py", line 587, in _decode_one
    decoder_out = list(model.forward_decoder(tokens, encoder_out=encoder_out))
  File "/home/tuhin.chakr/yes/envs/prismenv/lib/python3.7/site-packages/fairseq/models/fairseq_model.py", line 228, in forward_decoder
    return self.decoder(prev_output_tokens, **kwargs)
  File "/home/tuhin.chakr/yes/envs/prismenv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "generate_paraphrases.py", line 220, in forward
    max_prefix_len = max([len(prefix) for prefix in penalties])
ValueError: max() arg is an empty sequence

My test.src file looks fine:

▁С во ей ▁рук ой ▁у брал ▁я ▁со ▁стол а . ▁Ле са , ▁недавно ▁столь ▁ густ ые , ▁Вы ▁собира ете ся ▁в да ль ? ▁Как ой - ни буд ь ▁не прав ед ный ▁из ги б ▁О ▁чем - то ▁г рус тил ▁я , ▁чем у - то ▁сме я лся , ▁На клон ились ▁на до ▁м ной ▁ сон ные ▁си дел ки . ▁Но чь ▁тих а . ▁Пу сты ня ▁в нем лет ▁бог у , ▁Бо город и цу ▁мол ить ... ▁Гор ды й ▁в зор ▁ин оп лем енный , ▁В ▁к руж ев ах ▁и ▁бел ой ▁ки се е . ▁По став ь те , ▁не вольн ики ▁во ли , ▁В ▁них ▁и ▁во все ▁не ▁ гляд е ть . ▁И ▁всем ▁каза лось , ▁что ▁рад ость ▁будет , ▁Не под ков ан ных ▁ко пыт . ▁С казал , ▁что ▁у ▁меня ▁со пер ниц ▁нет . ▁Но чная ▁до чь ▁и ных ▁в рем ён . ▁Те ни ст ▁и ▁х ла ден ▁их ▁по рог , ▁На ▁пари ж ский ▁че рд ак ▁за гна ла . ▁Во стор жен ных ▁по х вал ▁про й дет ▁минут ный ▁шум ; ▁Про тяж но ▁по ет ▁и ▁у ны ло . ▁В ▁па ху чем ▁ту ман е ▁п лы в ут ... ▁А ▁в ▁какой - ни буд ь ▁ди кой ▁ще ли , ▁Послед ние ▁ дары ▁т во их ▁зем ных ▁за бот . ▁О кол дова на ▁ж ёл т ой ▁ лу ною : ▁И ▁сам а ▁я ▁не ▁стала ▁новой , ▁Я д ▁кап лет ▁ск воз ь ▁его ▁кор у ,

test.tgt looks fine too.

generated test_bin folder too

```
(prismenv) tuhin.chakr@piranha:~/prism$ sh preprocess.sh
Namespace(align_suffix=None, alignfile=None, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='test_bin', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=1000, lr_scheduler='fixed', memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer='nag', padding_factor=8, seed=1, source_lang='src', srcdict='m39v1//dict.tgt.txt', target_lang='tgt', task='translation', tensorboard_logdir='', testpref='test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, trainpref='test', user_dir=None, validpref='test', workers=1)
| [src] Dictionary: 65399 types
| [src] test.src: 54667 sents, 645096 tokens, 0.0% replaced by
| [src] Dictionary: 65399 types
| [src] test.src: 54667 sents, 645096 tokens, 0.0% replaced by
| [src] Dictionary: 65399 types
| [src] test.src: 54667 sents, 645096 tokens, 0.0% replaced by
| [tgt] Dictionary: 65399 types
| [tgt] test.tgt: 54667 sents, 109334 tokens, 0.0% replaced by
| [tgt] Dictionary: 65399 types
| [tgt] test.tgt: 54667 sents, 109334 tokens, 0.0% replaced by
| [tgt] Dictionary: 65399 types
| [tgt] test.tgt: 54667 sents, 109334 tokens, 0.0% replaced by
| Wrote preprocessed data to test_bin
```

tuhinjubcse commented 3 years ago

@thompsonb any idea how to resolve this?

thompsonb commented 3 years ago

Can you share the file you are trying to paraphrase? Or, better yet, a subset of it that also fails.

tuhinjubcse commented 3 years ago

https://github.com/tuhinjubcse/tuhinjubcse.github.io/blob/master/ru.txt

This file is the input file

https://github.com/tuhinjubcse/tuhinjubcse.github.io/blob/master/test.src is the file after applying SPM

thompsonb commented 3 years ago

It's failing on blank lines in your input file.
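
For context, the `ValueError` at the bottom of the traceback is what Python's built-in `max()` raises when it is given an empty sequence. The snippet below is only a minimal illustration, not code from `generate_paraphrases.py`: `penalties` is a hypothetical stand-in, and it is an assumption that a blank input line is what leaves the real collection empty.

```python
# Hypothetical stand-in for the collection iterated on line 220 of the traceback.
# Assumption: a blank input line leaves it empty.
penalties = {}

try:
    # Same expression shape as in the traceback.
    max_prefix_len = max([len(prefix) for prefix in penalties])
except ValueError as err:
    message = str(err)
    print(message)

# max() accepts a `default` keyword that is returned for an empty iterable,
# which is one way such a call could be guarded.
safe_max_prefix_len = max((len(prefix) for prefix in penalties), default=0)
print(safe_max_prefix_len)
```

Whether a default of 0 would be the right behaviour inside the script is not established in this thread; removing the blank lines from the input avoids the question entirely.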

tuhinjubcse commented 3 years ago

I can't see a blank line. Do you know which line number? Or do you mean sentences that have a space at the beginning?

thompsonb commented 3 years ago
# Report every blank (whitespace-only) line; ii is a 0-based index.
with open('ru.txt', encoding='utf-8') as f:
    for ii, line in enumerate(f):
        if len(line.strip()) == 0:
            print(f'line {ii} is blank')

line 52672 is blank
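
One possible workaround, sketched here under assumptions rather than taken from the prism repository, is to drop blank lines before preprocessing and remember where they were, so the outputs can be put back in the original order afterwards. The helper names `split_blank_lines` and `restore_blank_lines` are hypothetical.

```python
from typing import List, Tuple


def split_blank_lines(lines: List[str]) -> Tuple[List[str], List[int]]:
    """Return the non-blank lines and their 0-based positions in the input."""
    kept, positions = [], []
    for ii, line in enumerate(lines):
        if line.strip():  # whitespace-only lines count as blank
            kept.append(line.rstrip("\n"))
            positions.append(ii)
    return kept, positions


def restore_blank_lines(outputs: List[str], positions: List[int], total: int) -> List[str]:
    """Place each output at its original position; blank lines stay empty."""
    restored = [""] * total
    for pos, out in zip(positions, outputs):
        restored[pos] = out
    return restored


# Toy input standing in for the real file; upper-casing stands in for paraphrasing.
example = ["first line\n", "\n", "third line\n", "   \n"]
kept, positions = split_blank_lines(example)
restored = restore_blank_lines([s.upper() for s in kept], positions, len(example))
print(kept)       # ['first line', 'third line']
print(positions)  # [0, 2]
print(restored)   # ['FIRST LINE', '', 'THIRD LINE', '']
```

Keeping the positions means the paraphrase output still lines up one-to-one with the original input file.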