rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

AttributeError: 'BPE' object has no attribute 'glossaries_regex' #120

Open zwshan opened 7 months ago

zwshan commented 7 months ago

I am running the GNMT PyTorch example from https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Translation/GNMT. When I run

python3 translate.py   --model /workspace/autoFL/nvidia_gnmt_torch/nvidia_gnmtpyt_fp32_20190806.pth   --input /workspace/autoFL/GNMT/scripts/data/wmt16_de_en/newstest2014.en   --reference /workspace/autoFL/GNMT/scripts/data/wmt16_de_en/newstest2014.de   --output /tmp/output   --math fp32    --batch-size 128   --beam-size 1 2 5   --tables

I get the following error:

0: thread affinity: {0}
0: Run arguments: Namespace(affinity='single_unique', batch_first=True, batch_size=[128], beam_size=[1, 2, 5], bleu=True, cov_penalty_factor=0.1, cuda=True, cudnn=True, dllog_file='eval_log.json', env=False, input='/workspace/autoFL/GNMT/scripts/data/wmt16_de_en/newstest2014.en', input_text=None, len_norm_const=5.0, len_norm_factor=0.6, local_rank=0, math=['fp32'], max_seq_len=80, model='/workspace/autoFL/nvidia_gnmt_torch/nvidia_gnmtpyt_fp32_20190806.pth', output='/tmp/output', percentiles=(90, 95, 99), print_freq=1, rank=0, reference='/workspace/autoFL/GNMT/scripts/data/wmt16_de_en/newstest2014.de', repeat={128: 1}, save_dir='gnmt', sort=False, synthetic=False, synthetic_batches=64, synthetic_len=50, synthetic_vocab=32320, tables=True, target_bleu=None, target_perf=None, warmup=0)
0: Restoring state of the tokenizer
0: math: fp32, batch size: 128, beam size: 1
0: Running evaluation on test set
Traceback (most recent call last):
  File "translate.py", line 371, in <module>
    passed = main()
  File "translate.py", line 315, in main
    reference_path=args.reference,
  File "/workspace/autoFL/GNMT/seq2seq/inference/translator.py", line 123, in run
    warmup, summary)
  File "/workspace/autoFL/GNMT/seq2seq/inference/translator.py", line 184, in evaluate
    for i, (src, indices) in enumerate(loader):
  File "/root/anaconda3/envs/bonito/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/root/anaconda3/envs/bonito/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/root/anaconda3/envs/bonito/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/anaconda3/envs/bonito/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/workspace/autoFL/GNMT/seq2seq/data/dataset.py", line 158, in __getitem__
    tokenized = self.tokenizer.tokenize(raw)
  File "/workspace/autoFL/GNMT/seq2seq/data/tokenizer.py", line 136, in tokenize
    bpe = self.bpe.process_line(tokenized)
  File "/root/anaconda3/envs/bonito/lib/python3.7/site-packages/subword_nmt/apply_bpe.py", line 122, in process_line
    out += self.segment(line, dropout)
  File "/root/anaconda3/envs/bonito/lib/python3.7/site-packages/subword_nmt/apply_bpe.py", line 132, in segment
    segments = self.segment_tokens(sentence.strip('\r\n ').split(' '), dropout)
  File "/root/anaconda3/envs/bonito/lib/python3.7/site-packages/subword_nmt/apply_bpe.py", line 142, in segment_tokens
    new_word = [out for segment in self._isolate_glossaries(word)
  File "/root/anaconda3/envs/bonito/lib/python3.7/site-packages/subword_nmt/apply_bpe.py", line 150, in <listcomp>
    self.glossaries_regex,
AttributeError: 'BPE' object has no attribute 'glossaries_regex'

Could you please help me?

rsennrich commented 6 months ago

This is likely a version conflict.

GNMT lists commit 48ba99e657591c329e0003f0c6e32e493fa959ef in its requirements.txt (https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Translation/GNMT/requirements.txt), and that commit does not yet have glossaries_regex.

I think GNMT saves the tokenizer together with the model (using commit 48ba99e657591c329e0003f0c6e32e493fa959ef), and you are then running inference with a newer version of subword_nmt, which expects attributes that the saved object does not have. Installing commit 48ba99e657591c329e0003f0c6e32e493fa959ef should solve this.
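For anyone hitting the same error, here is a minimal sketch of the mechanism described above. It assumes the checkpoint stores the BPE object through Python pickling, which torch.save uses by default. The two BPE classes below are hypothetical stand-ins, not the real subword_nmt code. They only show that unpickling restores the attributes that were saved and does not run the newer __init__.

```python
import pickle


class BPE:
    """Stand-in for the older class (hypothetical; not subword_nmt's real code)."""

    def __init__(self, glossaries=None):
        self.glossaries = glossaries or []


# "Training time": pickle an instance built by the older class definition.
saved = pickle.dumps(BPE(glossaries=["USA"]))


class BPE:  # deliberately shadows the class above to mimic a library upgrade
    """Stand-in for a newer class whose __init__ sets an extra attribute."""

    def __init__(self, glossaries=None):
        self.glossaries = glossaries or []
        self.glossaries_regex = "|".join(self.glossaries)

    def segment(self, word):
        # Newer code paths assume the attribute always exists.
        return (self.glossaries_regex, word)


# "Inference time": unpickling restores the saved __dict__ without calling
# __init__, so the attribute added by the newer __init__ is never created.
restored = pickle.loads(saved)

try:
    restored.segment("hello")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

If this is what is happening, the installed subword_nmt has to match the version used when the checkpoint was written. With pip, a specific commit can normally be installed from the repository using the git+https://github.com/rsennrich/subword-nmt.git@48ba99e657591c329e0003f0c6e32e493fa959ef form. Whether that works unchanged in the environment above is untested.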