pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

arr = [[self.vocab.stoi[x] for x in ex] for ex in arr] KeyError: None #592

Closed wpfnlp closed 5 years ago

wpfnlp commented 5 years ago

torchtext=0.4.0 BUG:

Traceback (most recent call last):
  File "/Users/weipengfei/workspaces/FastNLPProjects/research01/Intent+SlotFilling01.py", line 112, in <module>
    for i, batch in enumerate(train_iter):
  File "/miniconda3/lib/python3.7/site-packages/torchtext/data/iterator.py", line 156, in __iter__
    yield Batch(minibatch, self.dataset, self.device)
  File "/miniconda3/lib/python3.7/site-packages/torchtext/data/batch.py", line 34, in __init__
    setattr(self, name, field.process(batch, device=device))
  File "/miniconda3/lib/python3.7/site-packages/torchtext/data/field.py", line 237, in process
    tensor = self.numericalize(padded, device=device)
  File "/miniconda3/lib/python3.7/site-packages/torchtext/data/field.py", line 336, in numericalize
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
  File "/miniconda3/lib/python3.7/site-packages/torchtext/data/field.py", line 336, in <listcomp>
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
  File "/miniconda3/lib/python3.7/site-packages/torchtext/data/field.py", line 336, in <listcomp>
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
KeyError: None

The same code runs without problems on torchtext=0.3.1. Could you tell me what causes this? Thank you.

zhangguanheng66 commented 5 years ago

Can you post your script so I can reproduce the case?

zhangguanheng66 commented 5 years ago

Feel free to re-open the issue if you still have questions.

TinaChen95 commented 5 years ago

I came across the same issue. It only happens when I define my own unk_token and set min_freq > 1 at the same time.

Here's the code I use:

from torchtext import data, datasets

SRC = data.Field(lower=True, unk_token="my_unk_token")

TGT = data.Field(lower=True)

train, val, test = datasets.IWSLT.splits(exts=('.de', '.en'), fields=(SRC, TGT))

# min_freq > 1 drops rare tokens from the source vocabulary.
SRC.build_vocab(train, min_freq=10)

train_iter = data.BucketIterator(dataset=train, batch_size=64, sort_key=lambda x: data.interleave_keys(len(x.src), len(x.trg)))

# The KeyError is raised here, when the first batch is built.
batch = next(iter(train_iter))

VP-0822 commented 4 years ago

I am still getting this issue. As @TinaChen95 mentioned, setting min_freq to 1 works fine. When min_freq > 2, build_vocab(..) builds the vocab as per min_freq, but a KeyError is thrown while iterating over the BucketIterator.
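A minimal sketch of why min_freq makes the difference, using only the standard library rather than torchtext itself. The helpers build_stoi and numericalize are hypothetical stand-ins, not torchtext's API: tokens seen fewer than min_freq times are left out of the mapping, so a lookup with no unknown-token fallback fails as soon as a batch contains one of them.

```python
from collections import Counter

def build_stoi(sentences, min_freq=1):
    """Hypothetical stand-in for vocab building: keep tokens seen >= min_freq times."""
    counts = Counter(tok for sent in sentences for tok in sent)
    kept = sorted(tok for tok, n in counts.items() if n >= min_freq)
    return {tok: idx for idx, tok in enumerate(kept)}

def numericalize(sentences, stoi):
    """Same shape as the failing line in the traceback: a plain lookup, no fallback."""
    return [[stoi[tok] for tok in sent] for sent in sentences]

train = [["a", "b", "a"], ["a", "c"]]

# min_freq=1 keeps every training token, so every lookup succeeds.
ids = numericalize(train, build_stoi(train, min_freq=1))

# min_freq=2 drops "b" and "c"; the same data now hits a missing key.
try:
    numericalize(train, build_stoi(train, min_freq=2))
    failed = False
except KeyError:
    failed = True
```

With min_freq=1 nothing from the training set is ever out of vocabulary, which would explain why the error only shows up at higher thresholds.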

VP-0822 commented 4 years ago

At least for the issue I am facing, I figured out that unk_token needs to be passed to the ReversibleField constructor even if you want to use the default unk_token. That is because ReversibleField uses ' UNK ' as unk_token, while Vocab uses '<unk>' as unk_token. Since bug #706 is still open, customization is not possible at the moment.
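The mismatch described above can be sketched without torchtext. This is only an illustration, under the assumption that the vocabulary installs an unknown-token fallback only when its own hard-coded unknown token appears among the special tokens. The names make_stoi and DEFAULT_UNK are hypothetical, not the library's code.

```python
from collections import defaultdict

DEFAULT_UNK = "<unk>"  # assumed hard-coded unknown token on the vocab side

def make_stoi(tokens, specials):
    """Hypothetical sketch: fall back to the unk index only if DEFAULT_UNK is a special."""
    itos = list(specials) + [t for t in tokens if t not in specials]
    if DEFAULT_UNK in specials:
        unk_index = specials.index(DEFAULT_UNK)
        stoi = defaultdict(lambda: unk_index)
    else:
        stoi = defaultdict()  # no default factory: missing keys raise KeyError
    stoi.update({tok: i for i, tok in enumerate(itos)})
    return stoi

# Field and vocab agree on the unknown token: unseen words map to its index.
ok = make_stoi(["hello"], specials=["<unk>", "<pad>"])
unseen_id = ok["never_seen"]

# Field uses a different unknown token (as with ' UNK '): no fallback is installed.
bad = make_stoi(["hello"], specials=[" UNK ", "<pad>"])
try:
    bad["never_seen"]
    raised = False
except KeyError:
    raised = True
```

Under that assumption, passing unk_token='<unk>' explicitly to the field keeps both sides in agreement, which is consistent with the workaround of always passing unk_token to the constructor.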