harvardnlp / annotated-transformer

An annotated implementation of the Transformer paper.
http://nlp.seas.harvard.edu/annotated-transformer
MIT License

Problem with the vocabulary of iwslt.pt #42

Closed usertomlin closed 2 years ago

usertomlin commented 5 years ago

There's a problem with the vocabulary of https://s3.amazonaws.com/opennmt-models/iwslt.pt: after loading the model with model = torch.load("iwslt.pt"), the size of the English (target) vocabulary turns out to be 36321

(iwslt.pt: size 36321; built from datasets.IWSLT: 36327)

(0): Embeddings(
  (lut): Embedding(36321, 512)
)
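
For reference, the embedding sizes can be read directly off the loaded checkpoint. A minimal sketch, assuming torch.load returns the full model object as in the notebook:

    import torch

    model = torch.load("iwslt.pt", map_location="cpu")
    # Print the shape of every embedding lookup table ("lut") in the model.
    for name, p in model.named_parameters():
        if "lut" in name:
            print(name, tuple(p.shape))  # e.g. tgt_embed.0.lut.weight (36321, 512)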

However, after building TGT.vocab with

    # torchtext's legacy datasets/fields; SRC and TGT are the Fields
    # defined earlier in the notebook
    from torchtext import data, datasets

    MAX_LEN = 100
    train, val, test = datasets.IWSLT.splits(
        exts=('.de', '.en'), fields=(SRC, TGT),
        filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and
            len(vars(x)['trg']) <= MAX_LEN)
    MIN_FREQ = 2
    SRC.build_vocab(train.src, min_freq=MIN_FREQ)
    TGT.build_vocab(train.trg, min_freq=MIN_FREQ)

the resulting English vocabulary has size 36327:

print("vocab_size = ", len(TGT.vocab) )  # 36327
print("vocab_size = ", len(TGT.vocab.itos) ) # 36327
print("vocab_size = ", len(TGT.vocab.stoi) ) # 36327

The code for building the vocab is almost identical to what was presumably used for iwslt.pt. Since torchtext builds the vocabulary deterministically for a fixed dataset and min_freq (tokens sorted by frequency, ties broken alphabetically), the small size difference suggests that the datasets.IWSLT data itself has changed slightly since the checkpoint was trained.

Although the model's translations on valid_iter look largely correct, the model loaded from iwslt.pt still cannot fully work, because the vocabulary currently built from datasets.IWSLT does not match the model's vocabulary size (and hence its token-to-embedding-row mapping).
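
A quick sanity check makes the mismatch explicit. This is only a sketch: it assumes model is the full EncoderDecoder object loaded from iwslt.pt (whose generator.proj output layer has one row per target token) and TGT is the field built above.

    # Sketch: compare the checkpoint's target vocab size to the rebuilt vocab.
    tgt_rows = model.generator.proj.weight.shape[0]  # rows of the output projection
    assert tgt_rows == len(TGT.vocab), (
        "vocab mismatch: checkpoint has %d target tokens, rebuilt vocab has %d"
        % (tgt_rows, len(TGT.vocab)))  # fails here: 36321 vs 36327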

I am building a model on top of iwslt.pt, so I need its exact English vocabulary. How or where can I obtain the correct English vocabulary (size 36321) of https://s3.amazonaws.com/opennmt-models/iwslt.pt?
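
In the meantime, for anyone training their own checkpoints, a workaround sketch (the file name and dict keys here are illustrative, not from the notebook) is to persist the vocabulary alongside the weights so it can be restored exactly later:

    # Save the exact itos lists with the weights so the vocab never has to be rebuilt.
    torch.save({"model_state": model.state_dict(),
                "src_itos": SRC.vocab.itos,
                "tgt_itos": TGT.vocab.itos}, "iwslt_with_vocab.pt")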

marwash25 commented 4 years ago

Hello, did you end up solving this problem? I have the same issue, with len(TGT.vocab) = 36323.