huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Mismatch in pre-processed wikitext-103 corpus and using pre-trained tokenizer for TransfoXLLMHeadModel #466

Closed Akhila-Yerukola closed 5 years ago

Akhila-Yerukola commented 5 years ago

In `examples/run_transfo_xl.py`, the pre-processed wikitext-103 corpus is loaded using:

`corpus = TransfoXLCorpus.from_pretrained(args.model_name)`

Example of a pre-processed batch converted back to tokens:

['', '=', 'Homarus', 'gammarus', '=', '', '', 'Homarus', 'gammarus', ',', 'known', 'as', 'the', 'European', 'lobster', 'or', 'common', 'lobster', ',', 'is', 'a', 'species', 'of', 'clawed', 'lobster', 'from', 'the', 'eastern', 'Atlantic', 'Ocean', ',', 'Mediterranean', 'Sea', 'and', 'parts', 'of', 'the', 'Black', 'Sea', '.', 'It', 'is', 'closely', 'related', 'to', 'the', 'American', 'lobster', ',', 'H.', 'americanus', '.', 'It', 'may', 'grow', 'to', 'a', 'length', 'of', '60']

Evaluating TransfoXLLMHeadModel on this corpus gives a perplexity (ppl) of ~18.
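For reference, here is roughly how the pre-processed corpus can be decoded back to tokens (a sketch, assuming the `pytorch_pretrained_bert`-era API used by `examples/run_transfo_xl.py` at the time; `corpus.vocab.convert_ids_to_tokens` is my assumption about what the vocab object exposes):

```python
# Sketch: decode the first ids of the pre-processed test split back to tokens.
# Assumes the pytorch_pretrained_bert-era API; the corpus download is large.
from pytorch_pretrained_bert import TransfoXLCorpus

corpus = TransfoXLCorpus.from_pretrained('transfo-xl-wt103')

# corpus.test is a flat LongTensor of vocabulary ids; corpus.vocab is the
# Transformer-XL vocabulary object (assumed to expose convert_ids_to_tokens).
ids = corpus.test[:60].tolist()
print(corpus.vocab.convert_ids_to_tokens(ids))
```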

However, when I tokenize wikitext-103 with the pre-trained tokenizer, loaded via `tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')`, there is a mismatch in the tokenizations.

Example of using the pre-trained tokenizer on the same text:

['', '=', 'Homarus', 'gammarus', '=', '', '', 'Homarus', 'gammarus', ',', 'known', 'as', 'the', 'European', 'lobster', 'or', 'common', 'lobster', ',', 'is', 'a', 'species', 'of', 'clawed', 'lobster', 'from', 'the', 'eastern', 'Atlantic', 'Ocean', ',', 'Mediterranean', 'Sea', 'and', 'parts', 'of', 'the', 'Black', 'Sea', '.', 'It', 'is', 'closely', 'related', 'to', 'the', 'American', 'lobster', ',', 'H', '.', 'americanus', '.', 'It', 'may', 'grow', 'to', 'a', 'length', 'of']

Here, `H.` is split into two tokens, whereas the pre-processed corpus keeps it as a single token. Evaluating TransfoXLLMHeadModel on this version of the corpus gives a ppl of ~29.
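The mismatch can be reproduced directly from the tokenizer (a sketch, under the same assumed API as above):

```python
# Sketch: tokenize the sentence from the batch above with the pre-trained
# tokenizer and compare against the pre-processed corpus.
from pytorch_pretrained_bert import TransfoXLTokenizer

tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')

# At the time of this issue, 'H.' came out as two tokens ('H', '.'),
# whereas the pre-processed corpus keeps 'H.' as a single symbol.
print(tokenizer.tokenize('It is closely related to the American lobster , H. americanus .'))
```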

Could you please help me understand why there is a mismatch?

thomwolf commented 5 years ago

You are right, thanks for catching this.

This behavior is inherited from the BERT tokenizer, but the Transformer-XL tokenizer should behave differently: the line in our implementation that splits on punctuation is not present in the original Transformer-XL tokenizer.

I'll check that there are no other differences, add a test for this, and fix it in the next release.
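To illustrate the difference (a toy sketch with hypothetical helper names, not the library code): the original Transformer-XL vocabulary splits on whitespace only, while the inherited BERT-style step additionally splits punctuation into separate tokens.

```python
import re

def split_whitespace_only(line):
    # Original Transformer-XL behavior: whitespace tokenization keeps 'H.' intact.
    return line.strip().split()

def split_on_punctuation(line):
    # BERT-style extra step: punctuation becomes its own token, so 'H.' -> 'H', '.'.
    return re.findall(r"\w+|[^\w\s]", line.strip())

sample = "H. americanus"
print(split_whitespace_only(sample))  # ['H.', 'americanus']
print(split_on_punctuation(sample))   # ['H', '.', 'americanus']
```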

SeunghyunSEO commented 2 years ago

@thomwolf Hello, first of all, thank you very much for the library. I'm reproducing the Transformer-XL benchmark performance using the Hugging Face tokenizer, but I think there is still a mismatch. I just want to ask whether this has been solved. I would very much appreciate a reply :)