Closed Akhila-Yerukola closed 5 years ago
You are right, thanks for catching this.
This behavior is inherited from the BERT tokenizer, but the TransformerXL tokenizer should behave differently (this line, which splits on punctuation, is not present in the original Transformer XL tokenizer here).
I'll check that there are no other differences, add a test for this, and fix it in the next release.
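To illustrate the difference being described: a minimal sketch (not the library's actual code) contrasting a BERT-style tokenizer, which splits punctuation off into its own tokens, with a Transformer-XL-style whitespace tokenizer, which leaves tokens like `H.` intact. The helper names are hypothetical:

```python
import re

def whitespace_tokenize(text):
    # Transformer-XL style: split on whitespace only, so "H." stays one token
    return text.split()

def punct_split_tokenize(text):
    # BERT style: additionally split off each punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokenize("H. gammarus"))   # ['H.', 'gammarus']
print(punct_split_tokenize("H. gammarus"))  # ['H', '.', 'gammarus']
```

Because the pre-trained wikitext-103 vocabulary contains `H.` as a single entry, the punctuation split produces out-of-vocabulary-style fragments instead.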
@thomwolf Hello thomwolf, first of all thank you very much for your library. I'm reproducing the Transformer-XL benchmark performance using the Hugging Face tokenizer, but I think there is still a mismatch. I just want to ask whether it has been solved. I would very much appreciate a reply :)
In `examples/run_transfo_xl.py`, the pre-processed wikitext-103 corpus is loaded using `corpus = TransfoXLCorpus.from_pretrained(args.model_name)`.

Example of pre-processed batch converted to tokens:

Evaluating the `TransfoXLLMHeadModel` model on this corpus gives a ppl of ~18.
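For reference on how such a ppl figure relates to the model's loss: perplexity is the exponential of the mean per-token cross-entropy. A minimal sketch with hypothetical per-token loss values (not numbers from the actual evaluation):

```python
import math

def perplexity(token_nlls):
    # ppl = exp(mean negative log-likelihood per token)
    return math.exp(sum(token_nlls) / len(token_nlls))

losses = [2.9, 2.8, 3.0]  # hypothetical per-token losses
print(round(perplexity(losses), 2))  # ≈ 18.17
```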
However, when I use the pre-trained `TransfoXLTokenizer` for wikitext-103 via `tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')`, there is a mismatch in the tokenizations.

Example of using the pre-trained tokenizer to tokenize wikitext-103:
Here, `H.` is being split, whereas the pre-processed version has it as a single token. Evaluating the `TransfoXLLMHeadModel` model on this version of the corpus gives a ppl of ~29.

Could you please help me understand why there is a mismatch?
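One way to locate exactly where the two tokenizations diverge is to walk both token streams in parallel and report the first position where they differ. This is a hypothetical debugging sketch, with made-up token lists standing in for the corpus and tokenizer outputs:

```python
def first_mismatch(ref_tokens, hyp_tokens):
    # Return (index, reference token, hypothesis token) at the first
    # position where the two token streams diverge, or None if aligned.
    for i, (r, h) in enumerate(zip(ref_tokens, hyp_tokens)):
        if r != h:
            return i, r, h
    return None

ref = ["known", "as", "H.", "gammarus"]       # pre-processed corpus tokens
hyp = ["known", "as", "H", ".", "gammarus"]   # pre-trained tokenizer output
print(first_mismatch(ref, hyp))  # (2, 'H.', 'H')
```

Running this over the full corpus would show whether the punctuation split on abbreviations like `H.` is the only source of the ppl gap or one of several.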