utterworks / fast-bert

Super easy library for BERT based NLP models
Apache License 2.0

use_fast=True not working after upgrade to transformers v2.10.0 #222

Closed · lingdoc closed this issue 4 years ago

lingdoc commented 4 years ago

After upgrading to transformers==2.10.0, the tokenizer's vocabulary file is not saved after training: a TypeError is raised when trying to save the tokenizer (i.e. on calling data.tokenizer.save_pretrained(path) in learner_util.py).

I've traced this to line 367 in data_cls.py: https://github.com/kaushaltrivedi/fast-bert/blob/77f09adc7bc2706e0c7e3b8cdd09cb6ddd66ae28/fast_bert/data_cls.py#L367

If I comment out the use_fast argument, the tokenizer file is saved correctly, i.e.: tokenizer = AutoTokenizer.from_pretrained(tokenizer)#, use_fast=True)
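
A minimal sketch of the behaviour described above, assuming a local output directory and bert-base-uncased as the model (both are placeholders, not taken from the fast-bert config):

```python
from pathlib import Path
from transformers import AutoTokenizer

path = Path("output/model_out")          # placeholder output directory
path.mkdir(parents=True, exist_ok=True)

# Fast tokenizer, as instantiated on the linked line of data_cls.py
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
tokenizer.save_pretrained(path)          # raises TypeError under transformers==2.10.0

# Workaround from this comment: drop use_fast so the slow Python tokenizer is used
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained(path)          # vocabulary file is written as expected
```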

lingdoc commented 4 years ago

I figured out that the path used when saving the tokenizer in learner_util has to be explicitly cast to str in order not to throw an error. https://github.com/kaushaltrivedi/fast-bert/blob/77f09adc7bc2706e0c7e3b8cdd09cb6ddd66ae28/fast_bert/learner_util.py#L130-L131 Oddly, the regular tokenizer can handle the Path object produced by Path() but the fast version can't, so I have opened a bug report (https://github.com/huggingface/transformers/issues/4541) to see how they want to deal with it in Transformers.
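
A minimal sketch of that workaround, written standalone rather than inside fast-bert's learner class (the model name and directory are placeholders):

```python
from pathlib import Path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# fast-bert builds the output path with pathlib in learner_util.py
path = Path("output/model_out")
path.mkdir(parents=True, exist_ok=True)

# Explicitly casting the Path to str avoids the TypeError raised by the fast tokenizer
tokenizer.save_pretrained(str(path))
```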

SC4RECOIN commented 4 years ago

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=True) seems to be working fine for me with transformers==2.11.0

utterworks commented 4 years ago

The issue is an incompatibility between the tokenizers package and transformers. I have downgraded tokenizers to 0.7, which seems to work. v1.8.1 has fixed this, so please try it out and close this issue if it works.
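
A quick way to confirm which versions are installed before retrying (assuming the tokenizers downgrade mentioned above; the exact compatible transformers version is not stated beyond tokenizers 0.7):

```python
# Sanity-check installed versions before retrying training;
# the comment above pins tokenizers to 0.7 to work around the incompatibility.
import tokenizers
import transformers

print("transformers:", transformers.__version__)
print("tokenizers:", tokenizers.__version__)
```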

lingdoc commented 4 years ago

Ok, finally managed to try it out, and it works nicely.