I figured out that the path used when saving the tokenizer in learner_util.py has to be explicitly cast to str in order not to throw an error.
https://github.com/kaushaltrivedi/fast-bert/blob/77f09adc7bc2706e0c7e3b8cdd09cb6ddd66ae28/fast_bert/learner_util.py#L130-L131
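For reference, the workaround boils down to casting before the save call. A minimal, self-contained sketch (the model name and output directory here are illustrative, not taken from fast-bert):

```python
from pathlib import Path
from transformers import AutoTokenizer

output_dir = Path("output") / "tokenizer"  # a pathlib.Path, as built in learner_util.py
output_dir.mkdir(parents=True, exist_ok=True)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=True)

# The fast (Rust-backed) tokenizer expects a plain string path when saving,
# so cast the Path explicitly to avoid the TypeError.
tokenizer.save_pretrained(str(output_dir))
```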
Odd that the regular tokenizer can handle the Path object returned by Path() but the fast version can't, so I have a bug report open (https://github.com/huggingface/transformers/issues/4541) to see how they want to deal with it in Transformers.
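A minimal repro of that asymmetry, assuming the transformers==2.10.0 behavior described above (on later versions both calls may succeed):

```python
from pathlib import Path
from transformers import AutoTokenizer

path = Path("tmp_tokenizer")
path.mkdir(exist_ok=True)

slow = AutoTokenizer.from_pretrained("distilbert-base-uncased")                 # Python implementation
fast = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=True)  # Rust-backed implementation

slow.save_pretrained(path)      # accepts the Path object
try:
    fast.save_pretrained(path)  # raised a TypeError on transformers==2.10.0
except TypeError as exc:
    print(f"fast tokenizer rejected the Path object: {exc}")
```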
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=True) seems to be working fine for me with transformers==2.11.0.
The issue is an incompatibility between tokenizers and transformers. I have downgraded tokenizers to 0.7, which seems to work. v1.8.1 has fixed this; please try it out and close this issue if it works.
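If it helps, a quick sanity check that the pinned versions are the ones actually loaded in the environment (expected values taken from this thread):

```python
import tokenizers
import transformers

print(tokenizers.__version__)    # expect 0.7.x after the downgrade
print(transformers.__version__)  # e.g. 2.10.0 or 2.11.0
```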
Ok, finally managed to try it out, and it works nicely.
On upgrading to transformers==2.10.0, when instantiating a tokenizer, the vocabulary file is not saved after training. A TypeError is raised when trying to save the tokenizer after training (i.e. on calling data.tokenizer.save_pretrained(path) in learner_util.py).
I've traced this to line 367 in data_cls.py:
https://github.com/kaushaltrivedi/fast-bert/blob/77f09adc7bc2706e0c7e3b8cdd09cb6ddd66ae28/fast_bert/data_cls.py#L367
If I comment out the use_fast argument, the tokenizer file can be saved correctly, i.e.:
tokenizer = AutoTokenizer.from_pretrained(tokenizer)  # , use_fast=True)
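To spell that workaround out, a simplified sketch of the change at the linked line (`tokenizer_name` here is a placeholder for the model name or path that data_cls.py actually passes in):

```python
from transformers import AutoTokenizer

tokenizer_name = "bert-base-uncased"  # placeholder for the value used in data_cls.py

# Original (fails on save): use_fast=True returns a Rust-backed tokenizer
# whose save path handling requires a str, triggering the TypeError above.
# tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, use_fast=True)

# Workaround: drop use_fast to get the Python tokenizer, which accepts
# pathlib.Path objects in save_pretrained.
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
```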