Open indexxlim opened 3 years ago
`use_fast = False` is not really a viable option, because the slow tokenizer doesn't implement `return_offsets_mapping`. Parsing operates over words, while pre-trained models use subwords with a bunch of Unicode substitution/normalization rules. The parser relies on the tokenizer providing a mapping between subwords and character positions in the original string. "Slow" Hugging Face tokenizers don't implement this feature, and trying to reconstruct alignments after the fact is extremely error-prone due to all of the text normalization involved.
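To illustrate why the offset mapping matters, here is a minimal sketch of aligning subwords to words once character offsets are available. It is not the parser's actual code: the helper names `word_spans` and `align_subwords_to_words` are hypothetical, and the offsets are hand-written for illustration rather than produced by a real tokenizer. With a fast tokenizer they would typically come from `tokenizer(text, return_offsets_mapping=True)["offset_mapping"]`; the sketch assumes special tokens show up as empty `(0, 0)` spans.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def word_spans(text: str) -> List[Span]:
    """Return character spans of the whitespace-separated words in `text`."""
    spans, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        pos = start + len(word)
        spans.append((start, pos))
    return spans


def align_subwords_to_words(words: List[Span], subwords: List[Span]) -> List[int]:
    """Map each subword span to the index of the word that contains it.

    Empty spans (assumed here to mark special tokens) map to -1, as do
    subwords that do not fall inside any single word.
    """
    alignment = []
    for s_start, s_end in subwords:
        if s_start == s_end:
            alignment.append(-1)
            continue
        idx = next(
            (i for i, (w_start, w_end) in enumerate(words)
             if w_start <= s_start and s_end <= w_end),
            -1,
        )
        alignment.append(idx)
    return alignment


text = "Parsing subwords"
# Hypothetical offsets, hand-written for this example (not real tokenizer output).
offsets = [(0, 0), (0, 4), (4, 7), (8, 11), (11, 16), (0, 0)]
alignment = align_subwords_to_words(word_spans(text), offsets)
print(alignment)
```

Without the offsets, the same alignment would have to be guessed by matching normalized subword strings against the original text, which is the error-prone step described above.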
If you're using T5-based English parsers and want a solution just for yourself, you can probably modify the tokenization code to use the original sentencepiece library instead of Hugging Face. But I don't plan on adding such a solution to this repository, because it's not general-purpose and only works for a limited set of pre-trained models. You could also try hacking `retokenization.py` to have multiple tokenizer copies in thread-local storage.
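The thread-local idea could look roughly like the sketch below. It does not reflect how `retokenization.py` is actually structured; the `ThreadLocalTokenizer` class and its `factory` argument are hypothetical names introduced for illustration. With `transformers`, the factory might be something like `lambda: AutoTokenizer.from_pretrained(name, use_fast=True)`.

```python
import threading
from typing import Any, Callable


class ThreadLocalTokenizer:
    """Lazily create one tokenizer instance per thread.

    `factory` is any zero-argument callable that builds a fresh tokenizer.
    """

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._local = threading.local()

    def get(self) -> Any:
        # Each thread sees its own attributes on the threading.local object,
        # so the tokenizer built here is never shared across threads.
        tokenizer = getattr(self._local, "tokenizer", None)
        if tokenizer is None:
            tokenizer = self._factory()
            self._local.tokenizer = tokenizer
        return tokenizer
```

Each worker thread would then call `get()` instead of touching a shared tokenizer, at the cost of loading one copy per thread.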
There is currently a bug when using the fast tokenizer: it occurs when I run the parser from multiple threads. Could you add a `use_fast = False` option so that the fast tokenizer is not used?
https://github.com/huggingface/tokenizers/issues/537