nikitakit / self-attentive-parser

High-accuracy NLP parser with models for 11 languages.
https://parser.kitaev.io/
MIT License

RuntimeError: Already borrowed #82

Open indexxlim opened 3 years ago

indexxlim commented 3 years ago

There is currently a bug when using the fast tokenizer: if I run it from multiple threads, a `RuntimeError: Already borrowed` occurs. Could you add a `use_fast=False` option that avoids the fast tokenizer?

https://github.com/huggingface/tokenizers/issues/537
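A minimal sketch of the failure mode described in that issue (the model name and call arguments here are illustrative, not from benepar itself): a single Rust-backed fast tokenizer is shared across threads, and concurrent calls that mutate its internal state (e.g. truncation settings) can conflict on the underlying `RefCell`.

```python
from concurrent.futures import ThreadPoolExecutor
from transformers import AutoTokenizer

# One shared fast (Rust-backed) tokenizer instance.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

def encode(text):
    # Truncation/padding arguments update the shared Rust object's
    # settings on each call, which is not thread-safe.
    return tokenizer(text, truncation=True, max_length=16)

texts = ["a sentence to parse"] * 100
with ThreadPoolExecutor(max_workers=8) as pool:
    # May raise: RuntimeError: Already borrowed
    list(pool.map(encode, texts))
```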

nikitakit commented 3 years ago

`use_fast=False` is not really a viable option, because slow tokenizers don't implement `return_offsets_mapping`. Parsing operates over words, while pre-trained models use subwords with a bunch of unicode substitution/normalization rules. The parser relies on the tokenizer providing a mapping between subwords and character positions in the original string. "Slow" huggingface tokenizers don't implement this feature, and trying to reconstruct alignments after the fact is extremely error-prone due to all of the text normalization involved.
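To illustrate the difference (a sketch; the model and exact offsets are illustrative):

```python
from transformers import AutoTokenizer

fast = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
enc = fast("Hello, world!", return_offsets_mapping=True)
# Each subword is paired with its character span in the original string,
# e.g. [(0, 0), (0, 5), (5, 6), (7, 12), (12, 13), (0, 0)] for
# [CLS] hello , world ! [SEP] -- this is the mapping the parser needs.
print(enc["offset_mapping"])

slow = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
slow("Hello, world!", return_offsets_mapping=True)
# NotImplementedError: return_offset_mapping is not available when using
# Python tokenizers ...
```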

If you're using the T5-based English parsers and want a solution just for yourself, you can probably modify the tokenization code to use the original sentencepiece library instead of huggingface. But I don't plan on adding such a solution to this repository, because it's not general-purpose and only works for a limited set of pre-trained models. You could also try hacking retokenization.py to keep multiple tokenizer copies in thread-local storage, roughly as sketched below.
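A rough sketch of that last idea (the names are hypothetical, and retokenization.py would need to wire this in differently): each thread lazily creates its own fast tokenizer, so the Rust-backed object is never borrowed from two threads at once.

```python
import threading
from transformers import AutoTokenizer

_local = threading.local()

def get_tokenizer(model_name):
    # One fast tokenizer per thread: instances are never shared, so the
    # "Already borrowed" race cannot occur.
    if not hasattr(_local, "tokenizer"):
        _local.tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    return _local.tokenizer
```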