stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

English mwt #1378

Closed AngledLuffa closed 3 months ago

AngledLuffa commented 3 months ago

For languages where the words of a multi-word token (MWT) exactly make up the text of the token, build the pieces of the MWT from the text of the original token being split. This should fix many of the errors observed in https://github.com/stanfordnlp/stanza/issues/1371
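
As a rough illustration of the idea (not the code in this PR), the sketch below assumes the MWT expander has already predicted a list of pieces and reuses only their lengths to slice the original token text. The function name `rebuild_mwt_pieces` and its arguments are hypothetical and are not part of the Stanza API.

```python
def rebuild_mwt_pieces(token_text, predicted_pieces):
    """Hypothetical helper: rebuild MWT pieces from the original token text.

    Assumption: if the predicted pieces have the same total length as the
    token, they are treated as a character-for-character split of it, so
    the original characters (casing, apostrophe style) can be reused.
    """
    total_length = sum(len(piece) for piece in predicted_pieces)
    if total_length != len(token_text):
        # The pieces do not exactly cover the token (e.g. "won't" -> "will", "not"),
        # so keep the predicted pieces unchanged.
        return list(predicted_pieces)

    pieces = []
    start = 0
    for piece in predicted_pieces:
        end = start + len(piece)
        # Slice the original token instead of trusting the predicted characters.
        pieces.append(token_text[start:end])
        start = end
    return pieces


print(rebuild_mwt_pieces("DON'T", ["do", "n't"]))    # ['DO', "N'T"]
print(rebuild_mwt_pieces("won't", ["will", "not"]))  # ['will', 'not']
```

The length check here is only a stand-in for "the MWT words exactly make up the text of the token". A real implementation would more likely decide this per language or per dataset rather than per token.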