nlp-uoregon / trankit

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Apache License 2.0
724 stars 99 forks

em dash character crashes French pipeline #42

Open pa-nlp opened 2 years ago

pa-nlp commented 2 years ago

I tested trankit with the base and large models using the French pipeline, and the em dash (Unicode code point 8212, i.e. U+2014) causes the model to crash. The online demo seems to have the same problem. A quick replace on the input string that changes the em dash to a hyphen avoids the issue. I did not test the three other types of dashes, nor other languages.
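For reference, the workaround I mean looks roughly like the sketch below. It is only plain string preprocessing and does not touch trankit itself; the helper name `replace_em_dash` and the sample sentence are made up for illustration.

```python
# Workaround sketch: swap the em dash for a plain hyphen before the text
# reaches the pipeline. `replace_em_dash` is a hypothetical helper name,
# not part of trankit.

EM_DASH = "\u2014"  # Unicode code point 8212


def replace_em_dash(text: str, replacement: str = "-") -> str:
    """Return `text` with every em dash replaced by `replacement`."""
    return text.replace(EM_DASH, replacement)


# Illustrative French input containing two em dashes.
sample = "Il est venu \u2014 enfin \u2014 hier."
cleaned = replace_em_dash(sample)
print(cleaned)
```

The cleaned string is what I then pass to the French pipeline instead of the raw input. This only hides the crash on the caller's side; the other dash characters would need the same treatment if they turn out to be affected too.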