nlp-uoregon / trankit

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Apache License 2.0

Inconsistent tokenisation in French #68

Closed · JakobMichiels closed this issue 10 months ago

JakobMichiels commented 1 year ago

While tokenising a corpus of French text files, I found that four texts on heart failure posed a problem for the tokeniser. It ignored typical token boundaries, such as spaces, and treated single letters and word fragments as tokens. In the example below, the first words are tokenised correctly, but the quality deteriorates severely after that:

INSUFFISANCE CARDIAQUE D’ ORIGINE NON INFECTIEUSE EN ZONE TROPICALE : APPROCHE ÉTIOLOGIQU E E T PRINCIPE S THÉRAPEUTIQU ES L es maladi es cardiovasculair es constitue nt un gra ve proble ̀m e de sa nté publ i qu

The tokeniser even goes as far as to treat a long stretch of words as a single token at the end of the text:

systè mes de so ins condu i sent à une prise en charge souvent retardée.

Do you have any ideas about why this is happening to these files in particular, and whether it can be solved? As far as I'm aware, the problem does not occur in Dutch or English, and it is not present in French texts on other topics.

AylaRT commented 1 year ago

Interesting and serious problem. I would also be very interested in seeing a solution, or at least an explanation.

L es maladi es cardiovasculair es constitue nt un gra ve proble ̀m e de sa nté publ i que

should be

Les maladies cardiovasculaires constituent un grave problème de santé publique

minhhdvn commented 10 months ago

Hi @JakobMichiels and @AylaRT,

Our tokenizer is a neural model trained on a text corpus, so the tokenization results may vary depending on the context, especially in the presence of noise. The issue may be resolved if the input text is cleaned before handing it to Trankit for further processing, for example by reducing long runs of spaces to single spaces.
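For illustration, a minimal preprocessing sketch along those lines, using the standard trankit Pipeline API; the clean_text helper and the file name are hypothetical, and the whitespace rule is just one possible cleanup:

```python
import re

from trankit import Pipeline


def clean_text(text: str) -> str:
    """Hypothetical cleanup helper: collapse runs of whitespace
    (spaces, tabs, stray line breaks from PDF extraction) into
    single spaces before tokenization."""
    return re.sub(r"\s+", " ", text).strip()


# Initialize a French pipeline (downloads the model on first use).
p = Pipeline("french")

# Illustrative file name; substitute one of the problematic texts.
with open("heart_failure_text.txt", encoding="utf-8") as f:
    raw = f.read()

# Tokenize the cleaned text instead of the raw file contents.
tokenized = p.tokenize(clean_text(raw))

for sentence in tokenized["sentences"]:
    print([token["text"] for token in sentence["tokens"]])
```

Note that collapsing all whitespace also removes newlines, so sentence segmentation then relies entirely on the model; if your files use blank lines to separate paragraphs, you may want to preserve those and clean each paragraph separately.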