nlp-uoregon / trankit

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Apache License 2.0

Inconsistent tokenisation in French #68

Closed · JakobMichiels closed this issue 10 months ago

JakobMichiels commented 1 year ago

While tokenising a corpus of French text files, I found that four texts on heart failure posed a problem for the tokeniser. It ignored typical token boundaries, such as spaces, and treated single letters and word fragments as tokens. In the example below, the first words are tokenised correctly, but the quality deteriorates severely after that:

INSUFFISANCE CARDIAQUE D’ ORIGINE NON INFECTIEUSE EN ZONE TROPICALE : APPROCHE ÉTIOLOGIQU E E T PRINCIPE S THÉRAPEUTIQU ES L es maladi es cardiovasculair es constitue nt un gra ve proble ̀m e de sa nté publ i qu

The tokeniser even goes as far as to treat a long stretch of words as a single token at the end of the text:

systè mes de so ins condu i sent à une prise en charge souvent retardée.

Do you have any ideas about why this is happening to these files in particular, and whether it can be solved? As far as I'm aware, the problem does not occur in Dutch or English, and it is not present in French texts on other topics.

AylaRT commented 1 year ago

Interesting and serious problem. I would also be very interested in seeing a solution, or at least an explanation.

L es maladi es cardiovasculair es constitue nt un gra ve proble ̀m e de sa nté publ i que

should be

Les maladies cardiovasculaires constituent un grave problème de santé publique

minhhdvn commented 10 months ago

Hi @JakobMichiels and @AylaRT,

Our tokenizer is a neural model trained on a text corpus, so the tokenization results may vary depending on the context, especially in the presence of noise. The issue may be resolved if the input text is cleaned before handing it to Trankit for further processing, for example by reducing long runs of spaces to single spaces.
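For illustration, a minimal preprocessing sketch along those lines, using the standard trankit Pipeline API; the clean_text helper and the file name are hypothetical, and the whitespace rule is just one possible cleanup:

```python
import re

from trankit import Pipeline


def clean_text(text: str) -> str:
    """Hypothetical cleanup helper: collapse runs of whitespace
    (spaces, tabs, stray line breaks from PDF extraction) into
    single spaces before tokenization."""
    return re.sub(r"\s+", " ", text).strip()


# Initialize a French pipeline (downloads the model on first use).
p = Pipeline("french")

# Illustrative file name; substitute one of the problematic texts.
with open("heart_failure_text.txt", encoding="utf-8") as f:
    raw = f.read()

# Tokenize the cleaned text instead of the raw file contents.
tokenized = p.tokenize(clean_text(raw))

for sentence in tokenized["sentences"]:
    print([token["text"] for token in sentence["tokens"]])
```

Note that collapsing all whitespace also removes newlines, so sentence segmentation then relies entirely on the model; if your files use blank lines to separate paragraphs, you may want to preserve those and clean each paragraph separately.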