nltk / nltk_data

NLTK Data

Some French stopwords are wrong (punkt) #206

Open sylvan-ermit opened 7 months ago

sylvan-ermit commented 7 months ago

First, there are a lot of old or literary conjugations of the auxiliary verbs. That is a lot of computation for words rarely used in modern French. But the real problem is that some entries are wrong. The word "été" is indeed the past participle of "être", but it is also the noun "summer", so you probably don't want it as a stopword. As a past participle, "été" is invariable, so "étée" and "étées" do not exist, and "étés" exists only as the plural of "summer". It is almost the same for the present participle "étant": it is invariable, but it is also used as an adjective and as a noun in philosophy, so either the form does not exist or you don't want to delete it. "as" and "fut" are nouns too.

Edit: I forgot some other polysemous entries: "suis", "est", "sommes" and "avions".
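
As a workaround until the list itself is fixed, the entries named above could be filtered out on the user side. The following is only a minimal sketch: the `FLAGGED` set and the `drop_flagged` helper are hypothetical names, the set contains just the words mentioned in this comment, and the sample list is hand-written rather than the real NLTK list.

```python
# Words named in the comment above: nonexistent forms or homographs of
# content words. This set is an assumption, not an official correction.
FLAGGED = frozenset({
    "été", "étée", "étées", "étés", "étant",
    "as", "fut", "suis", "est", "sommes", "avions",
})


def drop_flagged(stopwords, flagged=FLAGGED):
    """Return the stopwords without the flagged entries, preserving order."""
    return [word for word in stopwords if word not in flagged]


# Hand-written sample standing in for the real list. With NLTK installed and
# the stopwords corpus downloaded, nltk.corpus.stopwords.words("french")
# could be passed in instead.
sample = ["le", "la", "été", "étées", "et", "avions", "de"]
cleaned = drop_flagged(sample)
```

Whether a homograph such as "est" should really be kept depends on the corpus, so the set is best treated as a starting point to adjust.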

stevenbird commented 3 months ago

Has anyone published a definitive list of stopwords for French?

ekaf commented 3 months ago

NLTK's stopwords lists come from the Snowball project, but someone added aberrant forms like "ayantes" to the French list. An easy solution could be to just go back to the original list.

A definitive list is not likely, because the criteria vary according to the purpose of the analysis: sometimes you don't want to entirely discard "to be or not to be".
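
To illustrate the point about purpose, here is a small sketch of a filter with a per-task override. The `remove_stopwords` function, its `keep` parameter, and the tiny English stopword set are all hypothetical and only serve the example.

```python
def remove_stopwords(tokens, stopwords, keep=()):
    """Drop tokens found in `stopwords`, except those listed in `keep`."""
    active = set(stopwords) - set(keep)
    return [token for token in tokens if token.lower() not in active]


# Illustrative stopword set, not taken from any published list.
STOPWORDS = {"to", "be", "or", "not", "the"}

tokens = "to be or not to be".split()

# With the plain list, the whole phrase disappears.
all_removed = remove_stopwords(tokens, STOPWORDS)

# With an override chosen for the task, part of the phrase survives.
partly_kept = remove_stopwords(tokens, STOPWORDS, keep={"be", "not"})
```

The same base list can then serve several analyses, each with its own `keep` set.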

Asking ChatGPT-4o for a "definitive list" produced this: fr-stopwords-4o-definitive.txt

When asked whether a definitive list can ever exist, it explained that, although such lists may not be definitive, they serve as a practical tool and often need to be adapted to their purpose: fr-stopwords-4o-exist.txt