miso-belica / jusText

Heuristic based boilerplate removal tool
https://pypi.python.org/pypi/jusText
BSD 2-Clause "Simplified" License

Broken stopword list (German) #10

Open schreon opened 10 years ago

schreon commented 10 years ago

https://github.com/miso-belica/jusText/blob/dev/justext/stoplists/German.txt

Most of those words are not stop words. For example "Saison", "Jahrhunderts", "Titel" and many more.

miso-belica commented 10 years ago

Yes, "stop words" is not the best name. The words are actually the most frequent words. Maybe I'll rename the list to "frequent words" or something like that in the future.

schreon commented 10 years ago

The first ~100 words may count as stop words or very frequent words, but those below definitely do not belong to the most frequent words in German. "Friedrich" is a name, "werden." contains a dot, and "französischen" means "French" and is definitely not one of the most frequent words in German. This list leads to totally unexpected results.

miso-belica commented 10 years ago

Sorry, but I didn't create the lists of words, so I can't tell you what corpus the German most frequent words were created from, but I can imagine a text corpus where words like "französischen" are among the most frequent; I believe you can too. I'll contact the original author of the library and ask him for details.

But what do you suggest? Do you have a better list of German (or any other language) frequent words? You write that this list leads to totally unexpected results - do you have data on which to test/evaluate whether another German frequent-words list is better than the one provided by jusText? Can you send me the frequent-words list and/or the data? Or would you just like to remove some words like "französischen" from the current list? Do you have other suggestions?

I would like to change (and potentially improve) the frequent-words list, but I have no data to evaluate the impact on the algorithm, and I'm not sure it would be better than the current one.

schreon commented 10 years ago

I believe the corpus used by the original author was too small. If generating custom stopword lists from big corpora is an option, why not use the Wikipedia dumps? The Wikipedias contain so many articles that generating a stopword list from frequency statistics should be viable. Moreover, the articles are available in XML, so the actual content can be extracted (so that disclaimers etc. do not spoil the results).
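
Just to illustrate what I mean, a rough sketch (not jusText code) of deriving such a list from plain text extracted from a dump; the file name and the cut-off of 100 words are placeholders:

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\w+", re.UNICODE)

def most_frequent_words(corpus_path, top_n=100):
    """Return the top_n most frequent words in a plain-text corpus."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as corpus:
        for line in corpus:
            # Lowercase and strip punctuation so "werden." and "Werden"
            # collapse into a single entry.
            counts.update(match.group().lower() for match in WORD_RE.finditer(line))
    return [word for word, _ in counts.most_common(top_n)]

if __name__ == "__main__":
    # "dewiki_extracted.txt" is a made-up name for article text extracted from a dump.
    for word in most_frequent_words("dewiki_extracted.txt"):
        print(word)
```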

The unexpected results occurred when I was applying jusText to some scraped German news articles. Some shorter paragraphs about Germany ("Deutschland", which is listed in the current jusText stopword list) were filtered out by jusText when they should not have been. Unfortunately, this was some time ago and I no longer have those articles at hand. I will look into it when I have time and come back as soon as I have something reproducible.

Until then, the easiest way might be to use the Snowball stopword list:

http://snowball.tartarus.org/algorithms/german/stop.txt
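
As far as I can tell, the stoplist argument of justext.justext() is just a collection of words, so the Snowball list should drop in where justext.get_stoplist("German") is used today. A sketch, assuming the list has been saved locally with one word per line and the "|" comments stripped ("german_snowball.txt" is a made-up file name):

```python
import requests
import justext

# Made-up local copy of the Snowball German stop word list, one word per line.
with open("german_snowball.txt", encoding="utf-8") as f:
    snowball_stoplist = frozenset(line.strip() for line in f if line.strip())

response = requests.get("https://de.wikipedia.org/wiki/Boilerplate")
paragraphs = justext.justext(response.content, snowball_stoplist)
for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
        print(paragraph.text)
```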

Thanks for the time!

miso-belica commented 10 years ago

I'm waiting for a reply from Jan (the original author) to my e-mail. After that I'll try to improve the frequent-words list. Thanks for the report :)

DavidNemeskey commented 5 years ago

The Hungarian list is "broken" as well: it contains lots of content words (e.g. civil war, nighttime, humanity), proper nouns (Adam, Spain), words with commas or dots attached to them, lowercase/uppercase versions of the same word, etc.

BTW, will jusText still work if I remove the attached dots/commas from the words, or will it miss words in the text whenever the exact string between whitespace characters differs? The same question applies to the upper-/lowercase versions.
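
If nobody knows offhand, I could probably check it empirically with something like the sketch below: classify the same document with stoplist variants that differ only in case/punctuation and compare the outcome. The HTML snippet and word lists are made up; only the documented justext.justext() call and the is_boilerplate attribute are assumed.

```python
import justext

# Made-up test document: a short German paragraph dominated by common words.
html = """<html><body>
<p>Das werden wir sehen, und das werden wir auch noch sehen,
denn das werden wir ganz sicher irgendwann einmal sehen.</p>
</body></html>"""

# Stoplist variants differing only in case and attached punctuation.
variants = {
    "lowercase, no punctuation": frozenset({"das", "werden", "wir", "und", "auch"}),
    "capitalised / with dot": frozenset({"Das", "werden.", "Wir", "Und", "Auch"}),
}

for name, stoplist in variants.items():
    paragraphs = justext.justext(html.encode("utf-8"), stoplist)
    kept = [p.text for p in paragraphs if not p.is_boilerplate]
    print(name, "->", "kept" if kept else "dropped")
```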