miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.
https://miso-belica.github.io/sumy/
Apache License 2.0
3.46k stars, 525 forks

Add Swedish stopwords #195

Closed RobertMartinis closed 8 months ago

RobertMartinis commented 1 year ago

Adds Swedish stop words for the stemmer.

miso-belica commented 11 months ago

Hello, thank you for the PR. Could you please explain some of these words to me? I don't speak Swedish, but after translating them I think some are suspicious and not suitable as stopwords.

Stopwords:

aderton (Not a real stopword); adertonde (Not a real stopword); adjö (goodbye); aldrig (never); alla (all); allas (everyone's); allt (everything); alltid (always); alltså (therefore); andra (other); andras (others'); annan (another); annat (another); artonde (eighteenth) (Not a real stopword); arton (eighteen) (Not a real stopword); att (to); av (of)

bakom (behind); bara (only); behöva (need); behövas (needed); behövde (needed); behövt (needed); beslut (decision); beslutat (decided); beslutit (decided); bland (among); blev (became); bli (become); blir (becomes); blivit (become); bort (away); borta (away); bra (good); bäst (best); bättre (better); båda (both); bådas (both's) (Not a real stopword)

dag (day); dagar (days); dagarna (the days); dagen (the day); de (they, the); del (part); delen (the part); dem (them); den (the); denna (this); deras (their); dess (its); dessa (these); det (it); detta (this); dig (you, object form); din (your); dina (your); dit (there); ditt (your); dock (though); dom (they) (informal); du (you); där (there); därför (therefore); då (then)

e (and) (Not a real stopword); efter (after); eftersom (because); ej (not) (Not a real stopword); elfte (eleventh) (Not a real stopword); eller (or); elva (eleven) (Not a real stopword); emot (against); en (a, an, one); enkel (simple); enkelt (simply); enkla (simple) (Not a real stopword); enligt (according to); ens (even); er (your); era (yours) (Not a real stopword); ers (yours) (Not a real stopword); ert (yours) (Not a real stopword); ett (a, an, one); ettusen (one thousand) (Not a real stopword)

fanns (was, were) (Not a real stopword); fem (five) (Not a real stopword); femte (fifth) (Not a real stopword); femtio (fifty) (Not a real stopword); femtionde (fiftieth) (Not a real stopword); femton (fifteen) (Not a real stopword); femtonde (fifteenth) (Not a real stopword); fick (got) (Not a real stopword); fin (nice) (Not a real stopword); finnas (exist) (Not a real stopword); finns (exist) (Not a real stopword); fjorton (fourteen) (Not a real stopword); fjortonde (fourteenth) (Not a real stopword); fjärde (fourth) (Not a real stopword); fler (more) (Not a real stopword); flera (several) (Not a real stopword)

RobertMartinis commented 9 months ago

Hi!

I used the spaCy package's list of Swedish stop words as my source, and I believe this list is widely used for Swedish language processing.

miso-belica commented 8 months ago

Yes, I know other projects use a lot of stopwords. I have always tried to use only the general ones that carry no meaning at all. Some translate to "eighteen" or "goodbye", which I would omit. But because I have no knowledge of the Swedish language, we can merge. Maybe someone will improve it if needed. Can I also ask you for a test to make sure the language works with all components?
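
To illustrate the criterion described above (keep only general function words, drop numerals and greetings), here is a minimal sketch in plain Python. The helper `prune_stop_words` and both word collections are hypothetical and exist only for this example; they are not part of sumy or spaCy, and the sample words are taken from the list discussed in this thread.

```python
# Hypothetical helper, not part of sumy: remove content-bearing words
# (numerals, greetings) from a candidate stop-word list.

# A few candidates taken from the list discussed in this thread.
CANDIDATES = ["att", "av", "adjö", "arton", "artonde", "eller", "en", "fem", "femtio", "det"]

# Words flagged in the thread as carrying meaning (assumed selection for illustration).
CONTENT_BEARING = {"adjö", "arton", "artonde", "fem", "femtio"}


def prune_stop_words(candidates, content_bearing):
    """Return candidates in their original order, normalized and de-duplicated,
    without any word listed in content_bearing."""
    excluded = {word.strip().casefold() for word in content_bearing}
    seen = set()
    kept = []
    for word in candidates:
        normalized = word.strip().casefold()
        # Skip blanks, flagged words, and repeats.
        if not normalized or normalized in excluded or normalized in seen:
            continue
        seen.add(normalized)
        kept.append(normalized)
    return kept


if __name__ == "__main__":
    for word in prune_stop_words(CANDIDATES, CONTENT_BEARING):
        print(word)
```

Deciding which words actually belong in the excluded set still needs someone who knows the language; the sketch only makes that decision explicit and reviewable.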

RobertMartinis commented 8 months ago

I see. I've now removed some stop words that I believe were not as suitable. I have also added a test for the Swedish stemmer.

RobertMartinis commented 8 months ago

Can we merge, or is there anything else that needs to be addressed?

JakobPaulsson commented 8 months ago

Great contribution, good job @RobertMartinis!