spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License

Equivalent to nltk.corpus stopwords #947

Open Utopiah opened 2 years ago

Utopiah commented 2 years ago

Hi, I'm just learning about the project and it's pretty amazing. I tinkered with NLTK and Gensim before, but this is so convenient to explore and embed on a page. Learning with Observable notebooks is also great!

That being said, I end up with a lot of noise in my selection. I tried a bit of normalize() and remove() with encouraging results. Still, I'm quite surprised that when I search this repository I don't seem to find any stop words.

This made me wonder, is this the "wrong" way in this context? Is the philosophy of compromise not to rely on such lists?

PS: I apologize for hijacking issues, but is there a forum/chat/platform for discussions on using compromise that would be a better place? I have other questions, like using .tfidf() on .ngrams(), but I don't want to create noise here.

spencermountain commented 2 years ago

hey Fabien, you're talking about the results of the wikipedia plugin right?

Yeah, super noisy. it really needs a lot of work. Yeah, i was using a stop-list here but that was just me eyeballing it. It could really use a PR, if you want to take a swing at it.

To do it properly, we should also add (some!) wikipedia redirects. I held off because the results were still so rowdy. cheers
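
For anyone landing here, a minimal sketch of the kind of stop-list filtering discussed above, in plain JavaScript with no dependencies. It does not use compromise's API or the wikipedia plugin's actual list: `STOP_WORDS`, `removeStopWords`, and `countTerms` are hypothetical names made up for illustration, and the word list is a tiny hand-picked sample, far shorter than something like NLTK's English list.

```javascript
// Hypothetical, deliberately tiny stop list (an assumption for illustration only).
const STOP_WORDS = new Set(["a", "an", "and", "the", "of", "in", "is", "it", "to", "on"]);

// Lowercase the text, split on anything that is not a letter, digit,
// or apostrophe, then drop empty strings and stop words.
function removeStopWords(text, stopWords = STOP_WORDS) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9']+/)
    .filter((token) => token.length > 0 && !stopWords.has(token));
}

// Count how often each remaining term occurs.
function countTerms(tokens) {
  const counts = new Map();
  for (const token of tokens) {
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  return counts;
}

const tokens = removeStopWords("The history of the city is a history of trade.");
const counts = countTerms(tokens);
console.log(tokens); // [ 'history', 'city', 'history', 'trade' ]
console.log(counts.get("history")); // 2
```

The regex split only handles ASCII text; it stands in for a real tokenizer. The same filter could presumably be applied to terms produced by compromise instead, with the list tuned by hand as described above.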