mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

Stop words not recognized #513

Open AnissaPierre opened 6 years ago

AnissaPierre commented 6 years ago

- dan: Indonesian for "and"
- di: Italian for "of"
- في: Arabic for "in"
- به: Persian for "to"
- در: Persian for "in"
- از: Persian for "from"
- pada: Indonesian for "on"
- na: Bulgarian for "on"
- و: Persian for "and"
- é: Portuguese for "is"

rahulbot commented 6 years ago

Note: these show up among the top words in our system if you search for everything since the beginning of time without using a language filter. Not a huge priority, but it does make us look bad.

pypt commented 6 years ago

Could you post a sample API query for the issue? Is it word cloud generation?

Also note that we don’t support Persian, Bulgarian, or Indonesian.

hroberts commented 6 years ago

If we can detect Persian, Bulgarian, and Indonesian, it's worth quickly adding some basic stopwords for those languages.

-hal
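To illustrate the suggestion above, here is a minimal sketch of per-language basic stop word lists applied while counting top words. It is not Media Cloud's actual implementation: `BASIC_STOPWORDS`, `top_words`, and the use of ISO 639-1 codes as keys are hypothetical, and the word lists contain only the examples reported in this issue (with Bulgarian "na" assumed to be the Cyrillic "на").

```python
from collections import Counter

# Hypothetical seed lists keyed by ISO 639-1 language code.
# The words are only the examples reported in this issue.
BASIC_STOPWORDS = {
    "id": {"dan", "pada"},          # Indonesian
    "fa": {"و", "به", "در", "از"},  # Persian
    "bg": {"на"},                   # Bulgarian; assumed Cyrillic form of "na"
}


def top_words(tokens, language, limit=10):
    """Count tokens, skipping stop words for the detected language.

    If the language has no list, nothing is filtered.
    """
    stopwords = BASIC_STOPWORDS.get(language, set())
    counts = Counter(
        token.lower() for token in tokens if token.lower() not in stopwords
    )
    return counts.most_common(limit)
```

For example, `top_words(["dan", "media", "pada", "media", "cloud"], "id")` drops "dan" and "pada" and counts only the remaining words. A language without a list passes through unfiltered, which is the behaviour reported above.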
