mediacloud / news-search-api

Internal API server that offers search access to the Media Cloud Online News Archive (in Elasticsearch).
https://mediacloud.org
GNU Affero General Public License v3.0

Top words for early 2022 are all Vietnamese? #78

Open pgulley opened 4 months ago

pgulley commented 4 months ago

@philbudne While investigating the status of re-indexing data from 2022, I noticed that the top terms for a query from 2022-01-01 to 2022-12-31 seem to be entirely Vietnamese words, despite Vietnamese not being among the top 10 languages represented!

pgulley commented 4 months ago

The lack of Vietnamese stopwords might be part of the issue, but it probably doesn't fully explain this.
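
To illustrate how a missing stopword list can skew top terms, here is a minimal sketch. It is not the actual top-terms code: `top_terms`, the whitespace tokenizer, the toy documents, and the two-word `vi_stop_words` set are all assumptions made up for this example.

```python
from collections import Counter

def top_terms(texts, stop_words=frozenset(), n=2):
    """Return the n most frequent lowercase, whitespace-split tokens
    that are not in stop_words. (Hypothetical helper for illustration.)"""
    counts = Counter(
        token
        for text in texts
        for token in text.lower().split()
        if token not in stop_words
    )
    return [term for term, _ in counts.most_common(n)]

# Toy documents, not real archive data.
docs = ["và của tin tức", "và của bóng đá", "và tin tức"]

# Hypothetical two-word stopword list; a real list would be much longer.
vi_stop_words = {"và", "của"}

unfiltered = top_terms(docs)               # function words can dominate
filtered = top_terms(docs, vi_stop_words)  # function words are dropped
```

Without the list, the function word "và" ranks first; with it, only content words remain. That would explain function words showing up, but not why Vietnamese outranks more common languages in the first place.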

philbudne commented 2 weeks ago

It may just be because I run queries against all stories when looking at progress running historical backfills, and we have some REALLY spammy .vn sources!!

philbudne commented 1 week ago

Was traipsing through mc-providers and noticed that there is no vi_stop_words.txt file in https://github.com/mediacloud/mc-providers/tree/main/mc_providers/language/
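
A quick way to check for other gaps like this, as a rough sketch: it assumes the stopword files follow the `<code>_stop_words.txt` naming suggested by `vi_stop_words.txt`, and that you have a local checkout of that directory. The function name and the language codes passed in are hypothetical.

```python
from pathlib import Path

def missing_stop_word_files(language_dir, language_codes):
    """Return the language codes with no <code>_stop_words.txt file
    in language_dir. (Assumes that file-naming pattern.)"""
    directory = Path(language_dir)
    return [
        code
        for code in language_codes
        if not (directory / f"{code}_stop_words.txt").is_file()
    ]
```

For example, `missing_stop_word_files("mc_providers/language", ["en", "vi"])` run from a checkout of mc-providers would list whichever of those codes lack a file.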