Open pgulley opened 4 months ago
The missing Vietnamese stopwords might be part of the issue, but they probably don't fully explain this
It may just be because I run queries against all stories when checking the progress of historical backfills, and we have some REALLY spammy .vn sources!!
Was traipsing thru mc-providers and noticed that there is no vi_stop_words.txt file in https://github.com/mediacloud/mc-providers/tree/main/mc_providers/language/
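For anyone who wants to check which other languages lack a stopword file, here is a minimal sketch. It assumes a local checkout of mc-providers and that every file follows the `<code>_stop_words.txt` pattern implied by the filename above. The helper name `missing_stopword_files` is hypothetical and not part of mc-providers.

```python
from pathlib import Path


def missing_stopword_files(language_dir, codes):
    """Return the language codes with no <code>_stop_words.txt in language_dir.

    Assumption: stopword files are named <code>_stop_words.txt, as the
    vi_stop_words.txt filename mentioned above suggests.
    """
    base = Path(language_dir)
    return [code for code in codes if not (base / f"{code}_stop_words.txt").is_file()]


if __name__ == "__main__":
    # Hypothetical path to a local checkout; adjust to wherever the repo lives.
    checkout = "mc-providers/mc_providers/language"
    print(missing_stopword_files(checkout, ["en", "es", "vi"]))
```

Passing in the language codes that show up most often in the index would indicate whether Vietnamese is the only gap or one of several.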
@philbudne While investigating the status of re-indexing data from 2022, I noticed that the top-terms for a query from 2022-01-01 to 2022-12-31 seem to be populated entirely with Vietnamese words, despite Vietnamese not being among the top 10 languages represented!