huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

search function and stopwords #3104

Open stcoats opened 4 days ago

stcoats commented 4 days ago

It seems that the dataset-viewer search function returns no hits for terms such as “what”, “can”, “which”, and so on. Has the indexing function removed stopwords like this? The rows are returned if one uses the SQL console, but for a dataset that includes audio files, the rows returned in the SQL console don’t give access to the audio column. Is there a way to search for stopwords like these in the default dataset viewer? It would be really useful if all of the textual content in a column were searchable.

AndreaFrancis commented 1 day ago

> Has the indexing function removed stopwords like this?

Yes, we use a default list of stopwords, which contains 571 words, including "what," "can," and "which." You can view the complete list here: DuckDB English Stopwords List.
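To illustrate why a stopword query comes back empty, here is a minimal sketch of an inverted index that drops stopwords at indexing time. It is a toy model, not the dataset-viewer or DuckDB implementation: the `STOPWORDS` set is a tiny hand-picked subset of the real list, and `tokenize`, `build_index`, and `search` are hypothetical helpers.

```python
import re
from collections import defaultdict

# Tiny illustrative subset (assumption); the real default list has 571 entries.
STOPWORDS = {"what", "can", "which", "the", "a", "is"}

def tokenize(text):
    """Lowercase the text and split on anything that is not a-z."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def build_index(rows, stopwords):
    """Map each non-stopword token to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, text in enumerate(rows):
        for token in tokenize(text):
            if token not in stopwords:
                index[token].add(row_id)
    return index

def search(index, query, stopwords):
    """Return row ids matching any query token that survives stopword filtering."""
    hits = set()
    for token in tokenize(query):
        if token not in stopwords:
            hits |= index.get(token, set())
    return sorted(hits)

rows = ["What can I do for you?", "Which road leads to the station?"]

filtered = build_index(rows, STOPWORDS)
print(search(filtered, "what", STOPWORDS))     # []
print(search(filtered, "station", STOPWORDS))  # [1]

unfiltered = build_index(rows, set())
print(search(unfiltered, "what", set()))       # [0]
```

With the stopword set applied, "what" is never stored in the index, so no query can match it. With an empty set, the same row becomes searchable.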

However, as @severo mentioned in this discussion, we now support language-specific stemmers for monolingual datasets, so applying a default English stopwords list to every language no longer makes sense. DuckDB currently lacks a straightforward way to assign stopwords based on language as it does for stemmers (we would need to seed a stopwords table for non-English datasets). For now, the best approach is therefore to set the stopwords parameter to 'none'. cc @lhoestq
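As a rough sketch of what setting the stopwords parameter to 'none' could look like, the helper below builds a DuckDB `PRAGMA create_fts_index` statement as a string. The `stemmer` and `stopwords` parameters, and their defaults `'porter'` and `'english'`, come from DuckDB's full-text search extension. The helper name `create_fts_index_sql`, the table name `data`, and the column names `row_id` and `text` are assumptions for illustration only and may not match what dataset-viewer uses.

```python
def sql_literal(value):
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"

def create_fts_index_sql(table, id_column, text_columns,
                         stemmer="porter", stopwords="english"):
    """Build a DuckDB `PRAGMA create_fts_index` statement (hypothetical helper)."""
    positional = ", ".join(sql_literal(a) for a in [table, id_column, *text_columns])
    return (
        f"PRAGMA create_fts_index({positional}, "
        f"stemmer={sql_literal(stemmer)}, "
        f"stopwords={sql_literal(stopwords)}, overwrite=1);"
    )

# Table and column names are placeholders, not dataset-viewer's actual schema.
statement = create_fts_index_sql("data", "row_id", ["text"], stopwords="none")
print(statement)
# PRAGMA create_fts_index('data', 'row_id', 'text', stemmer='porter', stopwords='none', overwrite=1);
```

The resulting statement would be executed on a DuckDB connection with the `fts` extension loaded. With `stopwords='none'`, terms such as "what" stay in the index and become searchable.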

If users want to remove stopwords for specific monolingual datasets (e.g., English), this could be a candidate for a custom configuration at the dataset card level. Keep in mind that removing stopwords like "what," "can," or "which" helps focus on more meaningful terms, improving search relevance. It also reduces the size of the search index and speeds up queries, which is crucial for performance in the dataset viewer.
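The index-size effect can be seen even in a toy setting. The sketch below counts the (token, row) pairs that an inverted index would need to store, with and without stopword removal. The stopword set and sample rows are made up for illustration, and the numbers say nothing about real savings on actual datasets.

```python
import re

# Illustrative stopword subset (assumption), not the full 571-word list.
STOPWORDS = {"what", "can", "which", "the", "to", "i", "for", "you", "do"}

def count_postings(rows, stopwords):
    """Count distinct (token, row_id) pairs an inverted index would store."""
    postings = set()
    for row_id, text in enumerate(rows):
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in stopwords:
                postings.add((token, row_id))
    return len(postings)

rows = ["What can I do for you?", "Which road leads to the station?"]

with_stopwords_kept = count_postings(rows, set())
with_stopwords_removed = count_postings(rows, STOPWORDS)
print(with_stopwords_kept, with_stopwords_removed)  # 12 3
```

The trade-off is visible in the first row: once its stopwords are removed, no token from it remains in the index, so it cannot be found by text search at all.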