huggingface / dataset-viewer

Lightweight web API for visualizing and exploring any dataset - computer vision, speech, text, and tabular - stored on the Hugging Face Hub
https://huggingface.co/docs/datasets-server
Apache License 2.0

WIP: Try to get languages from librarian bot PR for FTS #2968

Closed. AndreaFrancis closed this PR 4 days ago.

AndreaFrancis commented 6 days ago

This PR is still in progress, but it suggests a way to get the dataset language from the librarian bot's open PRs. Please let me know what you think.
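
For context, here is a minimal sketch of the idea using huggingface_hub's discussions API. The function name and the diff-parsing regex are my illustrations, not the PR's actual code; the regex assumes the bot's diffs add language codes as "+- xx" lines under a "+language:" block:

```python
# Sketch: find an open librarian-bot PR on a dataset repo and extract the
# language codes it proposes for the README's YAML header.
import re

from huggingface_hub import HfApi


def languages_from_librarian_bot(dataset: str) -> list[str]:
    api = HfApi()
    for discussion in api.get_repo_discussions(repo_id=dataset, repo_type="dataset"):
        if not (discussion.is_pull_request and discussion.status == "open"):
            continue
        if discussion.author != "librarian-bot":
            continue
        details = api.get_discussion_details(
            repo_id=dataset, discussion_num=discussion.num, repo_type="dataset"
        )
        # Assumption about the diff shape: added language lines look like "+- ko".
        return re.findall(r"^\+- ([a-z]{2,3})$", details.diff or "", flags=re.M)
    return []
```

For the dataset linked below, this would be expected to return ['ko'].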

Pending: tests, refactor. For https://huggingface.co/datasets/Osumansan/data-poison/commit/22432ba97e6c559891bd82ca084496a7f8a6699f.diff, the code was able to identify the 'ko' language, but since Korean is not supported by DuckDB's stemmers, it assigns 'none' as the stemmer.
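
A sketch of that stemmer fallback; the mapping mirrors the Snowball stemmers listed in DuckDB's FTS documentation, and pick_stemmer is a hypothetical helper name:

```python
# ISO 639-1 code -> DuckDB FTS stemmer; anything missing falls back to 'none'.
DUCKDB_STEMMERS = {
    "ar": "arabic", "eu": "basque", "ca": "catalan", "da": "danish",
    "nl": "dutch", "en": "english", "fi": "finnish", "fr": "french",
    "de": "german", "el": "greek", "hi": "hindi", "hu": "hungarian",
    "id": "indonesian", "ga": "irish", "it": "italian", "lt": "lithuanian",
    "ne": "nepali", "no": "norwegian", "pt": "portuguese", "ro": "romanian",
    "ru": "russian", "sr": "serbian", "es": "spanish", "sv": "swedish",
    "ta": "tamil", "tr": "turkish",
}


def pick_stemmer(lang_code: str) -> str:
    """Return the DuckDB stemmer for a language code, 'none' if unsupported."""
    return DUCKDB_STEMMERS.get(lang_code, "none")


assert pick_stemmer("en") == "english"
assert pick_stemmer("ko") == "none"  # Korean has no Snowball stemmer in DuckDB
```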

cc. @davanstrien

lhoestq commented 6 days ago

The librarian bot often runs after the DuckDB index is done, no? I'm not sure it would help much versus making things more complex.

polinaeterna commented 5 days ago

> The librarian bot often runs after the DuckDB index is done

But a dataset might also be updated at some point; in that case, the librarian bot's PR should already be there, am I right?

To me, this doesn't add a lot of complexity. I think it's very nice and reasonable: why not use this "free" opportunity to make the search tool more precise and reliable?

Alternatively, to make it work independently of whether a librarian bot PR is open, would it make sense to reuse the bot's code and create a new processing step for language detection? It runs fasttext over 1000 rows, so it shouldn't be computationally intensive (see the sketch below). But I don't know how many cases it would cover or whether it's worth the effort. How many datasets do not have a language metadata tag?
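
A minimal sketch of what such a step could look like, assuming the bot uses fastText's lid.176 language-ID model as described; the sample size, the majority threshold, and the function name are my assumptions:

```python
# Sketch: majority-vote language detection over a sample of rows with fastText.
from collections import Counter
from typing import Optional

import fasttext

model = fasttext.load_model("lid.176.bin")  # fastText language-ID model


def detect_language(rows: list[str], min_share: float = 0.5) -> Optional[str]:
    sample = rows[:1000]  # the bot reportedly samples ~1000 rows
    if not sample:
        return None
    votes: Counter = Counter()
    for text in sample:
        labels, _ = model.predict(text.replace("\n", " "))  # fastText rejects '\n'
        votes[labels[0].removeprefix("__label__")] += 1
    lang, count = votes.most_common(1)[0]
    # Only report a language if it clearly dominates the sample.
    return lang if count / len(sample) >= min_share else None
```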

lhoestq commented 5 days ago

Most datasets are quite static, which is why I'm a bit concerned about relying on librarian bot info that might not be available.

Having automatic language tags on the viewer side does fix this issue, though (ideally with a call to action in the UI to open a PR to fix the language if it made a mistake). And it would help with filtering by language, which might end up being even more useful than the search improvement itself?
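
As a sketch, that call to action could open a metadata PR with huggingface_hub's metadata_update helper; the repo id and tag below are illustrative:

```python
# Sketch: propose a corrected language tag via a PR rather than a direct push.
from huggingface_hub import metadata_update

url = metadata_update(
    repo_id="user/some-dataset",    # illustrative dataset id
    metadata={"language": ["ko"]},  # the corrected tag
    repo_type="dataset",
    overwrite=True,                 # replace an existing (wrong) language tag
    create_pr=True,                 # open a PR instead of committing to main
    commit_message="Fix automatically detected language tag",
)
print(url)  # URL of the opened PR
```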

AndreaFrancis commented 5 days ago

> would it make sense to reuse the bot's code and create a new processing step for language detection?

I also agree it would make more sense to have a processing step using the librarian bot code. Even if we don't use this data in the UI, it would give us an idea of the datasets' language statistics. If we aim to democratize from the data side, how can we measure that goal without data showing which languages we cover?

davanstrien commented 5 days ago

> would it make sense to reuse the bot's code and create a new processing step for language detection?

This would be very nice, IMO. This has come up before: I think having a UI that flags metadata as automatically generated, which users could then fix or accept as authoritative, could be very cool. This could potentially be extended in the future to other steps.