Closed AndreaFrancis closed 4 days ago
The librarian bot often runs after the DuckDB index is built, no? I'm not sure it would help much, and it would make things more complex.
> the librarian bot is often run after the duckdb index is done
But a dataset might also be updated at some point, and in that case the librarian bot's PR should already be there, am I right?
To me, this doesn't add much complexity. I think it's very nice and reasonable: why not use this "free" opportunity to make the search tool more precise and reliable?
Alternatively, to make it work regardless of whether a librarian bot PR is open, would it make sense to reuse the bot's code and create a new processing step for language detection? It uses fasttext over 1000 rows, so it shouldn't be computationally intensive. But I don't know how many cases it would cover or whether it's worth the effort. How many datasets do not have a language metadata tag?
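To make the idea concrete, here is a minimal sketch of what such a step could look like: sample up to 1000 rows, detect a language per row, and keep the majority language only if it is dominant enough. The names `detect_dataset_language`, `detect_row`, and `min_share` are hypothetical and not taken from the librarian bot's code; `detect_row` stands in for whatever per-text detector is used (for example a fasttext language-identification model).

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def detect_dataset_language(
    rows: Iterable[str],
    detect_row: Callable[[str], Optional[str]],  # hypothetical per-text detector
    max_rows: int = 1000,   # sample size mentioned in the discussion
    min_share: float = 0.5,  # assumed threshold, not from the bot's code
) -> Optional[str]:
    """Return the majority language over a sample of rows, or None if unclear."""
    counts: Counter = Counter()
    for i, text in enumerate(rows):
        if i >= max_rows:
            break
        lang = detect_row(text)
        if lang is not None:
            counts[lang] += 1
    if not counts:
        return None
    lang, n = counts.most_common(1)[0]
    # Only report a language when it covers a large enough share of the sample.
    return lang if n / sum(counts.values()) >= min_share else None
```

Keeping the detector as a parameter means the step can be tested without downloading a model, and the threshold can be tuned once we know how noisy the predictions are.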
Most datasets are quite static, which is why I'm a bit concerned about relying on librarian bot info that might not be available.
Having automatic language tags on the viewer side does fix this issue, though (ideally with a call to action in the UI to open a PR to fix the language if it made a mistake). And it would help with filtering by language, which could end up being even more useful than the search improvement itself.
> would it make sense to reuse the bot's code and create a new processing step for language detection?
I also agree it would make more sense to have a processing step using the librarian bot code. Even if we don't use this data in the UI, it would give us an idea of the language statistics across datasets. If we aim to democratize from the data side, how can we measure progress toward that goal without data showing which languages we cover?
> would it make sense to reuse the bot's code and create a new processing step for language detection?
This would be very nice, IMO. This came up a bit before, but I think having a UI that flags metadata as automatically generated, which could then be fixed or accepted as authoritative, could be very cool. This could potentially be extended in the future to add other steps.
This PR is still in progress, but it is a proposal for how to get the language from the librarian bot's open PRs. Please let me know what you think.
Pending: tests, refactor. For https://huggingface.co/datasets/Osumansan/data-poison/commit/22432ba97e6c559891bd82ca084496a7f8a6699f.diff , it was able to identify the 'ko' language, but since Korean is not supported by DuckDB's stemmers, it assigns 'none' as the stemmer.
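For illustration only, the flow described above could be sketched roughly as follows. This is not the code in this PR: the function names, the ISO-code-to-stemmer table, the "more than one language falls back to 'none'" rule, and the assumed shape of the diff (a `language:` key added to the README YAML header, followed by `- <code>` items) are all assumptions. The stemmer names are the ones listed for DuckDB's full-text search extension.

```python
from typing import List

# Assumed mapping from ISO 639-1 codes to DuckDB FTS stemmer names.
ISO_TO_STEMMER = {
    "ar": "arabic", "eu": "basque", "ca": "catalan", "da": "danish",
    "nl": "dutch", "en": "english", "fi": "finnish", "fr": "french",
    "de": "german", "el": "greek", "hi": "hindi", "hu": "hungarian",
    "id": "indonesian", "ga": "irish", "it": "italian", "lt": "lithuanian",
    "ne": "nepali", "no": "norwegian", "pt": "portuguese", "ro": "romanian",
    "ru": "russian", "sr": "serbian", "es": "spanish", "sv": "swedish",
    "ta": "tamil", "tr": "turkish",
}


def languages_from_diff(diff_text: str) -> List[str]:
    """Collect language codes from lines added under a 'language:' key."""
    langs: List[str] = []
    in_language_block = False
    for line in diff_text.splitlines():
        # Only look at added lines; skip the '+++ b/...' file header.
        if not line.startswith("+") or line.startswith("+++"):
            in_language_block = False
            continue
        added = line[1:].strip()
        if added == "language:":
            in_language_block = True
        elif in_language_block and added.startswith("- "):
            langs.append(added[2:].strip())
        else:
            in_language_block = False
    return langs


def stemmer_for(langs: List[str]) -> str:
    """Pick a stemmer; fall back to 'none' when unsupported or ambiguous."""
    if len(langs) != 1:
        return "none"  # assumption: no single stemmer fits multilingual data
    return ISO_TO_STEMMER.get(langs[0], "none")


# Made-up diff in the assumed shape, not the content of the commit linked above.
sample_diff = "--- a/README.md\n+++ b/README.md\n@@ -0,0 +1,4 @@\n+---\n+language:\n+- ko\n+---\n"
print(stemmer_for(languages_from_diff(sample_diff)))
```

With this fallback, a dataset tagged only as 'ko' is still indexed, just without stemming.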
cc. @davanstrien