Closed: justHungryMan closed this issue 3 months ago
Hi! Indeed, running the LanguageFilter to get metadata['language'] sounds like a good idea.
To keep all the data, as you said, you can currently set the threshold to 0. But it might actually make sense to change the filter and add an option to pass in "all" as a language while still applying thresholds, so that you would keep all languages but still require language_score to be above a certain threshold.
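As a rough sketch of that proposal (the Document class and filter_doc function here are hypothetical stand-ins, not datatrove's actual API), the behavior could look like:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

def filter_doc(doc: Document, languages="all", threshold: float = 0.65) -> bool:
    """Keep a doc if its language is allowed AND its LID score clears the threshold.

    Passing languages="all" keeps every language but still enforces the
    language_score threshold, which is the option discussed above.
    """
    lang = doc.metadata.get("language")
    score = doc.metadata.get("language_score", 0.0)
    if languages != "all" and lang not in languages:
        return False
    return score >= threshold

# Korean doc with a confident LID score: kept even though no language list is given
doc = Document("안녕하세요", metadata={"language": "ko", "language_score": 0.97})
filter_doc(doc)  # True
```

Setting threshold=0 in this sketch reproduces the current "keep everything" workaround.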
Adapting word tokenization to other languages is very much planned, as it is an important part of filtering (the Gopher filters, for example) and also of deduplication (we use word tokenization to select our n-grams for dedup).
We are currently collaborating with some researchers from EPFL who are working with multilingual data, I will ping them and ask them about the specific tokenizer they used for Korean :)
Cool! But setting the LanguageFilter's threshold to 0 just to obtain a language_id value seems awkward. To address this, I've made it possible to extract useful language-ID statistics while also adding language_id and language_score to metadata. https://github.com/huggingface/datatrove/pull/136
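A minimal sketch of that annotate-instead-of-filter idea (annotate_language and the classify callable are illustrative names, not the PR's actual code):

```python
def annotate_language(metadata: dict, text: str, classify) -> dict:
    """Write language_id and language_score into metadata without filtering.

    `classify` is any language-identification callable returning
    (language, score), e.g. a fastText LID model wrapper. Downstream
    pipeline steps can then filter or branch on these fields.
    """
    lang, score = classify(text)
    metadata["language_id"] = lang
    metadata["language_score"] = score
    return metadata

# stand-in classifier for illustration
meta = annotate_language({}, "Bonjour tout le monde", lambda t: ("fr", 0.99))
```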
Please consider this and provide feedback.
Hi! We recently implemented multi-language support for tokenization. We are using the spaCy tokenizer for Korean text.
@vsabolcec Nice work, mecab in spaCy is known to be a good word_tokenizer for Korean.
When do you plan to make a pull request?
added in #147, #187 and #189
Hello,
I'm currently working on text processing that involves filtering (like the Gopher filters) in various languages. Currently, however, the default word_tokenization in the datatrove filters is English-based, as shown in the snippet below:
As it stands, word_tokenize primarily supports English. However, I need to process and tokenize Korean text, which NLTK's word_tokenize does not directly support.
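To illustrate the problem (this is a deliberately naive tokenizer in the spirit of English-oriented word splitting, not datatrove's actual code): a regex/whitespace tokenizer treats space-delimited chunks as words, but Korean attaches particles like 는/에 directly to the noun, so "words" come out agglutinated rather than as morphemes.

```python
import re

def naive_word_tokenize(text: str) -> list[str]:
    """Split on runs of word characters, roughly how an
    English-oriented tokenizer sees text."""
    return re.findall(r"\w+", text)

naive_word_tokenize("Hello, world!")
# ['Hello', 'world'] -- fine for English

naive_word_tokenize("저는 학교에 갑니다")
# ['저는', '학교에', '갑니다'] -- particles 는/에 stay glued to the nouns,
# so word counts and n-grams are skewed for Korean
```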
I'm considering an approach that identifies the language of each document prior to word tokenization, and then uses a language-specific tokenizer if the document is in Korean. This implies the need for a language identifier, or using LanguageFilter (with the threshold set to 0) to determine the document's language.
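The dispatch approach described above could be sketched as follows (all names here are hypothetical; a real Korean tokenizer would call out to e.g. MeCab or spaCy rather than splitting on whitespace):

```python
def tokenize_english(text: str) -> list[str]:
    return text.split()

def tokenize_korean(text: str) -> list[str]:
    # Placeholder: real code would use a morphological analyzer
    # such as MeCab (mecab-ko) or spaCy's Korean pipeline here.
    return text.split()

TOKENIZERS = {"en": tokenize_english, "ko": tokenize_korean}

def word_tokenize(text: str, language: str = "en") -> list[str]:
    """Route to a language-specific tokenizer, using the language
    identified earlier in the pipeline (e.g. from doc metadata)."""
    tokenizer = TOKENIZERS.get(language, tokenize_english)  # English fallback
    return tokenizer(text)
```

The key design point is that the language field written by the identification step selects the tokenizer, with a fallback for unrecognized languages.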
I have a couple of questions and requests for advice:
Thank you for your time and assistance.