huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

Enhancing word_tokenize (like nltk) Support for Multiple Languages #135

Closed: justHungryMan closed this issue 3 months ago

justHungryMan commented 5 months ago

Hello,

I'm currently working on text processing that involves filtering (e.g. Gopher-style filters) across multiple languages. Right now, however, the default word tokenization in datatrove's filters is English-based, as shown in the snippet below:

from nltk.tokenize import word_tokenize

text = doc.text
words = word_tokenize(text)  # TODO we should use language id filter

As it stands, word_tokenize primarily supports English. However, I need to process and tokenize text in Korean and other languages, which NLTK's word_tokenize does not directly support.

I'm considering an approach that identifies the language of each document (doc) prior to word tokenization, and then uses a language-specific tokenizer if the document is in Korean. This implies the need for a language identifier, or using the existing LanguageFilter (with its threshold set to 0) to determine the document's language.
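Roughly, the dispatch I have in mind is sketched below (nothing here is existing datatrove API; the tokenizer registry and the usage comment are purely illustrative):

from typing import Callable

from nltk.tokenize import word_tokenize

# language code -> word tokenizer; a Korean entry would plug in e.g. a
# MeCab-based tokenizer here instead of the English-oriented NLTK default
TOKENIZERS: dict[str, Callable[[str], list[str]]] = {
    "en": word_tokenize,
}

def tokenize(text: str, language: str | None) -> list[str]:
    # fall back to the current default when the language is unknown or unmapped
    tokenizer = TOKENIZERS.get(language or "en", word_tokenize)
    return tokenizer(text)

# inside a filter, assuming a LanguageFilter already set the metadata:
#   words = tokenize(doc.text, doc.metadata.get("language"))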

I have a couple of questions and requests for advice:

  1. Would it be advisable to run a LanguageFilter before the other filtering steps so that a language-specific tokenizer can be used, ensuring that each document's language metadata is available?
  2. I'm aiming to develop a flexible and efficient solution that handles multiple languages gracefully, with a particular focus on adding support for Korean in the near term. Any insights, recommendations, or examples of similar implementations would be greatly appreciated. I'd also like to know what your plans are regarding language IDs.

Thank you for your time and assistance.

guipenedo commented 5 months ago

Hi! Indeed, running the LanguageFilter to get metadata['language'] sounds like a good idea. To keep all the data you can currently, as you said, set the threshold to 0, but it might actually make sense to change the filter and add an option to pass in "all" as the language while still applying thresholds (so that you would keep all languages but still require language_score to be above a certain threshold).
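To illustrate the keep/drop behaviour I mean (a standalone sketch, not the actual LanguageFilter signature; the parameter names are just illustrative):

def keep_document(language: str, language_score: float,
                  languages: str | list[str] = "all",
                  language_threshold: float = 0.65) -> bool:
    # keep every language when "all" is requested, otherwise only listed ones
    if languages != "all" and language not in languages:
        return False
    # in both cases, still require a minimum language identification score
    return language_score >= language_threshold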

Adapting word tokenization to other languages is very much planned, as it is an important part not only of filtering (the Gopher filters, for example) but also of deduplication (we use word tokenization to select our n-grams for dedup).
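For context, the dedup side builds n-grams from word tokens, so the tokenizer directly determines which n-grams end up being compared; a minimal illustration (not datatrove's actual dedup code):

def word_ngrams(words: list[str], n: int = 5) -> list[tuple[str, ...]]:
    # consecutive n-grams over the token sequence; empty list if too short
    return [tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 0))]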

We are currently collaborating with some researchers from EPFL who are working with multilingual data, I will ping them and ask them about the specific tokenizer they used for Korean :)

justHungryMan commented 5 months ago

Cool! But setting the LanguageFilter's threshold to 0 just to obtain a language ID value seems like a workaround. To address this, I've opened a PR that makes it possible to extract useful language-ID statistics while also allowing language_id and language_score to be added to the metadata: https://github.com/huggingface/datatrove/pull/136

Please consider this and provide feedback.

vsabolcec commented 5 months ago

Hi! We recently implemented support for tokenization in multiple languages. We are using the spaCy tokenizer for Korean text.
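For reference, spaCy's Korean tokenizer can be used on its own roughly like this (a minimal sketch, not how it is wired into datatrove; depending on the spaCy version, the default Korean tokenizer may require mecab-ko and natto-py to be installed, since it is MeCab-based):

import spacy

# blank Korean pipeline: tokenizer only, no statistical model required
nlp = spacy.blank("ko")

doc = nlp("한국어 텍스트를 토큰화하는 간단한 예시입니다.")
words = [token.text for token in doc]
print(words)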

justHungryMan commented 5 months ago

@vsabolcec Nice work! MeCab (which spaCy uses for Korean) is known to be a good word tokenizer for Korean.

When do you plan to make a pull request?

guipenedo commented 3 months ago

added in #147, #187 and #189