huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0
1.93k stars 136 forks

Hangs Due to `std::length_error` in Data Processing Pipeline #279

Open justHungryMan opened 1 week ago

justHungryMan commented 1 week ago
2024-09-01 15:00:25.122 | INFO     | datatrove.executor.local:run:120 - Skipping 4095 already completed tasks
2024-09-01 15:00:25.772 | INFO     | datatrove.utils.logging:add_task_logger:58 - Launching pipeline for rank=3925
2024-09-01 15:00:25.772 | INFO     | datatrove.utils.logging:log_pipeline:90 -
--- πŸ› οΈ  PIPELINE πŸ› 
πŸ“– - READER: 🐿 Jsonl
πŸ”» - FILTER: πŸ‘€ Lambda
πŸ”» - FILTER: πŸ‘― Gopher Repetition
πŸ”» - FILTER: πŸ₯‡ Gopher Quality
πŸ”» - FILTER: πŸ‘€ Lambda
πŸ”» - FILTER: β›° C4 Quality
πŸ”» - FILTER: 🍷 FineWeb Quality
πŸ’½ - WRITER: 🐿 Jsonl
2024-09-01 15:00:27.077 | INFO     | datatrove.pipeline.readers.base:read_files_shard:191 - Reading input file 03925.jsonl.gz, 1/2
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_S_create

The only solution I've found is to abandon the task file and forcibly stop the executor. The error message and location are not specific enough to identify the source of the problem. Any insights or suggestions on how to handle this error more gracefully would be appreciated.

SinclairCoder commented 1 week ago

Any better infra that can handle the whole CC processing (i.e., reading, extracting, and writing) than Datatrove lol?

guipenedo commented 1 week ago

Can you share the full script? Curious particularly about the lambda blocks

justHungryMan commented 1 week ago

Hi, @guipenedo

I can only show you the lambda functions.

The first Lambda pre-checks documents for the problem described in issue #277:

from datatrove.utils.typeshelper import Languages
from datatrove.utils.word_tokenizers import load_word_tokenizer

def check_korean_tokenizer_pass(doc):
    # Keep only documents that the Korean word tokenizer can process without raising.
    tokenizer = load_word_tokenizer(Languages.korean)
    try:
        tokenizer.word_tokenize(doc.text)
        return True
    except Exception:
        return False

The second Lambda substitutes for max_non_alpha_words_ratio in GopherQualityFilter. Korean text often mixes Hangul, Chinese characters (related to Korean as Hanja), and English, so we simply drop documents whose Korean-word ratio falls below min_non_korean_words_ratio:

# tokenizer, korean_pattern (a Hangul regex) and min_non_korean_words_ratio
# are defined in the enclosing scope of the original script.
def filter_non_korean_words_ratio(doc):
    words = tokenizer.word_tokenize(doc.text)
    n_words = len(words)
    if n_words == 0:
        return False

    n_korean_words = sum(bool(korean_pattern.search(word)) for word in words)

    if min_non_korean_words_ratio and (n_korean_words / n_words) < min_non_korean_words_ratio:
        return False

    return True
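
For completeness, this is roughly how the two functions plug into the pipeline shown in the log; the paths, task count, and the other filters' parameters below are placeholders rather than my real script:

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import (
    C4QualityFilter,
    FineWebQualityFilter,
    GopherQualityFilter,
    GopherRepetitionFilter,
    LambdaFilter,
)
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter
from datatrove.utils.typeshelper import Languages

executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("path/to/input"),  # reads the 0XXXX.jsonl.gz shards
        LambdaFilter(filter_function=check_korean_tokenizer_pass),
        GopherRepetitionFilter(language=Languages.korean),
        GopherQualityFilter(language=Languages.korean, max_non_alpha_words_ratio=None),
        LambdaFilter(filter_function=filter_non_korean_words_ratio),
        C4QualityFilter(language=Languages.korean),
        FineWebQualityFilter(language=Languages.korean),
        JsonlWriter("path/to/output"),
    ],
    tasks=4096,
    logging_dir="path/to/logs",
)

if __name__ == "__main__":
    executor.run()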
guipenedo commented 1 week ago

This seems to be an issue with the Korean tokenizer: if you look at the project https://github.com/bab2min/kiwipiepy, a good chunk of it is C++, which would make sense given the C++ error you are getting. I imagine one of your documents is triggering this issue in the tokenizer; can you try just tokenizing all the documents in that specific file directly? I'm not sure if this error is try...catchable, but if it is, can you then print the document that triggers it? I tried taking a look at the project's issues, but they're mostly in Korean, so maybe you'll have better luck.
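
Something along these lines might help narrow it down (an untested sketch; the "text" key and the shard path are assumptions). Since the C++ abort can't be caught from Python, it prints each document's index before tokenizing, so the last printed index identifies the offending document even if the process dies:

import gzip
import json

from datatrove.utils.typeshelper import Languages
from datatrove.utils.word_tokenizers import load_word_tokenizer

tokenizer = load_word_tokenizer(Languages.korean)

with gzip.open("03925.jsonl.gz", "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        # print first, so the last index on screen points at the crashing document
        print(i, record.get("id", ""), flush=True)
        tokenizer.word_tokenize(record["text"])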

justHungryMan commented 1 week ago

It does seem that the issue stems from the kiwipiepy tokenizer as you mentioned. However, the main problem is that no error location is provided in the message. Interestingly, despite having an initial filtering step via a lambda function to preemptively catch such errors, it’s unclear whether this error is emanating from that lambda or another filtering function.

I’m working with Common Crawl data, not a private dataset, so I suspect others might encounter similar issues. Given that kiwipiepy is the default Korean tokenizer in datatrove, perhaps we should consider implementing a timeout mechanism similar to what’s used for trafilatura filtering. Additionally, incorporating a bypass feature for problematic documents could help avoid getting stuck. (https://github.com/huggingface/datatrove/issues/277#issuecomment-2314771896) What do you think?
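
As a rough illustration of the bypass I have in mind (not existing datatrove API, just a sketch): run the tokenizer check in a short-lived child process, so that a hang or a hard C++ abort only costs that single document:

import multiprocessing as mp

from datatrove.utils.typeshelper import Languages
from datatrove.utils.word_tokenizers import load_word_tokenizer


def _tokenize_probe(text):
    # Runs in the child process; a crash or hang here cannot take down the parent.
    load_word_tokenizer(Languages.korean).word_tokenize(text)


def check_korean_tokenizer_pass_isolated(doc, timeout=30.0):
    # Drop documents whose tokenization crashes or exceeds the timeout.
    proc = mp.Process(target=_tokenize_probe, args=(doc.text,))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():  # hung: kill the child and drop the document
        proc.terminate()
        proc.join()
        return False
    return proc.exitcode == 0  # non-zero exit (e.g. the std::length_error abort) -> drop

Spawning a process per document is of course slow, so this would only make sense behind a cheaper pre-filter, but it shows the kind of isolation a bypass would need.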

guipenedo commented 1 week ago

The reason why there isn't any useful error message is likely that some external C++ code is being called from that library; you would normally get an error message for other types of errors. If kiwipiepy is not stable enough, we can consider using the spaCy Korean tokenizer instead. I don't speak Korean, but maybe you have some insight into which one might be better?

> Additionally, incorporating a bypass feature for problematic documents could help avoid getting stuck. (https://github.com/huggingface/datatrove/issues/277#issuecomment-2314771896) What do you think?

It's not entirely clear to me how this would work: some blocks rely on order implicitly (dedup, mostly), and you'd basically need to try/catch every single run() method for the others. If you have a possible idea for a working implementation that would generalize well, I'd love to hear it.

justHungryMan commented 1 week ago

When you mention that some blocks rely on order implicitly, I understand it to mean that there’s a dependency on the order within blocks, like with min-hash deduplication. My suggestion involves dropping documents at the document level within a block if an issue arises, but I’m not entirely sure which part of the datatrove code would need to be modified for this approach (or if the current architecture even supports such modifications).
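
For filter blocks specifically, I picture something along these lines (a sketch against BaseFilter as I understand it, not an existing datatrove feature, and it only covers catchable Python exceptions, so it would not have helped with the C++ abort above):

from datatrove.pipeline.filters.base_filter import BaseFilter


def skip_on_error(filter_block: BaseFilter) -> BaseFilter:
    # Wrap a filter so a per-document exception drops that document instead of killing the task.
    original_filter = filter_block.filter

    def safe_filter(doc):
        try:
            return original_filter(doc)
        except Exception as e:
            print(f"dropping document {doc.id}: {e!r}", flush=True)
            return False  # treat errors as "filtered out"

    filter_block.filter = safe_filter
    return filter_block

It could then be used when building the pipeline, e.g. skip_on_error(GopherQualityFilter(language=Languages.korean)).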

I will reach out to the kiwipiepy repository to discuss this issue further. From what I understand, kiwipiepy generally performs better with Korean text compared to spacy’s tokenizer.

guipenedo commented 1 week ago

Thank you for clarifying, will wait to hear back from the kiwipiepy maintainers then

justHungryMan commented 1 week ago

The error analysis has revealed that the memory issue occurs when processing spam texts that consist of over 26,000 characters without any spaces. This seems to be what triggers the problem.
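
In the meantime, a cheap pre-filter based on this finding can be placed before the tokenizer-dependent blocks (the 10,000-character cutoff below is an arbitrary, conservative choice, well under the ~26,000-character runs that trigger the crash):

MAX_SPACE_FREE_RUN = 10_000  # well below the ~26k-character space-free runs that crash the tokenizer


def has_no_giant_space_free_run(doc):
    # Drop spam documents containing an extremely long run of characters with no whitespace.
    return all(len(chunk) <= MAX_SPACE_FREE_RUN for chunk in doc.text.split())

This could be wired in as another LambdaFilter(filter_function=has_no_giant_space_free_run) ahead of the Korean tokenizer steps.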

The author has indicated that they are preparing a patch to resolve this issue. πŸ€—