huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

UnicodeDecodeError When Using Korean Tokenizer #277

Open justHungryMan opened 2 weeks ago

justHungryMan commented 2 weeks ago

I encountered a UnicodeDecodeError while using a Korean tokenizer integrated into our data processing pipeline. This issue seems to occur specifically when processing certain types of input data with the tokenizer, as detailed in the error log below:

UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 0-1: illegal encoding

This error is raised from the imap_unordered call of a multiprocessing pool, suggesting a problem with encoding handling during parallel processing of text data. Below is the relevant portion of the traceback:

│ /home/ubuntu/.local/lib/python3.12/site-packages/datatrove/executor/local.py:133 in run          │
│                                                                                                  │
│   130 │   │   │   completed_lock = mg.Lock()                                                     │
│   131 │   │   │   ctx = multiprocess.get_context(self.start_method)                              │
│   132 │   │   │   with ctx.Pool(self.workers) as pool:                                           │
│ ❱ 133 │   │   │   │   stats = list(                                                              │
│   134 │   │   │   │   │   pool.imap_unordered(                                                   │
│   135 │   │   │   │   │   │   partial(                                                           │
│   136 │   │   │   │   │   │   │   self._launch_run_for_rank,                                     │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ completed_counter = <ValueProxy object, typeid 'Value' at 0x7a7e8b103f50>                    │ │
│ │    completed_lock = <AcquirerProxy object, typeid 'Lock' at 0x7a7e8db923c0>                  │ │
│ │               ctx = <multiprocess.context.ForkServerContext object at 0x7a7e9f75d7f0>        │ │
│ │                 i = 31                                                                       │ │
│ │                mg = <multiprocess.managers.SyncManager object at 0x7a7e8db223c0>             │ │
│ │              pool = <multiprocess.pool.Pool state=TERMINATE pool_size=32>                    │ │
│ │           ranks_q = <AutoProxy[Queue] object, typeid 'Queue' at 0x7a7e905fc380>              │ │
│ │      ranks_to_run = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ... +1014]                                │ │
│ │              self = <datatrove.executor.local.LocalPipelineExecutor object at                │ │
│ │                     0x7a7e8dbacf80>                                                          │ │
│ │           skipped = 0                                                                        │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                  │
│ /home/ubuntu/.local/lib/python3.12/site-packages/multiprocess/pool.py:873 in next                │
│                                                                                                  │
│   870 │   │   success, value = item                                                              │
│   871 │   │   if success:                                                                        │
│   872 │   │   │   return value                                                                   │
│ ❱ 873 │   │   raise value                                                                        │
│   874 │                                                                                          │
│   875 │   __next__ = next                    # XXX                                               │
│   876                                                                                            │
│                                                                                                  │
│ ╭───────────────────────────────────────── locals ──────────────────────────────────────────╮    │
│ │    item = (False, UnicodeDecodeError('utf-16-le', b'\x00\xdc', 0, 2, 'illegal encoding')) │    │
│ │    self = <multiprocess.pool.IMapUnorderedIterator object at 0x7a7e8b175fa0>              │    │
│ │ success = False                                                                           │    │
│ │ timeout = None                                                                            │    │
│ │   value = UnicodeDecodeError('utf-16-le', b'\x00\xdc', 0, 2, 'illegal encoding')          │    │
│ ╰───────────────────────────────────────────────────────────────────────────────────────────╯    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 0-1: illegal encoding

To work around this temporarily, I implemented a pre-check filter (using a LambdaFilter) that tests whether the input is suitable for the tokenizer. This prevents the process from crashing but does not solve the underlying problem:

# Imports assumed to match datatrove's layout (adjust to your version)
from datatrove.utils.typeshelper import Languages
from datatrove.utils.word_tokenizers import load_word_tokenizer


def check_korean_tokenizer_pass(doc):
    # Load the Korean word tokenizer (currently backed by Kiwi)
    tokenizer = load_word_tokenizer(Languages.korean)
    try:
        # keep the document only if tokenization succeeds
        tokenizer.word_tokenize(doc.text)
        return True
    except Exception:
        # any tokenizer failure (e.g. the UnicodeDecodeError above) drops the doc
        return False
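
For reference, this is roughly how the pre-check can be wired in with datatrove's LambdaFilter, placed ahead of whatever block tokenizes the Korean text. The reader, input path and task count below are placeholders, not the actual pipeline from this report:

from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.readers import JsonlReader

executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/korean/"),                # placeholder input location
        LambdaFilter(check_korean_tokenizer_pass),  # drop docs the tokenizer rejects
        # ... the block that actually tokenizes Korean text goes here ...
    ],
    tasks=4,
)
executor.run()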

This issue seems to originate from the Kiwi library, which the Korean tokenizer uses. It affects not only my project but potentially other teams as well (it has already been reported by another internal team working with CommonCrawl).

hynky1999 commented 2 weeks ago

🤔 To me it seems like a problem with the tokenizer itself, as it can't handle arbitrary UTF-8, which I would expect it to do. If possible, I think this should be resolved in the tokenizer library itself.

However, in the meantime (or if they don't want to fix it), we could create a wrapper that handles this case; that seems like the cleanest choice to me.
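
Roughly, such a wrapper could look like the sketch below. Note that the b'\x00\xdc' in the traceback decodes to U+DC00, an unpaired low surrogate, so the offending documents likely contain lone surrogates that the tokenizer's UTF-16 conversion rejects; the sanitize-and-retry fallback here is an assumption, not a confirmed fix, and the class is not an existing datatrove API:

from datatrove.utils.typeshelper import Languages
from datatrove.utils.word_tokenizers import load_word_tokenizer


class SafeKoreanTokenizer:
    """Sketch: same word_tokenize interface as the snippet above, but
    sanitize and retry instead of letting the tokenizer crash the task."""

    def __init__(self):
        self._inner = load_word_tokenizer(Languages.korean)

    def word_tokenize(self, text: str) -> list[str]:
        try:
            return self._inner.word_tokenize(text)
        except UnicodeDecodeError:
            # Assumption: drop characters that cannot round-trip through
            # UTF-16 (e.g. unpaired surrogates) and retry once.
            cleaned = text.encode("utf-16", errors="ignore").decode("utf-16")
            try:
                return self._inner.word_tokenize(cleaned)
            except UnicodeDecodeError:
                return []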

justHungryMan commented 2 weeks ago

> 🤔 To me it seems like a problem with the tokenizer itself, as it can't handle arbitrary UTF-8, which I would expect it to do. If possible, I think this should be resolved in the tokenizer library itself.
>
> However, in the meantime (or if they don't want to fix it), we could create a wrapper that handles this case; that seems like the cleanest choice to me.

I completely agree. It seems possible to add an option that bypasses this issue through such a wrapper. However, rather than handling it at the tokenizing stage of the pipeline, shouldn't the ability to bypass errors while processing a doc live at the executor level? What do you think?

hynky1999 commented 2 weeks ago

By bypassing, do you mean silently ignoring the error and skipping the document?

justHungryMan commented 2 weeks ago

> By bypassing, do you mean silently ignoring the error and skipping the document?

Yes.
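
For illustration, the skipping could look something like this hypothetical block that wraps a per-document check; SkipOnError is not an existing datatrove feature, and the base-class details and import paths are my best guess at the internals:

from datatrove.data import DocumentsPipeline
from datatrove.pipeline.base import PipelineStep


class SkipOnError(PipelineStep):
    """Hypothetical block: run a per-document check and silently drop
    documents that raise, instead of crashing the whole rank."""

    name = "skip on error"

    def __init__(self, check):
        super().__init__()
        self.check = check

    def run(self, data: DocumentsPipeline, rank: int = 0, world_size: int = 1) -> DocumentsPipeline:
        for doc in data:
            try:
                self.check(doc)
            except Exception:
                continue  # skip the offending document
            yield doc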

guipenedo commented 1 week ago

As seen in #279, it seems this library might indeed not be stable enough. spaCy has a Korean tokenizer; would you be willing to look into whether it could be an alternative solution? If so, we can just switch the tokenizer we defined for Korean to the spaCy one.
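
A quick way to sanity-check that idea (a sketch only: it assumes a recent spaCy release where the Korean tokenizer can be created without the optional mecab-ko backend, and the second sample string with a lone surrogate is made up):

import spacy

nlp = spacy.blank("ko")  # spaCy's default Korean tokenizer
samples = [
    "한국어 토크나이저 테스트 문장입니다.",
    "정상 텍스트 뒤에 깨진 문자 \udcff 가 붙은 경우",  # made-up problematic input
]
for text in samples:
    try:
        print("ok:", [t.text for t in nlp(text)][:5])
    except Exception as e:
        print("failed:", type(e).__name__, e)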