oscar-project / ungoliant

:spider: The pipeline for the OSCAR corpus
https://oscar-corpus.com
Apache License 2.0

[BUG] corrupt deflate stream #131

Open kargaranamir opened 7 months ago

kargaranamir commented 7 months ago

Describe the Bug

When running the Ungoliant pipeline, everything proceeds smoothly at first, and the JSONL files for each language are built. After a couple of hours, however, the error below appears in the logs, and from then on it is the only thing logged. Why does this occur, and can it be resolved by skipping the problematic inputs?

[2024-03-27T23:49:00Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Custom { kind: InvalidInput, error: "corrupt deflate stream" })

To Reproduce

Nothing specific: just the usual routine of downloading the data and running the pipeline.

Expected Behavior

The pipeline should either keep working as it did earlier or skip the corrupt inputs.
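
A "corrupt deflate stream" error usually points to a damaged or truncated compressed input, so one way to narrow this down is to test the downloaded files independently of the pipeline. The following is only a minimal sketch, assuming the inputs are gzip-compressed shards sitting in one directory; the function names and the `*.gz` pattern are illustrative and not part of Ungoliant.

```python
import gzip
import zlib
from pathlib import Path


def is_readable_gzip(path, chunk_size=1 << 20):
    """Return True if the whole gzip file decompresses without error."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read to the end so truncation or corruption anywhere is detected.
            while fh.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile (bad header or CRC mismatch),
        # EOFError covers truncated files, zlib.error covers broken deflate data.
        return False


def find_corrupt_shards(directory, pattern="*.gz"):
    """List files matching `pattern` in `directory` that fail to decompress."""
    return sorted(p for p in Path(directory).glob(pattern) if not is_readable_gzip(p))
```

Calling `find_corrupt_shards` on the download directory would list the files that fail to decompress, which could then be downloaded again or moved aside before rerunning the pipeline.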

Screenshots

at first:

Screenshot 2024-03-28 at 12 56 51 AM

later:

Screenshot 2024-03-28 at 12 55 51 AM

Desktop (Please Complete the Following Information):

uname -a
Linux delta 5.14.21-150500.55.36-default #1 SMP PREEMPT_DYNAMIC Tue Oct 31 08:37:43 UTC 2023 (e7a2e23) x86_64 x86_64 x86_64 GNU/Linux