oscar-project / ungoliant

:spider: The pipeline for the OSCAR corpus
https://oscar-corpus.com
Apache License 2.0

[BUG] UnexpectedEof While running Ungoliant Pipeline #130

Closed nattkorat closed 9 months ago

nattkorat commented 9 months ago

I tried to run the pipeline to extract languages from Common Crawl WET files that were already downloaded (only 25 files), and it failed with `UnexpectedEof` errors.

Steps to reproduce the behavior:

  1. Saved the CC index to a paths file, `cc-index.paths`.
  2. Ran the download: `ungoliant download -t 10 <paths> <dst>`
  3. Ran the pipeline: `ungoliant pipeline --lid-path <model path> <wet dir> <dst>`
  4. Saw the following error:
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))
[2024-02-20T08:03:42Z ERROR ungoliant::pipelines::oscardoc::pipeline] ReadData(Kind(UnexpectedEof))

After checking every file, I found that the one causing the error could not be unzipped.
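For anyone who needs to run the same check, here is a minimal sketch of how the unreadable shard can be located. It assumes the downloaded shards are gzip files with a `.gz` suffix sitting directly in the WET directory. The function name `find_corrupt_gzip` is hypothetical and not part of Ungoliant.

```python
import gzip
import zlib
from pathlib import Path


def find_corrupt_gzip(wet_dir):
    """Return the .gz files in wet_dir that cannot be fully decompressed.

    Hypothetical helper, not part of Ungoliant. Assumes shards end in .gz.
    """
    corrupt = []
    for path in sorted(Path(wet_dir).glob("*.gz")):
        try:
            with gzip.open(path, "rb") as fh:
                # Read in 1 MiB chunks until EOF so memory use stays flat;
                # a truncated shard raises EOFError somewhere in this loop.
                while fh.read(1024 * 1024):
                    pass
        except (EOFError, OSError, zlib.error):
            # EOFError: truncated stream; OSError: bad gzip header or I/O
            # failure; zlib.error: corrupted compressed data.
            corrupt.append(path)
    return corrupt
```

Calling `find_corrupt_gzip("<wet dir>")` returns a list of paths. An empty list means every shard decompressed cleanly.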


Uinelj commented 9 months ago

Did you try to re-download the corrupted shard? Also: what happens if you run the pipeline without including the corrupted shard?

nattkorat commented 9 months ago

I will try to re-download the corrupted file again. Without that corrupted file, it works just fine!
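As a stopgap, running without the corrupted file can be done by moving it out of the WET directory first. The sketch below assumes the list of bad shards is already known and that the quarantine directory lies outside the directory passed to the pipeline. The name `quarantine` is hypothetical.

```python
import shutil
from pathlib import Path


def quarantine(paths, quarantine_dir):
    """Move each file in paths into quarantine_dir; return the new locations.

    Hypothetical helper, not part of Ungoliant.
    """
    target = Path(quarantine_dir)
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in paths:
        dest = target / Path(path).name
        # Moving rather than deleting keeps the shard available for inspection.
        shutil.move(str(path), str(dest))
        moved.append(dest)
    return moved
```

Once the shard has been re-downloaded, the quarantined copy can be deleted.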

nattkorat commented 9 months ago

After re-downloading the corrupted shard, the issue went away. We still need to handle the infinite loading that occurs when the pipeline encounters a corrupted file.