Open dan-the-meme-man opened 2 years ago
Hi @dan-the-meme-man, thanks for reporting.
We are investigating a similar issue but with Beam+Dataflow (instead of Beam+Flink):
To dig into the root cause, we need as much information as possible: logs from both the main process and the workers are especially useful.
In the case of the issue with Beam+Dataflow, the logs from the workers report an out of memory issue.
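For the main-process side, one possible way to keep the logs is to attach a file handler to Python's root logger before starting the download. This is only a sketch using the standard logging module: the helper name capture_logs_to_file and the file name are hypothetical, it does not collect worker logs, and libraries whose loggers do not propagate to the root logger would need a handler attached to their own logger.

```python
import logging


def capture_logs_to_file(path, level=logging.INFO):
    """Attach a file handler to the root logger and return it.

    Hypothetical helper: only records loggers in this Python process
    that propagate to the root logger.
    """
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    root = logging.getLogger()
    root.setLevel(level)
    root.addHandler(handler)
    return handler


if __name__ == "__main__":
    # Call this before the long-running download, then attach the file to the issue.
    capture_logs_to_file("main_process.log")
    logging.getLogger(__name__).info("logging to main_process.log")
```

The resulting file can then be attached to the issue alongside whatever the runner itself writes for the workers.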
As I continued working on this today, I came to suspect that it is in fact an out of memory issue - I have a few more notebooks that I've left running, and if they produce the same error, I will try to get the logs. In the meantime, if there's any chance that there is a repo out there with those three languages already as .arrow files, or if you know about how much memory would be needed to actually download those sets, please let me know!
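A rough way to check the out-of-memory suspicion from inside a notebook is to record peak resident memory after a run. The sketch below uses the standard resource module (Unix only, which includes Colab). The helper name peak_rss_mib is hypothetical, and the numbers cover only this Python process and child processes that have already exited and been waited for, so memory used by separately managed workers would not show up here.

```python
import resource
import sys


def peak_rss_mib():
    """Return (own, children) peak resident set size in MiB.

    Hypothetical helper: 'children' only covers terminated,
    waited-for child processes.
    """
    # ru_maxrss is reported in kibibytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    own = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    children = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return own / divisor, children / divisor


if __name__ == "__main__":
    own, children = peak_rss_mib()
    print(f"peak RSS: self={own:.1f} MiB, children={children:.1f} MiB")
```

Comparing these figures with the memory available to the runtime gives a first indication of whether the process is close to the limit when the exception occurs.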
Describe the bug
When downloading some wikipedia languages (in particular, I'm having a hard time with Spanish, Cebuano, and Russian) via FlinkRunner, I encounter the exception in the title. I have been playing with package versions a lot, because unfortunately, the different dependencies required by these packages seem to be incompatible in terms of versions (dill and requests, for instance). It should be noted that the following code runs for several hours without issue, executing the load_dataset() function, before the exception occurs.
Steps to reproduce the bug
Expected results
Although this code generally produces some warnings (it is run in a Colab notebook), most languages I've tried have downloaded successfully. It should simply go through without issue, but for these languages, I am continually encountering this error.
Actual results
Traceback below:
Environment info
datasets version: 2.3.2