Closed: sonNeturo closed this issue 4 years ago
Based on your call stack, the failure occurs inside the make_batch_reader call, i.e. before you start iterating over the data. The crash happens when pyarrow tries to open one of the Parquet files in the dataset. This is done on a thread pool, i.e. multiple files are being opened in parallel. The immediate suspects are thread-safety issues in pyarrow or in the underlying HDFS driver (assuming you are using HDFS).
I'd try the following:
1. Pass the hdfs_driver='libhdfs' argument to make_batch_reader in order to use the official Hadoop HDFS driver (petastorm defaults to libhdfs3); a sketch follows below.
2. Use thread_pool = futures.ThreadPoolExecutor(1) to reduce the amount of parallelism.
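As a minimal sketch of the first suggestion, assuming a petastorm version whose make_batch_reader accepts the hdfs_driver keyword argument (the dataset URL below is a placeholder, not from this thread):

```python
from petastorm import make_batch_reader

# Sketch only: open the dataset with the official Hadoop HDFS driver
# instead of the default libhdfs3. The URL is a placeholder.
with make_batch_reader('hdfs://namenode:8020/path/to/parquet_dataset',
                       hdfs_driver='libhdfs') as reader:
    for batch in reader:
        pass  # consume batches as usual
```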
Thanks @selitvin for your answer. Should I also pass workers_count=1 to make_batch_reader to reduce parallelism? Does petastorm read from a single file in multiple threads? Trying to understand where the race condition might be.
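For illustration, a sketch of what passing workers_count=1 would look like; workers_count is an existing make_batch_reader argument, though whether it also covers the threads that open the files during dataset discovery is exactly what the question above asks (the URL is again a placeholder):

```python
from petastorm import make_batch_reader

# Sketch only: shrink the reader worker pool to a single worker while
# debugging a suspected thread-safety issue. The URL is a placeholder.
with make_batch_reader('hdfs://namenode:8020/path/to/parquet_dataset',
                       workers_count=1) as reader:
    for batch in reader:
        pass
```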
@sonNeturo could you please provide more information:
I ran the MNIST sample multiple times, and everything works fine. Could you try running the following and see if the issue can be reproduced for this dataset:
python3 -m examples.mnist.pytorch_example --dataset-url=gs://alekseyv-scalableai-dev/petastorm_mnist
or
python3 -m examples.mnist.tf_example --dataset-url=gs://alekseyv-scalableai-dev/petastorm_mnist
I haven't had the issue for more than a week now, for a job that runs 3 times a day... I haven't changed anything in my code. I'm closing the issue for now, unless I re-encounter the problem and have more details to share. To answer @vlasenkoalexey:
When training a model, it randomly fails with the following error message:
The error happens randomly. Sometimes an epoch finishes without an issue, but the next one fails. My code is pretty simple:
Could this error be due to the distributed setup? I'm using a Dataproc cluster with 20 workers and a GPU.
versions:
Thanks for the help.