uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

[ML-9743] Address S3 eventual consistency issue on S3-like filesystems #514

Closed WeichenXu123 closed 4 years ago

WeichenXu123 commented 4 years ago

For some filesystems, such as AWS S3, there is an eventual consistency issue: after we materialize a DataFrame as Parquet files, the files may not be immediately available when the petastorm reader opens them. We need to wait up to 30 seconds to make sure all files are available. This is caused by the eventual consistency of the S3 list operation.
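
To illustrate the idea of waiting for files with a bounded timeout, here is a minimal sketch. It is not the code from this PR: `wait_for_files` and all of its parameters are hypothetical names chosen for illustration, and the default `os.path.exists` check only makes sense for local or FUSE-mounted paths. For a real S3-like filesystem, a filesystem-specific existence check would have to be passed in as `exists`.

```python
import os
import time


def wait_for_files(paths, exists=os.path.exists, timeout_s=30.0,
                   poll_interval_s=0.1, clock=time.monotonic, sleep=time.sleep):
    """Block until every path is visible, or raise after timeout_s seconds.

    Hypothetical helper: `exists` is any callable taking a path and
    returning True once the file is visible to the caller.
    """
    deadline = clock() + timeout_s
    pending = set(paths)
    while True:
        # Re-check only the files that have not been seen yet.
        pending = {p for p in pending if not exists(p)}
        if not pending:
            return
        if clock() >= deadline:
            raise RuntimeError(
                "Timed out after {}s waiting for: {}".format(
                    timeout_s, sorted(pending)))
        sleep(poll_interval_s)
```

A caller would list the Parquet files it has just written and call `wait_for_files(file_paths)` before constructing the reader. Polling with a deadline returns as soon as the files appear, so the 30 seconds is a worst case and not a fixed delay.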

codecov[bot] commented 4 years ago

Codecov Report

Merging #514 into master will increase coverage by 0.04%. The diff coverage is 95.91%.


@@            Coverage Diff             @@
##           master     #514      +/-   ##
==========================================
+ Coverage   86.17%   86.22%   +0.04%     
==========================================
  Files          81       81              
  Lines        4421     4451      +30     
  Branches      704      708       +4     
==========================================
+ Hits         3810     3838      +28     
- Misses        502      503       +1     
- Partials      109      110       +1     
Impacted Files Coverage Δ
petastorm/spark/spark_dataset_converter.py 92.82% <95.91%> (+0.08%) ↑


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 6b4ce00...0bfa194.