uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Make petastorm reader support dataset url list. #503

Closed · WeichenXu123 closed this 4 years ago

WeichenXu123 commented 4 years ago

Currently, make_batch_reader and make_reader only accept a directory as the dataset URL, but sometimes we need to specify a list of Parquet files as input.
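A minimal sketch of the intended call after this change (the paths here are placeholders, not the files from the test below):

from petastorm import make_batch_reader

# With this change, make_batch_reader accepts a list of Parquet file URLs
# in addition to a single dataset directory URL.
file_urls = [
    'file:///tmp/dataset/part-00000.parquet',  # hypothetical paths
    'file:///tmp/dataset/part-00001.parquet',
]
with make_batch_reader(file_urls) as reader:
    for batch in reader:
        print(batch)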

Test

Unit tests added.

End-to-end test code:


import os

import numpy as np
from pyspark.sql.functions import pandas_udf

from petastorm import make_batch_reader

# Assumes `spark` is an existing SparkSession (e.g. in a pyspark shell).

@pandas_udf('array<float>')
def gen_array(v):
    # Generate a random length-10 float vector per row.
    return v.map(lambda x: np.random.rand(10))

df1 = spark.range(20).repartition(2).withColumn('v', gen_array('id'))

data_url = 'file:///tmp/t0001'
data_path = '/tmp/t0001'
# Write a small uncompressed Parquet dataset with multiple part files.
df1.repartition(2).write.mode('overwrite') \
    .option("compression", "uncompressed") \
    .option("parquet.block.size", 1024 * 1024) \
    .parquet(data_url)

def get_pq_url_list(dir_path):
    # Collect the individual Parquet part-file URLs under the dataset directory.
    lst = []
    for f in os.listdir(dir_path):
        if f.endswith('.parquet'):
            lst.append('file://' + os.path.join(dir_path, f))
    return lst

url_list = get_pq_url_list(data_path)
print(url_list)

# Pass the explicit file URL list (rather than the dataset directory URL) to make_batch_reader.
reader = make_batch_reader(url_list)
for i in reader:
    print(str(i))
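
As a usage note (not part of the PR's test): a reader backed by a URL list should plug into the existing framework adapters the same way as a directory-backed reader. A minimal sketch, assuming TensorFlow is installed and reusing url_list from the snippet above:

from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# Wrap the URL-list-backed reader in a tf.data.Dataset, exactly as with a directory URL.
with make_batch_reader(url_list) as reader:
    dataset = make_petastorm_dataset(reader)
    for batch in dataset.take(2):
        print(batch)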
codecov[bot] commented 4 years ago

Codecov Report

Merging #503 into master will increase coverage by 0.10%. The diff coverage is 96.00%.


@@            Coverage Diff             @@
##           master     #503      +/-   ##
==========================================
+ Coverage   85.98%   86.09%   +0.10%     
==========================================
  Files          81       81              
  Lines        4311     4358      +47     
  Branches      674      694      +20     
==========================================
+ Hits         3707     3752      +45     
- Misses        499      500       +1     
- Partials      105      106       +1     
Impacted Files                       Coverage Δ
petastorm/reader.py                  90.99% <90.90%> (+0.17%) ↑
petastorm/arrow_reader_worker.py     92.00% <100.00%> (ø)
petastorm/etl/dataset_metadata.py    88.88% <100.00%> (ø)
petastorm/fs_utils.py                91.01% <100.00%> (+2.27%) ↑


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.

WeichenXu123 commented 4 years ago

@selitvin The make_reader test failed. make_reader needs more changes to support reading a URL list (when ParquetDataset is given a file list, it cannot read the metadata_path file...). But supporting only make_batch_reader is OK, because the Spark DL converter is implemented via make_batch_reader.
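
For context, the pyarrow behavior behind this limitation, sketched here assuming the legacy ParquetDataset API and hypothetical part-file names:

import pyarrow.parquet as pq

# Given the dataset directory, ParquetDataset can discover the summary
# metadata files (_metadata / _common_metadata) that sit next to the data files.
dataset_from_dir = pq.ParquetDataset('/tmp/t0001')

# Given an explicit list of data files there is no directory to scan, so those
# summary metadata files are not picked up, which is what breaks make_reader.
dataset_from_files = pq.ParquetDataset([
    '/tmp/t0001/part-00000-aaaa.parquet',  # hypothetical file names
    '/tmp/t0001/part-00001-bbbb.parquet',
])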

selitvin commented 4 years ago

Having support for a dataset URL list only in make_batch_reader is OK with me if adding it to make_reader is too much work.

WeichenXu123 commented 4 years ago

@selitvin Ready.