uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Make petastorm reader support dataset url list. #503

Closed · WeichenXu123 closed this 4 years ago

WeichenXu123 commented 4 years ago

Currently, make_batch_reader and make_reader only accept a directory as the dataset URL, but sometimes we need to specify a list of Parquet files as input.
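A minimal sketch of the intended call after this change (the paths here are placeholders, not the files from the test below):

from petastorm import make_batch_reader

# With this change, make_batch_reader accepts a list of Parquet file URLs
# in addition to a single dataset directory URL.
file_urls = [
    'file:///tmp/dataset/part-00000.parquet',  # hypothetical paths
    'file:///tmp/dataset/part-00001.parquet',
]
with make_batch_reader(file_urls) as reader:
    for batch in reader:
        print(batch)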

Test

Unit tests added.

End-to-end test code:


import os

import numpy as np
from pyspark.sql.functions import pandas_udf

from petastorm import make_batch_reader

# Assumes `spark` is an existing SparkSession (e.g. in a pyspark shell).

@pandas_udf('array<float>')
def gen_array(v):
    # Generate a random length-10 float vector per row.
    return v.map(lambda x: np.random.rand(10))

df1 = spark.range(20).repartition(2).withColumn('v', gen_array('id'))

data_url = 'file:///tmp/t0001'
data_path = '/tmp/t0001'
# Write a small uncompressed Parquet dataset with multiple part files.
df1.repartition(2).write.mode('overwrite') \
    .option("compression", "uncompressed") \
    .option("parquet.block.size", 1024 * 1024) \
    .parquet(data_url)

def get_pq_url_list(dir_path):
    # Collect the individual Parquet part-file URLs under the dataset directory.
    lst = []
    for f in os.listdir(dir_path):
        if f.endswith('.parquet'):
            lst.append('file://' + os.path.join(dir_path, f))
    return lst

url_list = get_pq_url_list(data_path)
print(url_list)

# Pass the explicit file URL list (rather than the dataset directory URL) to make_batch_reader.
reader = make_batch_reader(url_list)
for i in reader:
    print(str(i))
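
As a usage note (not part of the PR's test): a reader backed by a URL list should plug into the existing framework adapters the same way as a directory-backed reader. A minimal sketch, assuming TensorFlow is installed and reusing url_list from the snippet above:

from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# Wrap the URL-list-backed reader in a tf.data.Dataset, exactly as with a directory URL.
with make_batch_reader(url_list) as reader:
    dataset = make_petastorm_dataset(reader)
    for batch in dataset.take(2):
        print(batch)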
codecov[bot] commented 4 years ago

Codecov Report

Merging #503 into master will increase coverage by 0.10%. The diff coverage is 96.00%.


@@            Coverage Diff             @@
##           master     #503      +/-   ##
==========================================
+ Coverage   85.98%   86.09%   +0.10%     
==========================================
  Files          81       81              
  Lines        4311     4358      +47     
  Branches      674      694      +20     
==========================================
+ Hits         3707     3752      +45     
- Misses        499      500       +1     
- Partials      105      106       +1     
Impacted Files                       Coverage Δ
petastorm/reader.py                  90.99% <90.90%> (+0.17%) ↑
petastorm/arrow_reader_worker.py     92.00% <100.00%> (ø)
petastorm/etl/dataset_metadata.py    88.88% <100.00%> (ø)
petastorm/fs_utils.py                91.01% <100.00%> (+2.27%) ↑


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.

WeichenXu123 commented 4 years ago

@selitvin The make_reader test failed. make_reader needs more changes to support reading a URL list (when ParquetDataset is given a file list, it cannot read the metadata_path file...). But supporting only make_batch_reader is OK, because the Spark DL converter is implemented via make_batch_reader.
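
For context, the pyarrow behavior behind this limitation, sketched here assuming the legacy ParquetDataset API and hypothetical part-file names:

import pyarrow.parquet as pq

# Given the dataset directory, ParquetDataset can discover the summary
# metadata files (_metadata / _common_metadata) that sit next to the data files.
dataset_from_dir = pq.ParquetDataset('/tmp/t0001')

# Given an explicit list of data files there is no directory to scan, so those
# summary metadata files are not picked up, which is what breaks make_reader.
dataset_from_files = pq.ParquetDataset([
    '/tmp/t0001/part-00000-aaaa.parquet',  # hypothetical file names
    '/tmp/t0001/part-00001-bbbb.parquet',
])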

selitvin commented 4 years ago

Having support for a dataset URL list only in make_batch_reader is OK with me if adding it to make_reader is too much work.

WeichenXu123 commented 4 years ago

@selitvin Ready.