uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Fix bug: respect dynamically changed parent cache dir conf #537

Closed liangz1 closed 4 years ago

liangz1 commented 4 years ago

Bug: If we execute the following code:

spark.conf.set("petastorm.spark.converter.parentCacheDirUrl", "file:///url1")
converter1 = make_spark_converter(df)   # This works fine.
# Change conf
spark.conf.set("petastorm.spark.converter.parentCacheDirUrl", "file:///url2")
converter2 = make_spark_converter(df)

The last line hits the cache and reports: "The median size (1333465) of these parquet files (file:/url1/file_abc.parquet) is too small. Increase file sizes by repartition or coalesce spark dataframe, which will help improve performance." This indicates that the new parent cache dir conf is not respected: converter2 still uses the .../url1 dir.

Fix: We respect the conf change by adding an equality test against the parent cache dir.

codecov[bot] commented 4 years ago

Codecov Report

Merging #537 into master will increase coverage by 0.01%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #537      +/-   ##
==========================================
+ Coverage   86.52%   86.53%   +0.01%     
==========================================
  Files          85       85              
  Lines        4713     4717       +4     
  Branches      743      743              
==========================================
+ Hits         4078     4082       +4     
  Misses        516      516              
  Partials      119      119              
Impacted Files                               Coverage Δ
petastorm/fs_utils.py                        91.75% <100.00%> (+0.74%)
petastorm/reader.py                          90.73% <100.00%> (-0.27%)
petastorm/spark/spark_dataset_converter.py   92.50% <100.00%> (+0.05%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update c7b8475...6ffdb5e.

WeichenXu123 commented 4 years ago

Nit: normalize_dataset_url is applied to a parent dir here, which makes the code confusing. Rename normalize_dataset_url to normalize_dir_url and move it into petastorm.fs_utils.
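
For context, a helper of this kind might look like the sketch below. The body is an assumption for illustration (validate the type, then drop a single trailing slash so that two spellings of the same dir compare equal); it is not copied from the petastorm source.

```python
def normalize_dir_url(dir_url):
    """Sketch of a dir-URL normalizer; the behavior here is an assumption.

    Works for any directory URL, not only dataset URLs, which is why the
    more general name reads better when it is applied to a parent cache dir.
    """
    if not isinstance(dir_url, str):
        raise ValueError("directory url must be a string")
    # Remove one trailing slash so "file:///url1/" and "file:///url1" match.
    return dir_url[:-1] if dir_url.endswith("/") else dir_url
```

Normalizing both the configured parent dir and the stored one before comparing them keeps the equality test from failing on a trailing slash alone.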