liangz1 commented 4 years ago

Bug: If we execute the following code:

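from petastorm.spark import make_spark_converter

# `spark` is an active SparkSession and `df` is a Spark DataFrame.
spark.conf.set("petastorm.spark.converter.parentCacheDirUrl", "file:///url1")
converter1 = make_spark_converter(df)   # This works fine.
# Change the conf to a different parent cache dir.
spark.conf.set("petastorm.spark.converter.parentCacheDirUrl", "file:///url2")
converter2 = make_spark_converter(df)   # Expected to cache under url2.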

The last line hits the cache instead of materializing the dataframe again, and logs: "The median size (1333465) of these parquet files (file:/url1/file_abc.parquet) is too small.Increase file sizes by repartition or coalesce spark dataframe, which will help improve performance." The file:/url1/... path in the message shows that the new parent cache dir conf is not respected: converter2 still points at the data cached under the .../url1 dir.

Fix: Respect the conf change by adding an equality test against the parent cache dir, so a cached converter is reused only when its parent cache dir matches the current conf value.

codecov[bot] commented 4 years ago

Codecov Report

Merging #537 into master will increase coverage by 0.01%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #537      +/-   ##
==========================================
+ Coverage   86.52%   86.53%   +0.01%     
==========================================
  Files          85       85              
  Lines        4713     4717       +4     
  Branches      743      743              
==========================================
+ Hits         4078     4082       +4     
  Misses        516      516              
  Partials      119      119

Impacted Files                               Coverage Δ
petastorm/fs_utils.py                        91.75% <100.00%> (+0.74%)
petastorm/reader.py                          90.73% <100.00%> (-0.27%)
petastorm/spark/spark_dataset_converter.py   92.50% <100.00%> (+0.05%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update c7b8475...6ffdb5e.

WeichenXu123 commented 4 years ago

Nit: normalize_dataset_url is applied to a parent dir here, which makes the code confusing, so rename normalize_dataset_url to normalize_dir_url and move it into the petastorm.fs_utils module.
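For context on why a directory-level normalizer matters to the equality test, here is a rough sketch. The thread does not say what the normalization does, so this assumes it means dropping one trailing slash; `normalize_dir_url_sketch` is a hypothetical name, not the function in petastorm.fs_utils.

```python
def normalize_dir_url_sketch(dir_url: str) -> str:
    """Hypothetical normalizer: strip a single trailing slash from a dir URL.

    Assumption: "normalizing" only means removing one trailing '/', so that
    URLs differing only by that slash compare equal.
    """
    if not isinstance(dir_url, str) or not dir_url:
        raise ValueError("directory url must be a non-empty string")
    return dir_url[:-1] if dir_url.endswith("/") else dir_url
```

Under that assumption, "file:///url1/" and "file:///url1" normalize to the same string, so comparing normalized parent dirs would not cause a spurious cache miss.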

uber / petastorm

Fix bug: respect dynamically changed parent cache dir conf #537
