uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Fix: The "median size too small" warning is too frequent #538

Closed: liangz1 closed this 4 years ago

liangz1 commented 4 years ago

Problem: The warning below shows up even when there is only one partition, and the file list it prints can be too long to read.

The median size (1333465) of these parquet files (file list...) is too small.Increase file sizes by repartition or coalesce spark dataframe, which will help improve performance.

Fix:

  1. Only print the warning when num_of_files > 1 and the median file size is below 50 MB.
  2. Remove the file list from the warning.
  3. Include the total size of the dataset and the threshold file size (50 MB) in the warning to help users decide how many partitions they need.
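
The three rules above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the names `MIN_MEDIAN_FILE_SIZE_BYTES`, `median_size_warning`, and `max_partitions_for` are hypothetical, the message wording is made up, and treating 50 MB as 50 * 1024 * 1024 bytes is an assumption.

```python
import statistics
from typing import Optional, Sequence

# Threshold from the PR description (50 MB). The constant name and the
# choice of 1024-based megabytes are assumptions made for this sketch.
MIN_MEDIAN_FILE_SIZE_BYTES = 50 * 1024 * 1024


def median_size_warning(file_sizes: Sequence[int],
                        threshold: int = MIN_MEDIAN_FILE_SIZE_BYTES) -> Optional[str]:
    """Return a warning message, or None when no warning is warranted."""
    # Rule 1: zero or one file never triggers the warning.
    if len(file_sizes) <= 1:
        return None
    median_size = statistics.median(file_sizes)
    if median_size >= threshold:
        return None
    total_size = sum(file_sizes)
    # Rules 2 and 3: omit the file list; report total size and threshold.
    return (
        "The median size {} B (< {} B) of the parquet files is too small. "
        "Total size: {} B. Increase the file sizes by repartitioning or "
        "coalescing the Spark DataFrame.".format(median_size, threshold, total_size)
    )


def max_partitions_for(total_size: int,
                       threshold: int = MIN_MEDIAN_FILE_SIZE_BYTES) -> int:
    """Largest partition count that keeps the average file size >= threshold."""
    # Assumes partitions come out roughly equal in size.
    return max(1, total_size // threshold)
```

With the total size in hand, a user could call something like `max_partitions_for(total_size)` to pick an upper bound for `df.repartition(n)` or `df.coalesce(n)`. The bound only holds if the resulting files are roughly equal in size.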

Test: Added a unit test in test_check_dataset_file_median_size.

codecov[bot] commented 4 years ago

Codecov Report

Merging #538 into master will increase coverage by 0.00%. The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #538   +/-   ##
=======================================
  Coverage   86.52%   86.52%           
=======================================
  Files          85       85           
  Lines        4697     4699    +2     
  Branches      739      740    +1     
=======================================
+ Hits         4064     4066    +2     
  Misses        515      515           
  Partials      118      118           

| Impacted Files | Coverage Δ |
|---|---|
| petastorm/spark/spark_dataset_converter.py | 92.45% <100.00%> (+0.05%) |
