uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

[ML-10366] Fix bug: cleanup metadata in converter.delete() #534

Status: Closed (WeichenXu123 closed this 4 years ago)

WeichenXu123 commented 4 years ago

When a user wants to manually delete the cache files and create a new cache, they run:

converter.delete()
converter = make_spark_converter(df)

The second call fails with this error:

org.apache.spark.sql.AnalysisException: Path does not exist: file:/dbfs/ml/tmp/petastorm/QA/keras/20200407043623-appid-app-20200406162107-0000-d7d7cdfd-3739-48d0-9dc1-bffba08fff12;

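Fix: SparkDatasetConverter.delete() should clean up the cache metadata in addition to deleting the data files. Otherwise the stale metadata still points at the deleted path, which is what the error above reports.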

codecov[bot] commented 4 years ago

Codecov Report

Merging #534 into master will decrease coverage by 0.00%. The diff coverage is 75.00%.


@@            Coverage Diff             @@
##           master     #534      +/-   ##
==========================================
- Coverage   86.53%   86.52%   -0.01%     
==========================================
  Files          85       85              
  Lines        4692     4697       +5     
  Branches      737      739       +2     
==========================================
+ Hits         4060     4064       +4     
  Misses        515      515              
- Partials      117      118       +1     
| Impacted Files | Coverage Δ |
|---|---|
| petastorm/spark/spark_dataset_converter.py | 92.39% <75.00%> (-0.25%) ↓ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update fbfad23...b85bd68.