uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

parquet.enable.summary-metadata and add_metadata are deprecated #499

Open filipski opened 4 years ago

filipski commented 4 years ago

I get the following two deprecation warnings when running code for ingest_folder2() from https://github.com/uber/petastorm/issues/497#issuecomment-594078434, which is a simplified version of https://github.com/uber/petastorm/blob/master/examples/imagenet/generate_petastorm_imagenet.py

2020-03-03 18:42:47,276 WARN hadoop.ParquetOutputFormat: Setting parquet.enable.summary-metadata is deprecated, please use parquet.summary.metadata.level
/home/user/miniconda3/envs/petastorm/lib/python3.6/site-packages/petastorm/etl/dataset_metadata.py:192: FutureWarning: The 'add_metadata' method is deprecated, use 'with_metadata' instead
  utils.add_to_dataset_metadata(dataset, UNISCHEMA_KEY, serialized_schema)

My environment: petastorm==0.8.2 pyspark=2.4.4=py_0

Should these be fixed in an upcoming release?
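
For reference, the first warning names `parquet.summary.metadata.level` as the replacement key. Below is a minimal sketch of how that key could be set from user code, assuming the old boolean maps to the levels `ALL` and `NONE` and that the `spark.hadoop.` prefix is used to forward the setting to the Hadoop configuration. The helper names `summary_metadata_conf` and `build_session` are hypothetical, and the warning may persist if the deprecated key is set elsewhere (for example inside a library).

```python
def summary_metadata_conf(enable_summary):
    """Build Spark config entries that use the non-deprecated Parquet key.

    Assumption: enable_summary=True corresponds to level "ALL" and
    enable_summary=False to level "NONE".
    """
    level = "ALL" if enable_summary else "NONE"
    return {"spark.hadoop.parquet.summary.metadata.level": level}


def build_session(enable_summary=True):
    """Create a local SparkSession with the setting applied (hypothetical helper)."""
    # Imported here so the pure-Python helper above works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master("local[1]").appName("summary-metadata-sketch")
    for key, value in summary_metadata_conf(enable_summary).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_session()` requires a working PySpark installation; `summary_metadata_conf()` on its own only builds the dictionary of settings.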

filipski commented 4 years ago

The second warning, at least, should be simple to fix, judging by: https://github.com/apache/arrow/blob/23cff432561cc1e4e723a09c36e0fc1295be5bbb/python/pyarrow/types.pxi#L1253
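
A rough sketch of what the change could look like on the caller's side, not the actual petastorm patch: `with_metadata()` sets the schema's metadata to the mapping it is given, so this example merges the existing entries in explicitly. The helper name `merge_schema_metadata`, the `demo` function, and the example keys are all hypothetical.

```python
def merge_schema_metadata(schema, extra):
    """Return a new schema whose metadata is the existing metadata plus `extra`.

    Works with any object exposing `.metadata` (a dict or None) and
    `.with_metadata(dict)`, such as a pyarrow.Schema. Use bytes keys,
    since pyarrow reports existing metadata keys as bytes.
    """
    merged = dict(schema.metadata or {})  # metadata is None when nothing is set
    merged.update(extra)
    return schema.with_metadata(merged)


def demo():
    """Show the helper on a small pyarrow schema (requires pyarrow)."""
    # Imported here so the helper above has no hard dependency on pyarrow.
    import pyarrow as pa

    schema = pa.schema([pa.field("id", pa.int64())], metadata={b"origin": b"example"})
    updated = merge_schema_metadata(schema, {b"example.key": b"serialized-schema-bytes"})
    print(updated.metadata)
```

Replacing a call to `add_metadata(...)` with a helper along these lines avoids the `FutureWarning` while keeping any metadata that was already attached to the schema.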

selitvin commented 4 years ago

Thank you for pointing these out. Added #528 and #529 to address these warnings.