uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Integration with Apache Hudi #623

Closed LuisMoralesAlonso closed 3 years ago

LuisMoralesAlonso commented 3 years ago

Hi all,

Is this kind of integration on the roadmap? It would make it possible to use the Python ecosystem directly on top of Apache Hudi (or alternatives such as Iceberg or Delta Lake) through Petastorm.

Looking forward to your comments, Luis

selitvin commented 3 years ago

We don't have anything special planned for Hudi/Iceberg/Delta Lake. The scope of Petastorm is to support direct reading from Parquet stores into DL frameworks such as PyTorch/TensorFlow. I'm not sure about Hudi, but if it exposes paths to Parquet stores, you should still be able to use Petastorm.
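
For illustration, here is a rough sketch of the "paths to Parquet stores" idea, not an official integration. It assumes the table sits on a local POSIX filesystem, that its data files end in `.parquet`, and that table metadata lives in dot-prefixed directories (such as Hudi's `.hoodie` folder). The helper names `list_parquet_urls` and `iterate_batches` are hypothetical. The only Petastorm call used is `make_batch_reader`, which accepts either a dataset URL or a list of Parquet file URLs.

```python
from pathlib import Path


def list_parquet_urls(table_root):
    """Return sorted file:// URLs for the *.parquet files under table_root.

    Anything inside a dot-prefixed directory (for example a .hoodie
    metadata folder) is skipped, as are dot-prefixed files.
    Assumes a local POSIX filesystem (hypothetical helper).
    """
    root = Path(table_root).resolve()
    urls = []
    for path in sorted(root.rglob("*.parquet")):
        relative_parts = path.relative_to(root).parts
        # Skip metadata: any path component that starts with a dot.
        if any(part.startswith(".") for part in relative_parts):
            continue
        if path.is_file():
            urls.append("file://" + path.as_posix())
    return urls


def iterate_batches(table_root):
    """Yield Petastorm batches from the discovered Parquet files."""
    # Imported lazily so the path helper works without Petastorm installed.
    from petastorm import make_batch_reader

    urls = list_parquet_urls(table_root)
    with make_batch_reader(urls) as reader:
        for batch in reader:
            yield batch
```

One caveat: a plain directory listing knows nothing about the table format's own bookkeeping. If the table keeps older versions of data files around or stores updates in separate log files, reading every `.parquet` file this way may return stale or duplicate rows, so the list of files should ideally come from the table format's own tooling.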