Configure Petastorm to point to HDFS and Spark

uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.

Apache License 2.0

By default, Petastorm uses the libhdfs3 driver. You need to make sure that your HADOOP_HOME environment variable points to a proper Hadoop installation. If Hadoop is configured properly, you should be able to run something like this from the command line:

hdfs dfs -ls /
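
If you'd rather run the same check from Python (for example, on the machine where the training job runs), something like the sketch below might help. It uses only the standard library; the function names `check_hadoop_env` and `can_list_hdfs_root` are made up for illustration and are not part of Petastorm or Hadoop.

```python
import os
import shutil
import subprocess


def check_hadoop_env(environ=None):
    """Return a list of problems found with HADOOP_HOME; an empty list means it looks fine."""
    environ = os.environ if environ is None else environ
    problems = []
    hadoop_home = environ.get("HADOOP_HOME")
    if not hadoop_home:
        problems.append("HADOOP_HOME is not set")
    elif not os.path.isdir(hadoop_home):
        problems.append("HADOOP_HOME points to a missing directory: " + hadoop_home)
    return problems


def can_list_hdfs_root(timeout_s=30):
    """Run `hdfs dfs -ls /` and report whether it exited with status 0."""
    hdfs = shutil.which("hdfs")  # the hdfs launcher must be on PATH
    if hdfs is None:
        return False
    try:
        result = subprocess.run(
            [hdfs, "dfs", "-ls", "/"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

Calling `check_hadoop_env()` with no arguments inspects the current process environment, and `can_list_hdfs_root()` runs the same listing command as above, so a `True` result means that command succeeded.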

Once that command succeeds, Petastorm should work. You can also try passing hdfs_driver='libhdfs' as a make_reader parameter. That forces Petastorm to use the official Hadoop driver (this one is Java-based).
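
A rough sketch of switching drivers is below. `make_reader` and its `hdfs_driver` parameter are Petastorm's own; the helpers `reader_kwargs` and `open_reader` and the namenode URL in the comment are placeholders I made up, so substitute your own values. Note that `make_reader` expects a dataset written by Petastorm; for a plain Parquet store you'd probably want `make_batch_reader` instead.

```python
# The two driver names Petastorm documents for HDFS access.
SUPPORTED_DRIVERS = ("libhdfs3", "libhdfs")


def reader_kwargs(hdfs_driver="libhdfs"):
    """Build the keyword arguments that select an HDFS driver (hypothetical helper)."""
    if hdfs_driver not in SUPPORTED_DRIVERS:
        raise ValueError("unknown hdfs_driver: " + repr(hdfs_driver))
    return {"hdfs_driver": hdfs_driver}


def open_reader(dataset_url, hdfs_driver="libhdfs"):
    """Open a Petastorm reader on dataset_url using the chosen HDFS driver."""
    # Imported lazily so this file can be loaded without petastorm installed.
    from petastorm import make_reader

    return make_reader(dataset_url, **reader_kwargs(hdfs_driver))


# Example usage (the URL is a placeholder, not a real cluster):
#
# with open_reader("hdfs://namenode:8020/path/to/dataset") as reader:
#     for row in reader:
#         print(row)
#         break
```

If the default driver fails but this works (or the other way round), that usually narrows the problem down to the driver setup rather than the dataset itself.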

It might be a good idea to ask your Hadoop system administrator for help configuring Hadoop for your particular environment.

Configure Petastorm to point to HDFS and Spark #395