uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Configure Petastorm to point to HDFS and Spark #395

Closed prakashmstpt closed 5 years ago

prakashmstpt commented 5 years ago

What are the configuration steps to point Petastorm at a Hadoop cluster? Just passing hdfs://nn:8020/file is not working. Is any specific setup required in pyarrow? Similarly, do we need to specify anything in spark-defaults.conf?

I appreciate your input.

PS: If you have specific instructions for Cloudera or any other Kerberos-enabled cluster, that would be useful as well.

selitvin commented 5 years ago

By default, Petastorm uses the libhdfs3 driver. You need to make sure that your HADOOP_HOME environment variable points to a proper Hadoop installation. If Hadoop is configured properly, you should be able to run something like this from the command line:

hdfs dfs -ls /

Once it succeeds, Petastorm should work. You can also try specifying hdfs_driver='libhdfs' as a make_reader parameter. That forces Petastorm to use the official Hadoop driver (this one relies on the Java Hadoop client).
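
As a rough sketch of the steps above, the following Python checks HADOOP_HOME, runs the same `hdfs dfs -ls /` smoke test, and then opens a reader with the `libhdfs` driver. The helper names (`check_hadoop_env`, `can_list_hdfs_root`, `read_first_row`) are made up for illustration and are not part of Petastorm. It also assumes the dataset was written by Petastorm, since `make_reader` expects Petastorm metadata.

```python
import os
import shutil
import subprocess


def check_hadoop_env(env=None):
    """Return a list of problems found in the Hadoop setup (empty if none)."""
    env = os.environ if env is None else env
    problems = []
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        problems.append("HADOOP_HOME is not set")
    elif not os.path.isdir(hadoop_home):
        problems.append("HADOOP_HOME does not point to a directory: " + hadoop_home)
    # The smoke test below needs the `hdfs` command to be reachable.
    if shutil.which("hdfs", path=env.get("PATH")) is None:
        problems.append("the 'hdfs' command was not found on PATH")
    return problems


def can_list_hdfs_root():
    """Run `hdfs dfs -ls /` and report whether it exited with code 0."""
    try:
        result = subprocess.run(
            ["hdfs", "dfs", "-ls", "/"], capture_output=True, text=True
        )
    except FileNotFoundError:
        # The `hdfs` command is not installed or not on PATH.
        return False
    return result.returncode == 0


def read_first_row(dataset_url, hdfs_driver="libhdfs"):
    """Open a Petastorm dataset with the given HDFS driver and return one row."""
    # Imported lazily so the checks above work without petastorm installed.
    from petastorm import make_reader

    with make_reader(dataset_url, hdfs_driver=hdfs_driver) as reader:
        return next(iter(reader))
```

A call such as `read_first_row("hdfs://nn:8020/file")` would then use the URL from the question. It is probably worth calling it only after `check_hadoop_env()` returns an empty list and `can_list_hdfs_root()` returns True.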

It might be a good idea to consult your Hadoop system administrator for help configuring Hadoop for your particular environment.