uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Petastorm requires hadoop on client? #607

Closed ychnh closed 4 years ago

ychnh commented 4 years ago

I am very new to database programming and not very sure about this area. I have HDFS running on 2 server nodes, and I tried creating a reader for a Parquet table stored there. When I run the following code on my AI machine, I get an error. Does the AI server need to have Hadoop installed as well?

from petastorm import make_reader
dataset_url = 'hdfs://192.168.0.32:9000/TrainingData/34611012/2048_1536/tif/parquet/'
make_reader(dataset_url)
------------
Unable to populate a sensible HadoopConfiguration for namenode resolution!
Path of last environment var (None) tried [None]. Please set up your Hadoop and 
define environment variable HADOOP_HOME to point to your Hadoop installation path.

selitvin commented 4 years ago

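Yes. To access HDFS, Hadoop must be installed and properly configured on the machine that reads the data, which here is your AI server. As the error message says, the HADOOP_HOME environment variable on that machine must point to the Hadoop installation path.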
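
As a rough illustration, the client machine could be checked before calling make_reader along these lines. This is only a sketch: the helper name find_hadoop_problems is made up for this example, and it assumes the conventional Hadoop layout in which core-site.xml and hdfs-site.xml live under $HADOOP_HOME/etc/hadoop. Your installation may keep its configuration elsewhere.

```python
import os


def find_hadoop_problems(env=None):
    """Return a list of problems found with the local Hadoop client setup.

    Hypothetical helper for illustration; it is not part of Petastorm.
    """
    env = os.environ if env is None else env
    problems = []

    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        problems.append("HADOOP_HOME is not set")
        return problems
    if not os.path.isdir(hadoop_home):
        problems.append(f"HADOOP_HOME is not a directory: {hadoop_home}")
        return problems

    # Assumption: configuration files follow the usual etc/hadoop layout.
    conf_dir = os.path.join(hadoop_home, "etc", "hadoop")
    for name in ("core-site.xml", "hdfs-site.xml"):
        if not os.path.isfile(os.path.join(conf_dir, name)):
            problems.append(f"missing {name} in {conf_dir}")
    return problems


if __name__ == "__main__":
    problems = find_hadoop_problems()
    if problems:
        print("Hadoop client setup looks incomplete:")
        for problem in problems:
            print("  -", problem)
    else:
        # Imported late so the check itself runs without petastorm installed.
        from petastorm import make_reader

        dataset_url = "hdfs://192.168.0.32:9000/TrainingData/34611012/2048_1536/tif/parquet/"
        with make_reader(dataset_url) as reader:
            for row in reader:
                print(row)
                break  # one row is enough to confirm the connection works
```

If the check reports nothing and the error still appears, the next thing to look at is probably the contents of the site XML files rather than their location.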

ychnh commented 4 years ago

Thank you very much.