Closed prakashmstpt closed 5 years ago
By default, Petastorm uses the libhdfs3 driver. You need to make sure that your HADOOP_HOME
environment variable points to a proper Hadoop installation. If Hadoop is configured properly, you should be able to run something like this from the command line:
hdfs dfs -ls /
Once that succeeds, Petastorm should work. You can also try specifying hdfs_driver='libhdfs'
as a make_reader parameter. That forces Petastorm to use the official Hadoop driver (the one written in Java).
Might be a good idea to consult the Hadoop system administrator for help configuring Hadoop for your particular environment.
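A minimal sketch of the hdfs_driver suggestion above. The dataset URL (namenode host `nn`, port `8020`, and path) is a placeholder, and the helper function name is made up for illustration; only `make_reader` and its `hdfs_driver` parameter come from Petastorm itself:

```python
def rows_from_hdfs(dataset_url, driver='libhdfs'):
    """Yield rows from a Petastorm dataset stored on HDFS.

    driver='libhdfs' forces the official JNI-based Hadoop client instead of
    the default libhdfs3. dataset_url is a placeholder, e.g.
    'hdfs://nn:8020/path/to/dataset'.
    """
    # Imported lazily so the helper can be defined without petastorm installed.
    from petastorm import make_reader

    with make_reader(dataset_url, hdfs_driver=driver) as reader:
        for row in reader:
            yield row
```

Note that `libhdfs` requires a working JVM and the Hadoop jars on the classpath, which is why checking HADOOP_HOME first matters.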
What are the configuration steps to point Petastorm at a Hadoop cluster? Just passing hdfs://nn:8020/file is not working. Is any specific setup required in pyarrow? Similarly, do we need to specify anything in spark-defaults.conf?
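As a first sanity check, the dataset URL must carry the scheme, the namenode host, and the port; a quick stdlib-only way to confirm the URL is well formed (host `nn`, port `8020`, and path are placeholders from the example above):

```python
from urllib.parse import urlparse

url = 'hdfs://nn:8020/path/to/dataset'  # placeholder namenode and path
parts = urlparse(url)

# Scheme must be 'hdfs', and hostname/port must match the active namenode
# (compare against fs.defaultFS in core-site.xml on the cluster).
print(parts.scheme, parts.hostname, parts.port, parts.path)
# → hdfs nn 8020 /path/to/dataset
```

If the URL parses correctly but access still fails, the problem is usually in the client-side Hadoop configuration rather than the URL itself.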
I appreciate your input.
PS: if you have specific instructions for Cloudera or any other Kerberos-enabled cluster, that would be useful as well.
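For Kerberos-secured clusters (Cloudera included), a common prerequisite is holding a valid ticket in the shell before any HDFS client, Petastorm included, can authenticate. A sketch of the usual sequence, where the keytab path and principal are placeholders for your environment:

```shell
# Obtain a Kerberos ticket (keytab path and principal are placeholders):
kinit -kt /path/to/user.keytab user@EXAMPLE.COM

# Verify the ticket was granted:
klist

# This should now succeed without a GSS/authentication error,
# and Petastorm run from the same shell inherits the ticket cache:
hdfs dfs -ls /
```

This is environment setup rather than Petastorm configuration; if `hdfs dfs -ls /` fails with a GSS error even after `kinit`, the Kerberos setup itself needs attention first.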