Closed · okedoki closed this issue 5 years ago
I'm going to offer something fairly trivial, but it might help with debugging. In a shell, can you run

```shell
conda create -n petastorm_test python=3.7
conda activate petastorm_test
pip install petastorm
pip install tensorflow
```

and then run your code? I suspect this is an environment issue (I see you're also on Python 2.7, and I would recommend moving to Python 3).
@praateekmahajan thanks for getting back to me.
Unfortunately, that didn't help. Moreover, I now get an extra issue with TensorFlow (ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.8').
In your example you do:

```python
hdfs.connect(hostname)
```

However, that uses the libhdfs (Java-based) driver by default. Would it still succeed if you pass the optional driver='libhdfs3'?

```python
hdfs.connect(hostname, driver='libhdfs3')
```
Is it possible to use the libhdfs driver in your case: make_reader(..., hdfs_driver='libhdfs')?
@selitvin Yes, it succeeds with hdfs.connect(hostname, driver='libhdfs3').
It looks like make_reader(..., hdfs_driver='libhdfs') fixes this problem! Though now I have another one:

```
RuntimeError: Currently make_reader supports reading only Petastorm datasets. To read from a non-Petastorm Parquet store use make_batch_reader
```

Are they related?
This is something different. You are trying to read a Parquet store that was not created by Petastorm, i.e. an external Parquet store. You may find some information here: https://github.com/uber/petastorm#non-petastorm-parquet-stores
Please let me know if that clarifies the picture or if you need more information.
Thanks, that seems to have fixed it. I have another issue, but this one can be closed.
Hello,
I have an issue when I'm trying to read a Parquet file from HDFS.
When I run this example:

```python
import tensorflow as tf
from petastorm import make_reader
from petastorm.tf_utils import make_petastorm_dataset

with make_reader(path) as reader:
    dataset = make_petastorm_dataset(reader)
    iterator = dataset.make_one_shot_iterator()
    tensor = iterator.get_next()
    with tf.Session() as sess:
        sample = sess.run(tensor)
        print(sample.id)
```

I get the following traceback:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/olbico/anaconda2/lib/python2.7/site-packages/petastorm/reader.py", line 120, in make_reader
    resolver = FilesystemResolver(dataset_url, hdfs_driver=hdfs_driver)
  File "/home/olbico/anaconda2/lib/python2.7/site-packages/petastorm/fs_utils.py", line 89, in __init__
    self._filesystem = connector.hdfs_connect_namenode(self._parsed_dataset_url, user=user)
  File "/home/olbico/anaconda2/lib/python2.7/site-packages/petastorm/hdfs/namenode.py", line 266, in hdfs_connect_namenode
    return pyarrow.hdfs.connect(hostname, url.port or 8020, driver=driver, user=user)
  File "/home/olbico/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 215, in connect
    extra_conf=extra_conf)
  File "/home/olbico/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 40, in __init__
    self._connect(host, port, user, kerb_ticket, driver, extra_conf)
  File "pyarrow/io-hdfs.pxi", line 93, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Prior attempt to load libhdfs3 failed
```
If I call the function directly from pyarrow, I don't have any problems:

```python
from pyarrow import hdfs

fs = hdfs.connect()
with fs.open(path) as f:
    ...
```
I'm working in a conda environment with libhdfs3 installed.
I would appreciate any suggestions for solving this problem. Thanks.