uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0

pyarrow.lib.ArrowIOError: Prior attempt to load libhdfs3 failed #418

Closed okedoki closed 5 years ago

okedoki commented 5 years ago

Hello,

I have an issue when I'm trying to read a parquet file from HDFS.

When I run an example:

```python
import tensorflow as tf
from petastorm import make_reader
from petastorm.tf_utils import make_petastorm_dataset

with make_reader(path) as reader:
    dataset = make_petastorm_dataset(reader)
    iterator = dataset.make_one_shot_iterator()
    tensor = iterator.get_next()
    with tf.Session() as sess:
        sample = sess.run(tensor)
        print(sample.id)
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/olbico/anaconda2/lib/python2.7/site-packages/petastorm/reader.py", line 120, in make_reader
    resolver = FilesystemResolver(dataset_url, hdfs_driver=hdfs_driver)
  File "/home/olbico/anaconda2/lib/python2.7/site-packages/petastorm/fs_utils.py", line 89, in __init__
    self._filesystem = connector.hdfs_connect_namenode(self._parsed_dataset_url, user=user)
  File "/home/olbico/anaconda2/lib/python2.7/site-packages/petastorm/hdfs/namenode.py", line 266, in hdfs_connect_namenode
    return pyarrow.hdfs.connect(hostname, url.port or 8020, driver=driver, user=user)
  File "/home/olbico/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 215, in connect
    extra_conf=extra_conf)
  File "/home/olbico/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 40, in __init__
    self._connect(host, port, user, kerb_ticket, driver, extra_conf)
  File "pyarrow/io-hdfs.pxi", line 93, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Prior attempt to load libhdfs3 failed
```

If I try to call the function directly from pyarrow, I don't have any problems:

```python
from pyarrow import hdfs

fs = hdfs.connect()
with fs.open(path) as f:
    ...
```

I work in a conda environment with libhdfs3 installed.
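
(For context, the usual way to get libhdfs3 into a conda environment at the time was via conda-forge; this is an assumption about the setup, not stated in the thread:

```
conda install -c conda-forge libhdfs3
```
)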

I would appreciate any suggestions to solve this problem. Thanks.

praateekmahajan commented 5 years ago

I'm going to offer something extremely trivial, but it might be helpful for debugging. In a shell, can you run

```
conda create -n petastorm_test python=3.7
conda activate petastorm_test
pip install petastorm
pip install tensorflow
```

and then run your code? I'd imagine this is an environment issue (I see you're also using Python 2.7, and would recommend moving to Python 3).

okedoki commented 5 years ago

@praateekmahajan thanks for getting back to me

Unfortunately, that didn't help. Moreover, I got another issue with tensorflow (ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.8')

selitvin commented 5 years ago

In your example you do:

```python
hdfs.connect(hostname)
```

However, hdfs.connect uses the libhdfs (Java-based) driver by default. Would it still succeed if you pass the optional driver='libhdfs3'?

```python
hdfs.connect(hostname, driver='libhdfs3')
```

Is it possible to use the libhdfs driver in your case: `make_reader(..., hdfs_driver='libhdfs')`?
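
For reference, a minimal sketch of the snippet from the report with the driver switched (assuming `path` is the same hdfs:// URL as before):

```python
from petastorm import make_reader

# hdfs_driver='libhdfs' selects the JNI (Java-based) driver
# instead of the default libhdfs3
with make_reader(path, hdfs_driver='libhdfs') as reader:
    for sample in reader:
        print(sample.id)
```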

okedoki commented 5 years ago

@selitvin Yes, it succeeded with hdfs.connect(hostname, driver='libhdfs3').

It looks like make_reader(..., hdfs_driver='libhdfs') fixes this problem! Though I have another one now:

```
RuntimeError: Currently make_reader supports reading only Petastorm datasets. To read from a non-Petastorm Parquet store use make_batch_reader
```

Are they related?

selitvin commented 5 years ago

This is something different: it is about trying to read a Parquet store that was not created using petastorm, i.e. some external Parquet store. You may find some information here: https://github.com/uber/petastorm#non-petastorm-parquet-stores
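
For illustration, a minimal sketch of the make_batch_reader path (assuming the same `path` as before, and that make_batch_reader accepts the same hdfs_driver argument as make_reader):

```python
import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# make_batch_reader handles plain Parquet stores that were not
# written by petastorm; it yields batches of rows rather than
# individual rows
with make_batch_reader(path, hdfs_driver='libhdfs') as reader:
    dataset = make_petastorm_dataset(reader)
    iterator = dataset.make_one_shot_iterator()
    tensor = iterator.get_next()
    with tf.Session() as sess:
        batch = sess.run(tensor)
        print(batch.id)  # 'id' assumed to be a column in the store
```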

Please let me know if it clarifies the picture or you need more information.

okedoki commented 5 years ago

Thanks, it seems to be fixed. I have another issue, but this one can be closed.