Closed · okedoki closed this issue 5 years ago
I'm going to offer something fairly trivial, but it might help with debugging. In a shell, can you run

```shell
conda create -n petastorm_test python=3.7
conda activate petastorm_test
pip install petastorm
pip install tensorflow
```

and then run your code? I suspect this is an environment issue (I see you're also on Python 2.7, and I would recommend moving to Python 3).
@praateekmahajan thanks for getting back to me.
Unfortunately, that didn't help. Moreover, I now get an extra issue with TensorFlow (ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.8').
In your example you do:

```python
hdfs.connect(hostname)
```

However, that uses the libhdfs (Java-based) driver by default. Would it still succeed if you pass the optional driver='libhdfs3'?

```python
hdfs.connect(hostname, driver='libhdfs3')
```
Is it possible to use the libhdfs driver in your case: make_reader(..., hdfs_driver='libhdfs')?
@selitvin Yes, it succeeds with hdfs.connect(hostname, driver='libhdfs3').
It looks like make_reader(..., hdfs_driver='libhdfs') fixes this problem! Though now I have another one:

```
RuntimeError: Currently make_reader supports reading only Petastorm datasets. To read from a non-Petastorm Parquet store use make_batch_reader
```

Are they related?
This is something different. You are trying to read a Parquet store that was not created by Petastorm, i.e. an external Parquet store. You may find some information here: https://github.com/uber/petastorm#non-petastorm-parquet-stores
Please let me know if that clarifies the picture or if you need more information.
Thanks, that seems to have fixed it. I have another issue, but this one can be closed.
Hello,
I have an issue when I'm trying to read a Parquet file from HDFS.
When I run this example:

```python
import tensorflow as tf
from petastorm import make_reader
from petastorm.tf_utils import make_petastorm_dataset

with make_reader(path) as reader:
    dataset = make_petastorm_dataset(reader)
    iterator = dataset.make_one_shot_iterator()
    tensor = iterator.get_next()
    with tf.Session() as sess:
        sample = sess.run(tensor)
        print(sample.id)
```

I get the following traceback:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/olbico/anaconda2/lib/python2.7/site-packages/petastorm/reader.py", line 120, in make_reader
    resolver = FilesystemResolver(dataset_url, hdfs_driver=hdfs_driver)
  File "/home/olbico/anaconda2/lib/python2.7/site-packages/petastorm/fs_utils.py", line 89, in __init__
    self._filesystem = connector.hdfs_connect_namenode(self._parsed_dataset_url, user=user)
  File "/home/olbico/anaconda2/lib/python2.7/site-packages/petastorm/hdfs/namenode.py", line 266, in hdfs_connect_namenode
    return pyarrow.hdfs.connect(hostname, url.port or 8020, driver=driver, user=user)
  File "/home/olbico/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 215, in connect
    extra_conf=extra_conf)
  File "/home/olbico/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 40, in __init__
    self._connect(host, port, user, kerb_ticket, driver, extra_conf)
  File "pyarrow/io-hdfs.pxi", line 93, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Prior attempt to load libhdfs3 failed
```
If I call the function directly from pyarrow, I don't have any problems:

```python
from pyarrow import hdfs

fs = hdfs.connect()
with fs.open(path) as f:
    ...
```
I'm working in a conda environment with libhdfs3 installed.
I would appreciate any suggestions for solving this problem. Thanks.