Open quiescentsam opened 5 years ago
Can you please confirm that your HADOOP_HOME environment variable points to a valid Hadoop installation directory, specifically that $HADOOP_HOME/etc/hadoop/hdfs-site.xml and $HADOOP_HOME/etc/hadoop/core-site.xml are valid? Does hdfs dfs -ls / work for you from the command line (with $HADOOP_HOME set to the same value as when you run the Python program that uses petastorm)?
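The checks above can be scripted. Below is a small sketch (not part of petastorm) that verifies HADOOP_HOME is set and that the two client configuration files exist under it:

```python
import os

def check_hadoop_conf(hadoop_home):
    """Return {filename: exists} for the client config files petastorm reads."""
    conf_dir = os.path.join(hadoop_home, "etc", "hadoop")
    return {name: os.path.isfile(os.path.join(conf_dir, name))
            for name in ("core-site.xml", "hdfs-site.xml")}

if __name__ == "__main__":
    home = os.environ.get("HADOOP_HOME", "")
    if not home:
        print("HADOOP_HOME is not set")
    else:
        for name, ok in sorted(check_hadoop_conf(home).items()):
            print(name, "found" if ok else "MISSING")
```

If either file is reported missing, petastorm's HDFS resolution will fail before any data is read or written.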
I face the same issue now. My HADOOP_HOME is set correctly:
$ echo $HADOOP_HOME
/usr/local/hadoop
I have /usr/local/hadoop/etc/hadoop/hdfs-site.xml configured as well. I set HADOOP_CONF_DIR and SPARK_DIST_CLASSPATH in /usr/local/spark/conf/spark-env.sh as follows:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
And indeed, hdfs dfs -ls / works fine:
$ hdfs dfs -ls /
Found 2 items
drwxrwx--- - hduser hadoop 0 2020-01-23 09:04 /tmp
drwxr-xr-x - hduser supergroup 0 2020-01-16 07:52 /user
I tried to store the 'hello world' dataset on HDFS, simply by changing the output_url in https://github.com/uber/petastorm/blob/e8b9f74c8db63f74c2f3b1658829089ee2d2ccdf/examples/hello_world/petastorm_dataset/generate_petastorm_dataset.py#L43 to:
def generate_petastorm_dataset(output_url='hdfs:///tmp/hello_world_dataset'):
As you can see from the hdfs dfs -ls / output above, /tmp exists on HDFS and has the correct access rights for the hadoop group, which I'm using to run generate_petastorm_dataset.py.
What else am I missing?
OK, I took a closer look at: https://github.com/uber/petastorm/blob/e8b9f74c8db63f74c2f3b1658829089ee2d2ccdf/petastorm/hdfs/namenode.py#L110 and consequently: https://github.com/uber/petastorm/blob/e8b9f74c8db63f74c2f3b1658829089ee2d2ccdf/petastorm/hdfs/namenode.py#L84
and it looks like petastorm is coded to work on high-availability (HA) clusters only, as it requires a non-empty list of namenodes from the 'dfs.ha.namenodes.*' Hadoop configuration.
My cluster is a simple sandbox installation with a single namenode. Do I have to configure an HA cluster, or is there a way to use petastorm on a simple cluster with just a single namenode?
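To see whether a given hdfs-site.xml would satisfy that requirement, one can scan it for dfs.ha.namenodes.* properties. A minimal sketch (the property-name convention is standard Hadoop; this is not petastorm's own parser):

```python
import xml.etree.ElementTree as ET

PREFIX = "dfs.ha.namenodes."

def ha_namenodes(hdfs_site_xml):
    """Collect dfs.ha.namenodes.* properties from an hdfs-site.xml string.

    An empty result means the configuration describes a single,
    non-HA namenode -- the case that petastorm's resolver rejects.
    """
    result = {}
    for prop in ET.fromstring(hdfs_site_xml).iter("property"):
        name = prop.findtext("name") or ""
        if name.startswith(PREFIX):
            service = name[len(PREFIX):]
            result[service] = (prop.findtext("value") or "").split(",")
    return result
```

Running this against the sandbox cluster's hdfs-site.xml returns an empty dict, which matches the behavior observed above.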
Well, not on purpose. In our clouds we have only HA ones.
I guess one option is: would configuring dfs.ha.namenodes with a list of addresses that both point to the same local namenode work?
I actually went ahead and reconfigured the cluster into a proper HA one, as in the end we'd need it this way anyway, and I can confirm that with that config everything works well. One thing worth mentioning in your documentation: storing to HDFS by default requires libhdfs3, and if it's missing one gets pretty cryptic exceptions, since you catch the exception that clearly says the lib is missing and raise your own. So, consider mentioning that dependency in the installation section of your documentation, or make it a dependency for automatic installation, if that makes sense.
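For reference, the HA-related properties involved look roughly like the fragment below. The nameservice and namenode ids ("mycluster", "nn1", "nn2") and the host names are placeholders, not values taken from this thread:

```xml
<!-- Minimal HA-related hdfs-site.xml properties (placeholder values). -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
```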
Thank you for the feedback. I'll leave the ticket open to track the documentation update and to improve the error messages in this scenario.
@selitvin I had the same problem. I am using a Docker image which is running an HDFS cluster. I set the values for dfs.ha.namenodes but it doesn't change anything. Any ideas?
The code will try to load configuration from the $HADOOP_HOME/etc/hadoop/, $HADOOP_PREFIX/etc/hadoop/ and $HADOOP_INSTALL/etc/hadoop/ locations (in this order). Is it possible that we don't find the right hdfs-site.xml and core-site.xml files?
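The lookup order described above can be sketched as a small helper (a simplification; petastorm's resolver may apply extra checks beyond "first environment variable that is set wins"):

```python
import os

def find_hadoop_conf_dir(environ=os.environ):
    """Return (env var, conf dir) using the HADOOP_HOME -> HADOOP_PREFIX
    -> HADOOP_INSTALL precedence, or (None, None) if none is set."""
    for var in ("HADOOP_HOME", "HADOOP_PREFIX", "HADOOP_INSTALL"):
        root = environ.get(var)
        if root:
            return var, os.path.join(root, "etc", "hadoop")
    return None, None
```

Note that an earlier variable pointing at a directory without the right hdfs-site.xml/core-site.xml will shadow a later, correctly configured one.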
Hi,
Traceback (most recent call last):
  File "", line 1, in
  File "/petastorm_venv3.6/lib/python3.6/site-packages/petastorm/reader.py", line 120, in make_reader
    resolver = FilesystemResolver(dataset_url, hdfs_driver=hdfs_driver)
  File "/petastorm_venv3.6/lib/python3.6/site-packages/petastorm/fs_utils.py", line 96, in __init__
    nameservice, namenodes = namenode_resolver.resolve_default_hdfs_service()
  File "/petastorm_venv3.6/lib/python3.6/site-packages/petastorm/hdfs/namenode.py", line 124, in resolve_default_hdfs_service
    .format(default_fs)))
OSError: Unable to get namenodes for default service "hdfs://master:8020" from
Hadoop path /opt/cloudera/parcels/CDH/lib/hadoop in environment variable HADOOP_HOME!
Please check your hadoop configuration!
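A way to narrow this down outside of petastorm is to reproduce the check the error describes: read fs.defaultFS from core-site.xml and look for a matching dfs.ha.namenodes entry in hdfs-site.xml. This is a rough reconstruction, assuming the resolver keys on the host part of fs.defaultFS:

```python
import os
import xml.etree.ElementTree as ET

def _property(path, wanted):
    """Return the value of a single <property> from a Hadoop XML config file."""
    for prop in ET.parse(path).getroot().iter("property"):
        if prop.findtext("name") == wanted:
            return prop.findtext("value")
    return None

def diagnose(conf_dir):
    """Check that fs.defaultFS has a matching dfs.ha.namenodes.<service> entry."""
    default_fs = _property(os.path.join(conf_dir, "core-site.xml"), "fs.defaultFS")
    if not default_fs:
        return "fs.defaultFS is not set in core-site.xml"
    # Strip the scheme and port: hdfs://master:8020 -> master
    service = default_fs.split("://", 1)[-1].split(":")[0]
    namenodes = _property(os.path.join(conf_dir, "hdfs-site.xml"),
                          "dfs.ha.namenodes." + service)
    if not namenodes:
        return ("no dfs.ha.namenodes.%s entry; namenodes for %s cannot be "
                "resolved" % (service, default_fs))
    return "namenodes for %s: %s" % (service, namenodes)
```

For the CDH path in the traceback, running diagnose on /opt/cloudera/parcels/CDH/lib/hadoop/etc/hadoop (or wherever HADOOP_CONF_DIR points) should show whether the "master" service has any HA namenodes configured.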