intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

tfds.load cannot download dataset if set hdfs dir #14

Closed 704572066 closed 2 years ago

704572066 commented 2 years ago

[Attached screenshots: 微信图片_20211206113640, 微信图片_20211206113650, 微信图片_20211206113701]

yangw1234 commented 2 years ago

Hi, the failure occurs before your code reaches Analytics Zoo; it looks like TensorFlow is not set up to read HDFS.

Would you mind trying to follow the steps here and make sure TensorFlow can access HDFS? https://github.com/tensorflow/docs/blob/r1.11/site/en/deploy/hadoop.md
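A quick way to see whether the environment described in that guide is in place is to check the variables it asks for before calling `tfds.load`. The following is a minimal sketch, not part of Analytics Zoo or TensorFlow: the helper name `missing_hdfs_env` is hypothetical, and the variable list reflects the linked guide as I understand it (`JAVA_HOME`, `HADOOP_HDFS_HOME`, `LD_LIBRARY_PATH`, and a `CLASSPATH` containing the expanded Hadoop jars).

```python
import os

# Variables the linked TensorFlow-on-Hadoop guide asks for (assumed list).
REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HDFS_HOME", "LD_LIBRARY_PATH", "CLASSPATH")


def missing_hdfs_env(env=None):
    """Return the names of required variables that are unset or empty.

    `env` defaults to the current process environment; pass a dict to
    check a different set of values.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_hdfs_env()
    if missing:
        print("Unset or empty:", ", ".join(missing))
    else:
        print("All HDFS-related variables are set.")
```

This only confirms that the variables exist, not that their values are correct. The same check has to pass on every Spark executor, not just on the driver.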

704572066 commented 2 years ago

After setting the environment variables it can download the data, but I came across another problem:

Py4JJavaError: An error occurred while calling o69.estimatorTrainMiniBatch.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 20, hadoop102, executor 2): org.tensorflow.TensorFlowException: hdfs:///user/root/mnist/3.0.0/mnist-train.tfrecord-00000-of-00001; Unknown error 255

Attachment: Untitled5.md

[Attached screenshots: 微信截图_20211206233951, 微信截图_20211207094529]

yangw1234 commented 2 years ago

@704572066 This may be caused by an HDFS configuration issue. Could you also check whether your PySpark program can read the HDFS files using Spark alone, without TensorFlow? Here is also a link on troubleshooting common causes of "Unknown error 255": https://www.ibm.com/support/pages/datastage-bdfs-stage-gets-error-255-connecting-remote-hadoophdfs-server

In the meantime, we'll try to reproduce on our side.
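The Spark-only read suggested above could look roughly like the sketch below. The helper names `executor_read_sizes` and `run_check` are hypothetical; the path is the one from the error message. It uses `SparkContext.binaryFiles`, so the file is read inside Spark tasks on the executors, which is where the original failure was reported.

```python
def executor_read_sizes(sc, path):
    """Read `path` with Spark only and return {file name: size in bytes}.

    binaryFiles yields (file name, file content) pairs; the content is
    read in tasks on the executors, so a failure here points at HDFS
    access from Spark rather than at TensorFlow.
    """
    return dict(sc.binaryFiles(path).mapValues(len).collect())


def run_check(path="hdfs:///user/root/mnist/3.0.0/mnist-train.tfrecord-00000-of-00001"):
    # Imported here so the helper above has no hard dependency on pyspark.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    for name, size in executor_read_sizes(sc, path).items():
        print(name, size, "bytes")
```

Calling `run_check()` from a PySpark session should print the file name and its size if Spark can read the file. If that works while the TensorFlow read still fails, the problem is more likely in the native HDFS setup that TensorFlow relies on.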

704572066 commented 2 years ago

Thank you, I have resolved the problem. It was caused by libhdfs.so: I had been using the copy from the directory "/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib64/"; when I switched to the file in the directory "/etc/hadoop/conf/lib/native", it worked!
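For anyone hitting the same error, a small script can list which `libhdfs.so` copies exist on a node and whether each one loads. This is a minimal sketch under stated assumptions: the helper names `find_libhdfs` and `try_load` are hypothetical, and the two directories are simply the ones mentioned in this thread.

```python
import ctypes
import os


def find_libhdfs(dirs, name="libhdfs.so"):
    """Return the paths under `dirs` that contain a file called `name`."""
    return [
        os.path.join(d, name)
        for d in dirs
        if os.path.isfile(os.path.join(d, name))
    ]


def try_load(path):
    """Return None if the shared library loads, else the loader's error text."""
    try:
        ctypes.CDLL(path)
        return None
    except OSError as exc:
        return str(exc)


if __name__ == "__main__":
    # Directories taken from this thread; adjust for your installation.
    candidates = find_libhdfs([
        "/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib64",
        "/etc/hadoop/conf/lib/native",
    ])
    for lib in candidates:
        error = try_load(lib)
        print(lib, "->", "loads" if error is None else error)
```

A load failure may only mean that `libjvm.so` is not on `LD_LIBRARY_PATH`, and a successful load does not prove that the library matches the cluster's Hadoop version, so treat the output as a hint rather than a diagnosis.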