intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

[BigDL2.0] autoestimator_pytorch cannot save model to an hdfs path on k8s #22

Open Le-Zheng opened 2 years ago

Le-Zheng commented 2 years ago

http://10.112.231.51:18888/view/BigDL-2.0-NB/job/BigDL-NB-K8s-ExampleTests/152/console

(pid=244, ip=172.30.27.4) /opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/model/base_pytorch_model.py:180: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
(pid=244, ip=172.30.27.4)   return torch.from_numpy(inp)
(pid=244, ip=172.30.27.4) 
  0%|          | 0/16 [00:00<?, ?it/s]/usr/local/envs/pytf1/lib/python3.7/site-packages/torch/autograd/__init__.py:132: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
(pid=244, ip=172.30.27.4)   allow_unreachable=True)  # allow_unreachable flag
(pid=244, ip=172.30.27.4) 
Loss: 0.6922382116317749:   0%|          | 0/16 [00:00<?, ?it/s]
Loss: 0.4504893720149994:   6%|▋         | 1/16 [00:00<00:00, 50.22it/s]
(pid=244, ip=172.30.27.4) 
Loss: 0.27864789962768555:  12%|█▎        | 2/16 [00:00<00:00, 82.55it/s]
Loss: 0.18915259838104248:  19%|█▉        | 3/16 [00:00<00:00, 106.19it/s]
Loss: 0.112899050116539:  25%|██▌       | 4/16 [00:00<00:00, 124.31it/s]  
Loss: 0.09547075629234314:  31%|███▏      | 5/16 [00:00<00:00, 138.47it/s]
Loss: 0.029641583561897278:  38%|███▊      | 6/16 [00:00<00:00, 150.55it/s]
Loss: 0.056755051016807556:  44%|████▍     | 7/16 [00:00<00:00, 160.61it/s]
Loss: 0.019430123269557953:  50%|█████     | 8/16 [00:00<00:00, 170.19it/s]
Loss: 0.002557608764618635:  56%|█████▋    | 9/16 [00:00<00:00, 178.60it/s]
Loss: 0.004579346626996994:  62%|██████▎   | 10/16 [00:00<00:00, 185.35it/s]
Loss: 0.0019340637372806668:  69%|██████▉   | 11/16 [00:00<00:00, 192.40it/s]
Loss: 0.00223898165859282:  75%|███████▌  | 12/16 [00:00<00:00, 198.61it/s]  
Loss: 0.005255652591586113:  81%|████████▏ | 13/16 [00:00<00:00, 200.80it/s]
Loss: 0.00018203322542831302:  88%|████████▊ | 14/16 [00:00<00:00, 206.26it/s]
Loss: 0.055765699595212936:  94%|█████████▍| 15/16 [00:00<00:00, 212.25it/s]  
Loss: 0.055765699595212936: 100%|██████████| 16/16 [00:00<00:00, 225.74it/s]
(pid=245, ip=172.30.27.4) /opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/model/base_pytorch_model.py:180: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
(pid=245, ip=172.30.27.4)   return torch.from_numpy(inp)
(pid=245, ip=172.30.27.4) 
  0%|          | 0/16 [00:00<?, ?it/s]/usr/local/envs/pytf1/lib/python3.7/site-packages/torch/autograd/__init__.py:132: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
(pid=245, ip=172.30.27.4)   allow_unreachable=True)  # allow_unreachable flag
(pid=245, ip=172.30.27.4) 
Loss: 0.6456587314605713:   0%|          | 0/16 [00:00<?, ?it/s]
(pid=244, ip=172.30.27.4) 2021-11-04 00:35:35,556  ERROR function_runner.py:254 -- Runner Thread raised error.
(pid=244, ip=172.30.27.4) Traceback (most recent call last):
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=244, ip=172.30.27.4)     self._entrypoint()
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 316, in entrypoint
(pid=244, ip=172.30.27.4)     self._status_reporter.get_checkpoint())
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=244, ip=172.30.27.4)     output = fn()
(pid=244, ip=172.30.27.4)   File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 325, in train_func
(pid=244, ip=172.30.27.4)   File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 72, in put_ckpt_hdfs
(pid=244, ip=172.30.27.4)     if remote_ckpt_basename not in get_remote_list(remote_dir):
(pid=244, ip=172.30.27.4)   File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 46, in get_remote_list
(pid=244, ip=172.30.27.4)     s_output, _ = process(args)
(pid=244, ip=172.30.27.4) TypeError: cannot unpack non-iterable NoneType object
(pid=244, ip=172.30.27.4) Exception in thread Thread-2:
(pid=244, ip=172.30.27.4) Traceback (most recent call last):
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/threading.py", line 926, in _bootstrap_inner
(pid=244, ip=172.30.27.4)     self.run()
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 267, in run
(pid=244, ip=172.30.27.4)     raise e
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=244, ip=172.30.27.4)     self._entrypoint()
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 316, in entrypoint
(pid=244, ip=172.30.27.4)     self._status_reporter.get_checkpoint())
(pid=244, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/tune/function_runner.py", line 576, in _trainable_func
(pid=244, ip=172.30.27.4)     output = fn()
(pid=244, ip=172.30.27.4)   File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/ray_tune/ray_tune_search_engine.py", line 325, in train_func
(pid=244, ip=172.30.27.4)   File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 72, in put_ckpt_hdfs
(pid=244, ip=172.30.27.4)     if remote_ckpt_basename not in get_remote_list(remote_dir):
(pid=244, ip=172.30.27.4)   File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/automl/search/utils.py", line 46, in get_remote_list
(pid=244, ip=172.30.27.4)     s_output, _ = process(args)
(pid=244, ip=172.30.27.4) TypeError: cannot unpack non-iterable NoneType object
(pid=244, ip=172.30.27.4) 
(pid=245, ip=172.30.27.4) 
Loss: 0.4749995172023773:   6%|▋         | 1/16 [00:00<00:00, 48.86it/s]
Loss: 0.3644247055053711:  12%|█▎        | 2/16 [00:00<00:00, 81.42it/s]
Loss: 0.19700123369693756:  19%|█▉        | 3/16 [00:00<00:00, 105.65it/s]
Loss: 0.15083497762680054:  25%|██▌       | 4/16 [00:00<00:00, 123.93it/s]
Loss: 0.1125955805182457:  31%|███▏      | 5/16 [00:00<00:00, 138.76it/s] 
Loss: 0.07053384184837341:  38%|███▊      | 6/16 [00:00<00:00, 150.92it/s]
Loss: 0.04681260883808136:  44%|████▍     | 7/16 [00:00<00:00, 161.47it/s]
Loss: 0.02035798318684101:  50%|█████     | 8/16 [00:00<00:00, 170.66it/s]
Loss: 0.012909774668514729:  56%|█████▋    | 9/16 [00:00<00:00, 178.95it/s]
Loss: 0.0078040556982159615:  62%|██████▎   | 10/16 [00:00<00:00, 186.17it/s]
Loss: 0.04752806946635246:  69%|██████▉   | 11/16 [00:00<00:00, 192.78it/s]  
Loss: 0.019220085814595222:  75%|███████▌  | 12/16 [00:00<00:00, 198.82it/s]
Loss: 0.010350744239985943:  81%|████████▏ | 13/16 [00:00<00:00, 200.81it/s]
Loss: 0.0005109629710204899:  88%|████████▊ | 14/16 [00:00<00:00, 206.25it/s]
(pid=244, ip=172.30.27.4) 
(pid=244, ip=172.30.27.4) /bin/sh: hdfs: command not found
Le-Zheng commented 2 years ago

@yushan111

shanyu-sys commented 2 years ago

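AutoEstimator currently supports distributed mode only on clusters where HDFS is available, so it doesn't support k8s for now. In this run the `hdfs` command is missing from the container (`/bin/sh: hdfs: command not found`), which is why `get_remote_list` fails when it tries to unpack the result of `process(args)`.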
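
For anyone hitting the same trace: the `TypeError: cannot unpack non-iterable NoneType object` is what Python raises for `a, b = None`, so `process(args)` presumably returned `None` after the shell could not find `hdfs`. Below is a minimal sketch of a fail-fast check that would surface the real cause instead. It is not the BigDL implementation; `run_cmd`, `list_remote_dir` and the `hdfs_bin` parameter are hypothetical names, and the parsing of `hdfs dfs -ls` output (path as the last whitespace-separated field, a leading "Found N items" line) is an assumption.

```python
import shutil
import subprocess


def run_cmd(args):
    """Run a command and always return (stdout, stderr) as text.

    Raises instead of returning None, so callers can safely unpack the result.
    """
    result = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        raise RuntimeError(
            "command {} failed: {}".format(args, result.stderr.strip())
        )
    return result.stdout, result.stderr


def list_remote_dir(remote_dir, hdfs_bin="hdfs"):
    """List basenames under remote_dir (hypothetical helper).

    Checks that the hdfs CLI is on PATH first, so a missing binary gives a
    clear error rather than a NoneType unpacking failure later on.
    """
    if shutil.which(hdfs_bin) is None:
        raise RuntimeError(
            "'{}' not found on PATH; an hdfs:// checkpoint path needs the "
            "Hadoop client installed in the worker image".format(hdfs_bin)
        )
    out, _ = run_cmd([hdfs_bin, "dfs", "-ls", remote_dir])
    names = []
    for line in out.splitlines():
        # Assumption: skip the "Found N items" header and blank lines.
        if not line.strip() or line.startswith("Found"):
            continue
        path = line.split()[-1]
        names.append(path.rsplit("/", 1)[-1])
    return names
```

With a check like this, the k8s job would stop with a message naming the missing `hdfs` binary, which points directly at the unsupported setup described above.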