ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

NCCL Proxy Call to rank 1 failed - on Cloud VM Docker setup for huggingface distributed ray train script #40758

Open smiraldr opened 11 months ago

smiraldr commented 11 months ago

What happened + What you expected to happen

The official Ray Train docs script fails with an NCCL error: `Proxy Call to rank 1 failed (Connect)`.

(base) root@smiral-0:~# python3 hf.py 
2023-10-27 13:07:32.861811: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-27 13:07:33.637610: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-10-27 13:07:33.637699: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-10-27 13:07:33.637705: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
comet_ml is installed but `COMET_API_KEY` is not set.

                 Aim collects anonymous usage analytics.                 
                        Read how to opt-out here:                         
    https://aimstack.readthedocs.io/en/latest/community/telemetry.html    

2023-10-27 13:07:36,837 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.0.0.30:6379...
2023-10-27 13:07:36,847 INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at 10.0.0.30:8265 
2023-10-27 13:07:36,898 INFO tune.py:228 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Trainer(...)`.
2023-10-27 13:07:36,900 INFO tune.py:654 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949

View detailed results here: /home/ray/ray_results/TorchTrainer_2023-10-27_13-07-36
To visualize your results with TensorBoard, run: `tensorboard --logdir /home/ray/ray_results/TorchTrainer_2023-10-27_13-07-36`
(TrainTrainable pid=249, ip=10.0.0.78) comet_ml is installed but `COMET_API_KEY` is not set.
(TrainTrainable pid=249, ip=10.0.0.78) 2023-10-27 13:07:41.296436: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(TrainTrainable pid=249, ip=10.0.0.78) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(TrainTrainable pid=249, ip=10.0.0.78) 2023-10-27 13:07:42.245360: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(TrainTrainable pid=249, ip=10.0.0.78) 2023-10-27 13:07:42.245466: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(TrainTrainable pid=249, ip=10.0.0.78) 2023-10-27 13:07:42.245476: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(TrainTrainable pid=249, ip=10.0.0.78) --------------------------------------------------------------------------
(TrainTrainable pid=249, ip=10.0.0.78)                  Aim collects anonymous usage analytics.                 
(TrainTrainable pid=249, ip=10.0.0.78)                         Read how to opt-out here:                         
(TrainTrainable pid=249, ip=10.0.0.78)     https://aimstack.readthedocs.io/en/latest/community/telemetry.html    
(TrainTrainable pid=249, ip=10.0.0.78) --------------------------------------------------------------------------

Training started without custom configuration.
(TorchTrainer pid=249, ip=10.0.0.78) Starting distributed worker processes: ['301 (10.0.0.78)', '192 (10.0.0.19)']
(RayTrainWorker pid=301, ip=10.0.0.78) Setting up process group for: env:// [rank=0, world_size=2]
(RayTrainWorker pid=301, ip=10.0.0.78) comet_ml is installed but `COMET_API_KEY` is not set.
(RayTrainWorker pid=192, ip=10.0.0.19) 2023-10-27 13:07:48.587834: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(RayTrainWorker pid=192, ip=10.0.0.19) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(RayTrainWorker pid=192, ip=10.0.0.19) 2023-10-27 13:07:49.590085: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(RayTrainWorker pid=192, ip=10.0.0.19) 2023-10-27 13:07:49.590166: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(RayTrainWorker pid=192, ip=10.0.0.19) 2023-10-27 13:07:49.590175: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(RayTrainWorker pid=192, ip=10.0.0.19) -------------------------------------------------------------------------- [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayTrainWorker pid=192, ip=10.0.0.19)                  Aim collects anonymous usage analytics.                 
(RayTrainWorker pid=192, ip=10.0.0.19)                         Read how to opt-out here:                         
(RayTrainWorker pid=192, ip=10.0.0.19)     https://aimstack.readthedocs.io/en/latest/community/telemetry.html    
Downloading builder script: 4.39kB [00:00, 7.57MB/s]                   
Downloading metadata: 2.13kB [00:00, 12.5MB/s]                   
(RayTrainWorker pid=192, ip=10.0.0.19) Downloading and preparing dataset yelp_review_full/yelp_review_full (download: 187.06 MiB, generated: 496.94 MiB, post-processed: Unknown size, total: 684.00 MiB) to /home/ray/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/13c31a618ba62568ec8572a222a283dfc29a6517776a3ac5945fb508877dde43...
Downloading data:   0%|          | 0.00/196M [00:00<?, ?B/s]
(RayTrainWorker pid=192, ip=10.0.0.19) comet_ml is installed but `COMET_API_KEY` is not set.
(RayTrainWorker pid=301, ip=10.0.0.78) 2023-10-27 13:07:48.573918: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(RayTrainWorker pid=301, ip=10.0.0.78) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Downloading data:   0%|          | 275k/196M [00:00<01:11, 2.74MB/s]
Downloading data:   1%|▏         | 2.55M/196M [00:00<00:13, 14.5MB/s]
Downloading data:   5%|▌         | 10.5M/196M [00:00<00:07, 24.0MB/s]
Downloading data:   9%|▉         | 17.6M/196M [00:00<00:05, 32.2MB/s]
(RayTrainWorker pid=301, ip=10.0.0.78) 2023-10-27 13:07:49.554932: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 [repeated 2x across cluster]
(RayTrainWorker pid=301, ip=10.0.0.78) 2023-10-27 13:07:49.554939: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Downloading data:  11%|█         | 20.7M/196M [00:00<00:05, 30.7MB/s]
Downloading data:  15%|█▍        | 29.0M/196M [00:00<00:03, 43.9MB/s]
Downloading builder script: 4.39kB [00:00, 19.3MB/s]                   
Downloading metadata: 2.13kB [00:00, 10.4MB/s]                   
Downloading data:  95%|█████████▍| 186M/196M [00:04<00:00, 57.4MB/s]
Downloading data: 100%|██████████| 196M/196M [00:04<00:00, 39.5MB/s]
Downloading data:   0%|          | 0.00/196M [00:00<?, ?B/s]
Downloading data:  74%|███████▍  | 145M/196M [00:04<00:01, 50.1MB/s] [repeated 66x across cluster]
Generating train split:   0%|          | 0/650000 [00:00<?, ? examples/s]
Generating train split:   1%|          | 3267/650000 [00:00<00:19, 32666.53 examples/s]
Generating train split:   1%|          | 6538/650000 [00:00<00:19, 32634.86 examples/s]
Generating train split:   2%|▏         | 9802/650000 [00:00<00:19, 32173.33 examples/s]
Generating train split:   2%|▏         | 13021/650000 [00:00<00:21, 29565.11 examples/s]
Generating train split:   3%|▎         | 16356/650000 [00:00<00:20, 30848.32 examples/s]
Generating train split:   3%|▎         | 19637/650000 [00:00<00:20, 31486.84 examples/s]
Generating train split:   4%|▎         | 22806/650000 [00:00<00:20, 30771.04 examples/s]
Generating train split:   4%|▍         | 26290/650000 [00:00<00:19, 32018.05 examples/s]
Generating train split:   5%|▍         | 29639/650000 [00:00<00:19, 32466.46 examples/s]
Generating train split:   5%|▌         | 32897/650000 [00:01<00:19, 31116.62 examples/s]
Generating train split:   6%|▌         | 36203/650000 [00:01<00:19, 31685.33 examples/s]
Generating train split:   6%|▌         | 39598/650000 [00:01<00:18, 32352.53 examples/s]
Generating train split:   7%|▋         | 42846/650000 [00:01<00:19, 31504.91 examples/s]
Downloading data:  97%|█████████▋| 190M/196M [00:05<00:00, 49.3MB/s] [repeated 3x across cluster]
Downloading data: 100%|██████████| 196M/196M [00:05<00:00, 36.2MB/s]
Downloading data:  89%|████████▉ | 175M/196M [00:04<00:00, 49.7MB/s] [repeated 6x across cluster]
Generating train split:   0%|          | 0/650000 [00:00<?, ? examples/s]
Generating train split:  26%|██▌       | 168111/650000 [00:05<00:15, 30936.80 examples/s] [repeated 73x across cluster]
Generating train split:  48%|████▊     | 314741/650000 [00:10<00:10, 31827.49 examples/s]
Generating train split:  41%|████▏     | 269356/650000 [00:08<00:12, 30778.09 examples/s] [repeated 89x across cluster]
Generating train split:  64%|██████▍   | 419126/650000 [00:13<00:09, 25081.02 examples/s] [repeated 91x across cluster]
Generating train split:  92%|█████████▏| 596313/650000 [00:19<00:01, 32090.18 examples/s]
Generating train split:  92%|█████████▏| 599606/650000 [00:19<00:01, 32334.63 examples/s]
Generating train split:  93%|█████████▎| 602850/650000 [00:19<00:01, 31170.37 examples/s]
Generating train split:  93%|█████████▎| 606409/650000 [00:19<00:01, 32441.70 examples/s]
Generating train split:  94%|█████████▍| 609879/650000 [00:19<00:01, 33099.22 examples/s]
Generating train split:  94%|█████████▍| 613202/650000 [00:19<00:01, 31960.80 examples/s]
Generating train split:  95%|█████████▍| 616414/650000 [00:20<00:01, 17698.44 examples/s]
Generating train split:  95%|█████████▌| 619922/650000 [00:20<00:01, 20922.10 examples/s]
Generating train split:  96%|█████████▌| 622799/650000 [00:20<00:01, 22565.31 examples/s]
Generating train split:  88%|████████▊ | 574987/650000 [00:19<00:02, 30604.77 examples/s] [repeated 85x across cluster]
Generating train split:  96%|█████████▋| 626053/650000 [00:20<00:00, 24860.26 examples/s]
Generating train split:  97%|█████████▋| 629429/650000 [00:20<00:00, 27061.78 examples/s]
Generating train split:  97%|█████████▋| 632510/650000 [00:20<00:00, 27423.92 examples/s]
Generating train split:  98%|█████████▊| 636006/650000 [00:20<00:00, 29438.77 examples/s]
Generating train split:  98%|█████████▊| 639430/650000 [00:20<00:00, 30764.39 examples/s]
Generating train split:  99%|█████████▉| 642665/650000 [00:20<00:00, 29839.58 examples/s]
Generating test split:   0%|          | 0/50000 [00:00<?, ? examples/s]                  
Generating test split:   7%|▋         | 3306/50000 [00:00<00:01, 33049.95 examples/s]
Generating test split:  13%|█▎        | 6737/50000 [00:00<00:01, 33786.91 examples/s]
Generating test split:  20%|██        | 10116/50000 [00:00<00:01, 30716.99 examples/s]
Generating test split:  27%|██▋       | 13481/50000 [00:00<00:01, 31805.62 examples/s]
Generating test split:  34%|███▍      | 16947/50000 [00:00<00:01, 32795.91 examples/s]
Generating test split:  40%|████      | 20247/50000 [00:00<00:00, 31810.60 examples/s]
Generating test split:  47%|████▋     | 23686/50000 [00:00<00:00, 32618.73 examples/s]
Generating test split:  54%|█████▍    | 27158/50000 [00:00<00:00, 33269.35 examples/s]
Generating test split:  61%|██████    | 30497/50000 [00:00<00:00, 31725.25 examples/s]
Generating test split:  68%|██████▊   | 33955/50000 [00:01<00:00, 32564.90 examples/s]
Generating test split:  75%|███████▍  | 37457/50000 [00:01<00:00, 33290.23 examples/s]
Generating test split:  82%|████████▏ | 40801/50000 [00:01<00:00, 32023.71 examples/s]
Generating test split:  89%|████████▊ | 44313/50000 [00:01<00:00, 32918.70 examples/s]
100%|██████████| 2/2 [00:00<00:00, 585.76it/s]                                        
(RayTrainWorker pid=301, ip=10.0.0.78) Dataset yelp_review_full downloaded and prepared to /home/ray/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/13c31a618ba62568ec8572a222a283dfc29a6517776a3ac5945fb508877dde43. Subsequent calls will reuse this data.
(RayTrainWorker pid=301, ip=10.0.0.78) Downloading and preparing dataset yelp_review_full/yelp_review_full (download: 187.06 MiB, generated: 496.94 MiB, post-processed: Unknown size, total: 684.00 MiB) to /home/ray/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/13c31a618ba62568ec8572a222a283dfc29a6517776a3ac5945fb508877dde43...
Downloading: 100%|██████████| 29.0/29.0 [00:00<00:00, 107kB/s]
Downloading: 100%|██████████| 570/570 [00:00<00:00, 4.23MB/s]
Downloading: 100%|██████████| 208k/208k [00:00<00:00, 33.3MB/s]
Generating test split:   0%|          | 0/50000 [00:00<?, ? examples/s]                  
Downloading: 100%|██████████| 426k/426k [00:00<00:00, 25.2MB/s]
  0%|          | 0/1 [00:00<?, ?ba/s]) 
(RayTrainWorker pid=301, ip=10.0.0.78) [13:08:25] WARNING  Parameter 'function'=<function            fingerprint.py:328
(RayTrainWorker pid=301, ip=10.0.0.78)                     train_func.<locals>.tokenize_function at                    
(RayTrainWorker pid=301, ip=10.0.0.78)                     0x7fc3801ed940> of the transform                            
(RayTrainWorker pid=301, ip=10.0.0.78)                     datasets.arrow_dataset.Dataset._map_singl                   
(RayTrainWorker pid=301, ip=10.0.0.78)                     e couldn't be hashed properly, a random                     
(RayTrainWorker pid=301, ip=10.0.0.78)                     hash was used instead. Make sure your                       
(RayTrainWorker pid=301, ip=10.0.0.78)                     transforms and parameters are                               
(RayTrainWorker pid=301, ip=10.0.0.78)                     serializable with pickle or dill for the                    
(RayTrainWorker pid=301, ip=10.0.0.78)                     dataset fingerprinting and caching to                       
(RayTrainWorker pid=301, ip=10.0.0.78)                     work. If you reuse this transform, the                      
(RayTrainWorker pid=301, ip=10.0.0.78)                     caching mechanism will consider it to be                    
(RayTrainWorker pid=301, ip=10.0.0.78)                     different from the previous calls and                       
(RayTrainWorker pid=301, ip=10.0.0.78)                     recompute everything. This warning is                       
(RayTrainWorker pid=301, ip=10.0.0.78)                     only showed once. Subsequent hashing                        
(RayTrainWorker pid=301, ip=10.0.0.78)                     failures won't be showed.                                   
100%|██████████| 1/1 [00:00<00:00,  2.62ba/s]
  0%|          | 0/1 [00:00<?, ?ba/s]) 
100%|██████████| 1/1 [00:00<00:00,  3.71ba/s]
Generating train split: 100%|█████████▉| 649240/650000 [00:21<00:00, 30733.13 examples/s] [repeated 19x across cluster]
Generating test split:  92%|█████████▏| 46077/50000 [00:01<00:00, 31515.86 examples/s]
Downloading:   0%|          | 0.00/416M [00:00<?, ?B/s]
Downloading:   0%|          | 225k/416M [00:00<03:16, 2.22MB/s]
Downloading:   0%|          | 589k/416M [00:00<02:25, 3.00MB/s]
Downloading:   0%|          | 1.10M/416M [00:00<01:49, 3.99MB/s]
Downloading:   0%|          | 1.88M/416M [00:00<01:19, 5.43MB/s]
Downloading:   1%|          | 3.03M/416M [00:00<00:57, 7.53MB/s]
Downloading:   1%|          | 4.70M/416M [00:00<00:40, 10.8MB/s]
Downloading:   2%|▏         | 6.90M/416M [00:00<00:29, 14.6MB/s]
Downloading:   2%|▏         | 10.1M/416M [00:00<00:20, 20.6MB/s]
Generating train split:  91%|█████████ | 591483/650000 [00:19<00:01, 30362.73 examples/s] [repeated 5x across cluster]
Downloading:   3%|▎         | 14.5M/416M [00:00<00:14, 28.4MB/s]
Downloading:   5%|▌         | 20.9M/416M [00:01<00:10, 40.3MB/s]
Downloading:   7%|▋         | 28.7M/416M [00:01<00:07, 52.7MB/s]
Generating test split:  85%|████████▌ | 42623/50000 [00:01<00:00, 30305.49 examples/s] [repeated 13x across cluster]
100%|██████████| 2/2 [00:00<00:00, 609.02it/s]                                        
Downloading: 100%|██████████| 426k/426k [00:00<00:00, 27.3MB/s] [repeated 4x across cluster]
  0%|          | 0/1 [00:00<?, ?ba/s] [repeated 2x across cluster]
100%|██████████| 1/1 [00:00<00:00,  4.17ba/s] [repeated 2x across cluster]
Downloading:   0%|          | 0.00/416M [00:00<?, ?B/s]
Downloading:  25%|██▍       | 102M/416M [00:05<00:17, 19.0MB/s]  [repeated 34x across cluster]
Downloading:  56%|█████▌    | 232M/416M [00:10<00:12, 15.0MB/s] [repeated 59x across cluster]
Downloading:  90%|█████████ | 374M/416M [00:16<00:02, 16.4MB/s] [repeated 56x across cluster]
Downloading:  92%|█████████▏| 383M/416M [00:16<00:02, 15.4MB/s]
Downloading: 100%|██████████| 416M/416M [00:17<00:00, 25.6MB/s]
(RayTrainWorker pid=192, ip=10.0.0.19) Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
(RayTrainWorker pid=192, ip=10.0.0.19) - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=192, ip=10.0.0.19) - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(RayTrainWorker pid=192, ip=10.0.0.19) Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
(RayTrainWorker pid=192, ip=10.0.0.19) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(RayTrainWorker pid=301, ip=10.0.0.78) Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
(RayTrainWorker pid=301, ip=10.0.0.78) Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 12.0MB/s]
(RayTrainWorker pid=192, ip=10.0.0.19) The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
(RayTrainWorker pid=192, ip=10.0.0.19) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
(RayTrainWorker pid=192, ip=10.0.0.19)   warnings.warn(
2023-10-27 13:08:48,846 ERROR tune_controller.py:1502 -- Trial task failed for trial TorchTrainer_79388_00000
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 2547, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DistBackendError): ray::_Inner.train() (pid=249, ip=10.0.0.78, actor_id=37a2771439edeb1576a519a301000000, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 400, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(DistBackendError): ray::_RayTrainWorker__execute.get_next() (pid=192, ip=10.0.0.19, actor_id=a3e4eb7bfa5c10d44e303ba301000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7fbfd5e7a100>)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/home/ray/hf.py", line 65, in train_func
  File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1317, in train
    return inner_training_loop(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1402, in _inner_training_loop
    model = self._wrap_model(self.model_wrapped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1230, in _wrap_model
    model = nn.parallel.DistributedDataParallel(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 1 failed (Connect)

Training errored after 0 iterations at 2023-10-27 13:08:48. Total running time: 1min 11s
Error file: /home/ray/ray_results/TorchTrainer_2023-10-27_13-07-36/TorchTrainer_79388_00000_0_2023-10-27_13-07-37/error.txt

2023-10-27 13:08:48,855 ERROR tune.py:1139 -- Trials did not complete: [TorchTrainer_79388_00000]
RayTaskError(DistBackendError): [36mray::_Inner.train()[39m (pid=249, ip=10.0.0.78, actor_id=37a2771439edeb1576a519a301000000, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 400, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(DistBackendError): [36mray::_RayTrainWorker__execute.get_next()[39m (pid=192, ip=10.0.0.19, actor_id=a3e4eb7bfa5c10d44e303ba301000000, 
repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7fbfd5e7a100>)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/home/ray/hf.py", line 65, in train_func
  File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1317, in train
    return inner_training_loop(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1402, in _inner_training_loop
    model = self._wrap_model(self.model_wrapped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1230, in _wrap_model
    model = nn.parallel.DistributedDataParallel(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 1 failed (Connect)

The above exception was the direct cause of the following exception:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ray/hf.py:72 in <module>                                                                   │
│                                                                                                  │
│   69 ray_trainer = TorchTrainer(                                                                 │
│   70 │   train_func, scaling_config=ScalingConfig(num_workers=2, use_gpu=True)                   │
│   71 )                                                                                           │
│ ❱ 72 ray_trainer.fit()                                                                           │
│                                                                                                  │
│ /home/ray/anaconda3/lib/python3.9/site-packages/ray/train/base_trainer.py:668 in fit             │
│                                                                                                  │
│   665 │   │   if result.error:                                                                   │
│   666 │   │   │   # Raise trainable errors to the user with a message to restore                 │
│   667 │   │   │   # or configure `FailureConfig` in a new run.                                   │
│ ❱ 668 │   │   │   raise TrainingFailedError(                                                     │
│   669 │   │   │   │   "\n".join([restore_msg, TrainingFailedError._FAILURE_CONFIG_MSG])          │
│   670 │   │   │   ) from result.error                                                            │
│   671 │   │   return result                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TrainingFailedError: The Ray Train run failed. Please inspect the previous error messages for a cause. After fixing the issue (assuming that the error is not caused by your own application 
logic, but rather an error such as OOM), you can restart the run from scratch or continue this run.
To continue this run, you can use: `trainer = TorchTrainer.restore("/home/ray/ray_results/TorchTrainer_2023-10-27_13-07-36")`.
To start a new run that will retry on training failures, set `train.RunConfig(failure_config=train.FailureConfig(max_failures))` in the Trainer's `run_config` with `max_failures > 0`, or 
`max_failures = -1` for unlimited retries.
(RayTrainWorker pid=192, ip=10.0.0.19) Dataset yelp_review_full downloaded and prepared to /home/ray/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/13c31a618ba62568ec8572a222a283dfc29a6517776a3ac5945fb508877dde43. Subsequent calls will reuse this data.
(RayTrainWorker pid=192, ip=10.0.0.19) [13:08:27] WARNING  Parameter 'function'=<function            fingerprint.py:328
(RayTrainWorker pid=192, ip=10.0.0.19)                     train_func.<locals>.tokenize_function at                    
(RayTrainWorker pid=192, ip=10.0.0.19)                     0x7fbe6367c040> of the transform                            
(RayTrainWorker pid=192, ip=10.0.0.19)                     datasets.arrow_dataset.Dataset._map_singl                   
(RayTrainWorker pid=192, ip=10.0.0.19)                     e couldn't be hashed properly, a random                     
(RayTrainWorker pid=192, ip=10.0.0.19)                     hash was used instead. Make sure your                       
(RayTrainWorker pid=192, ip=10.0.0.19)                     transforms and parameters are                               
(RayTrainWorker pid=192, ip=10.0.0.19)                     serializable with pickle or dill for the                    
(RayTrainWorker pid=192, ip=10.0.0.19)                     dataset fingerprinting and caching to                       
(RayTrainWorker pid=192, ip=10.0.0.19)                     work. If you reuse this transform, the                      
(RayTrainWorker pid=192, ip=10.0.0.19)                     caching mechanism will consider it to be                    
(RayTrainWorker pid=192, ip=10.0.0.19)                     different from the previous calls and                       
(RayTrainWorker pid=192, ip=10.0.0.19)                     recompute everything. This warning is                       
(RayTrainWorker pid=192, ip=10.0.0.19)                     only showed once. Subsequent hashing                        
(RayTrainWorker pid=192, ip=10.0.0.19)                     failures won't be showed.                                   
Downloading:  91%|█████████ | 377M/416M [00:14<00:03, 13.4MB/s] [repeated 3x across cluster]
Downloading: 100%|██████████| 416M/416M [00:18<00:00, 23.4MB/s] [repeated 17x across cluster]
(RayTrainWorker pid=301, ip=10.0.0.78) - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=301, ip=10.0.0.78) - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(RayTrainWorker pid=301, ip=10.0.0.78) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 14.3MB/s]
(RayTrainWorker pid=301, ip=10.0.0.78) The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
(RayTrainWorker pid=301, ip=10.0.0.78) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
(RayTrainWorker pid=301, ip=10.0.0.78)   warnings.warn(

Versions / Dependencies

Run 3 Docker containers on different cloud VMs with GPUs and all ports open: docker run -it --rm --gpus all --shm-size=10.05gb --network=host rayproject/ray-ml:2.7.1-py39-gpu

Head Node Start command: ray start --head --port=6379 --num-cpus=0 --num-gpus=0 --include-dashboard=true --dashboard-host=0.0.0.0 --node-manager-port=1915 --object-manager-port=1916 --dashboard-agent-grpc-port=1917 --dashboard-agent-listen-port=1918 --disable-usage-stats --block

Worker Node Start command: ray start --address={public_head_ip}:6379 --node-manager-port=1915 --object-manager-port=1916 --dashboard-agent-grpc-port=1917 --dashboard-agent-listen-port=1918 --disable-usage-stats --block


As with any cloud, each VM has 2 network interfaces, one for the internal IP and one for the external IP; the Docker network interface could be counted as a third in this setup. This might be related to https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#ip-network-interfaces, but I need some help debugging this.
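For reference, here is a small helper I would run inside each container to see which interfaces and IPv4 addresses NCCL can actually pick from (just a debugging sketch of mine; it assumes psutil is importable, which it is in the ray-ml image):

import socket
import psutil  # assumed available; otherwise `pip install psutil`

# List every interface and its IPv4 addresses so they can be compared
# against the internal VM IP (10.0.0.x here) and the Docker bridge.
for name, addrs in psutil.net_if_addrs().items():
    ipv4 = [a.address for a in addrs if a.family == socket.AF_INET]
    print(f"{name}: {ipv4}")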

Reproduction script

Same as the official script, but with 2 workers instead of 4.

import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    Trainer,
    TrainingArguments,
    AutoTokenizer,
    AutoModelForSequenceClassification,
)

import ray.train.huggingface.transformers
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# [1] Encapsulate data preprocessing, training, and evaluation
# logic in a training function
# ============================================================
def train_func(config):
    # Datasets
    dataset = load_dataset("yelp_review_full")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    small_train_dataset = dataset["train"].select(range(1000)).map(tokenize_function, batched=True)
    small_eval_dataset = dataset["test"].select(range(1000)).map(tokenize_function, batched=True)

    # Model
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=5
    )

    # Evaluation Metrics
    metric = evaluate.load("accuracy")

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    # Hugging Face Trainer
    training_args = TrainingArguments(
        output_dir="test_trainer", evaluation_strategy="epoch", report_to="none"
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=small_train_dataset,
        eval_dataset=small_eval_dataset,
        compute_metrics=compute_metrics,
    )

    # [2] Report Metrics and Checkpoints to Ray Train
    # ================================================

    callback = ray.train.huggingface.transformers.RayTrainReportCallback()
    trainer.add_callback(callback)

    # [3] Prepare Transformers Trainer
    # ================================
    trainer = ray.train.huggingface.transformers.prepare_trainer(trainer)

    # Start Training
    trainer.train()

# [4] Define a Ray TorchTrainer to launch `train_func` on all workers
# ===================================================================
ray_trainer = TorchTrainer(
    train_func, scaling_config=ScalingConfig(num_workers=2, use_gpu=True)
)
ray_trainer.fit()

Issue Severity

High: It blocks me from completing my task.

matthewdeng commented 11 months ago

Can you try setting `NCCL_DEBUG` to `INFO` in your training script to see if it surfaces any additional information?

def train_func(config):
    import os
    os.environ["NCCL_DEBUG"] = "INFO"
    ...
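If the INFO output alone is not enough, one possible extension (just a sketch, not something Ray sets for you) is to also narrow NCCL's debug output to the bootstrap and network subsystems, which is where "Proxy Call ... (Connect)" failures originate:

def train_func(config):
    import os
    # NCCL_DEBUG_SUBSYS is a standard NCCL env var; INIT and NET limit
    # the output to process-group bootstrap and network transport logs.
    os.environ["NCCL_DEBUG"] = "INFO"
    os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
    ...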
matthewdeng commented 11 months ago

Also, if you know which interface you want to use, you can propagate the value across your cluster by adding this (for example) to the top of your script:

import ray
ray.init(runtime_env={'env_vars': {'NCCL_SOCKET_IFNAME': 'ens5'}})

There is some logic in Ray Train that sets this by default, but it may not match your setup.
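To double-check that the value actually reaches both worker nodes before training starts, something like this rough sketch (using a SPREAD-scheduled task; 'ens5' is only an example interface name) can be run first:

import os
import ray

ray.init(runtime_env={'env_vars': {'NCCL_SOCKET_IFNAME': 'ens5'}})

# Spread a trivial task across nodes and report what each one sees.
@ray.remote(num_cpus=0, scheduling_strategy="SPREAD")
def report():
    import socket
    return socket.gethostname(), os.environ.get("NCCL_SOCKET_IFNAME")

print(ray.get([report.remote() for _ in range(2)]))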

smiraldr commented 11 months ago

Here are the logs after setting `NCCL_DEBUG` to `INFO`.

(base) root@smiral-0:~# python3 hf.py 
2023-10-27 23:58:19.762740: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-27 23:58:20.600691: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-10-27 23:58:20.600784: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-10-27 23:58:20.600794: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
comet_ml is installed but `COMET_API_KEY` is not set.
--------------------------------------------------------------------------
                 Aim collects anonymous usage analytics.                 
                        Read how to opt-out here:                         
    https://aimstack.readthedocs.io/en/latest/community/telemetry.html    
--------------------------------------------------------------------------
2023-10-27 23:58:25,053 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.0.0.30:6379...
2023-10-27 23:58:25,066 INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at 10.0.0.30:8265 
2023-10-27 23:58:25,129 INFO tune.py:228 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Trainer(...)`.
2023-10-27 23:58:25,132 INFO tune.py:654 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949

View detailed results here: /home/ray/ray_results/TorchTrainer_2023-10-27_23-58-25
To visualize your results with TensorBoard, run: `tensorboard --logdir /home/ray/ray_results/TorchTrainer_2023-10-27_23-58-25`
(TrainTrainable pid=249, ip=10.0.0.19) comet_ml is installed but `COMET_API_KEY` is not set.
(TrainTrainable pid=249, ip=10.0.0.19) 2023-10-27 23:58:29.393245: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(TrainTrainable pid=249, ip=10.0.0.19) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(TrainTrainable pid=249, ip=10.0.0.19) 2023-10-27 23:58:30.300658: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(TrainTrainable pid=249, ip=10.0.0.19) 2023-10-27 23:58:30.300742: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(TrainTrainable pid=249, ip=10.0.0.19) 2023-10-27 23:58:30.300749: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(TrainTrainable pid=249, ip=10.0.0.19) --------------------------------------------------------------------------
(TrainTrainable pid=249, ip=10.0.0.19)                  Aim collects anonymous usage analytics.                 
(TrainTrainable pid=249, ip=10.0.0.19)                         Read how to opt-out here:                         
(TrainTrainable pid=249, ip=10.0.0.19)     https://aimstack.readthedocs.io/en/latest/community/telemetry.html    
(TrainTrainable pid=249, ip=10.0.0.19) --------------------------------------------------------------------------

Training started without custom configuration.
(TorchTrainer pid=249, ip=10.0.0.19) Starting distributed worker processes: ['301 (10.0.0.19)', '192 (10.0.0.78)']
(RayTrainWorker pid=301, ip=10.0.0.19) Setting up process group for: env:// [rank=0, world_size=2]
(RayTrainWorker pid=301, ip=10.0.0.19) comet_ml is installed but `COMET_API_KEY` is not set.
(RayTrainWorker pid=301, ip=10.0.0.19) 2023-10-27 23:58:37.036058: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(RayTrainWorker pid=301, ip=10.0.0.19) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(RayTrainWorker pid=301, ip=10.0.0.19) 2023-10-27 23:58:38.061058: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(RayTrainWorker pid=301, ip=10.0.0.19) 2023-10-27 23:58:38.061139: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(RayTrainWorker pid=301, ip=10.0.0.19) 2023-10-27 23:58:38.061146: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(RayTrainWorker pid=192, ip=10.0.0.78) -------------------------------------------------------------------------- [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayTrainWorker pid=192, ip=10.0.0.78)                  Aim collects anonymous usage analytics.                 
(RayTrainWorker pid=192, ip=10.0.0.78)                         Read how to opt-out here:                         
(RayTrainWorker pid=192, ip=10.0.0.78)     https://aimstack.readthedocs.io/en/latest/community/telemetry.html    
(RayTrainWorker pid=192, ip=10.0.0.78) comet_ml is installed but `COMET_API_KEY` is not set.
Downloading builder script: 4.39kB [00:00, 11.9MB/s]                   
(RayTrainWorker pid=192, ip=10.0.0.78) 2023-10-27 23:58:37.081356: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(RayTrainWorker pid=192, ip=10.0.0.78) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(RayTrainWorker pid=301, ip=10.0.0.19) Downloading and preparing dataset yelp_review_full/yelp_review_full (download: 187.06 MiB, generated: 496.94 MiB, post-processed: Unknown size, total: 684.00 MiB) to /home/ray/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/13c31a618ba62568ec8572a222a283dfc29a6517776a3ac5945fb508877dde43...
Downloading metadata: 2.13kB [00:00, 6.15MB/s]                   
Downloading data:   0%|          | 0.00/196M [00:00<?, ?B/s]
(RayTrainWorker pid=192, ip=10.0.0.78) 2023-10-27 23:58:38.176368: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 [repeated 2x across cluster]
(RayTrainWorker pid=192, ip=10.0.0.78) 2023-10-27 23:58:38.176376: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Downloading data:   0%|          | 226k/196M [00:00<01:26, 2.26MB/s]
Downloading data:  92%|█████████▏| 181M/196M [00:03<00:00, 62.2MB/s]
Downloading data: 100%|██████████| 196M/196M [00:03<00:00, 50.7MB/s]
Downloading builder script: 4.39kB [00:00, 9.37MB/s]                   
Generating train split:   0%|          | 0/650000 [00:00<?, ? examples/s]
Downloading metadata: 2.13kB [00:00, 15.3MB/s]                   
Downloading data:   0%|          | 0.00/196M [00:00<?, ?B/s]
Downloading data:  90%|████████▉ | 176M/196M [00:03<00:00, 59.8MB/s] [repeated 52x across cluster]
Generating train split:   1%|          | 3263/650000 [00:00<00:19, 32627.93 examples/s]
Downloading data: 100%|██████████| 196M/196M [00:03<00:00, 50.2MB/s] [repeated 3x across cluster]
Generating train split:   0%|          | 0/650000 [00:00<?, ? examples/s]
Generating train split:  25%|██▌       | 164261/650000 [00:05<00:15, 31205.57 examples/s] [repeated 96x across cluster]
Generating train split:  49%|████▉     | 317500/650000 [00:10<00:10, 32682.20 examples/s] [repeated 91x across cluster]
Generating train split:  66%|██████▌   | 427209/650000 [00:13<00:07, 29431.57 examples/s]
Generating train split:  71%|███████   | 458535/650000 [00:15<00:06, 31793.58 examples/s] [repeated 88x across cluster]
Generating train split:  91%|█████████▏| 593923/650000 [00:19<00:01, 32115.99 examples/s]
Generating train split:  92%|█████████▏| 597255/650000 [00:19<00:01, 32463.23 examples/s]
Generating train split:  92%|█████████▏| 600516/650000 [00:19<00:01, 31005.58 examples/s]
Generating train split:  93%|█████████▎| 603957/650000 [00:19<00:01, 31976.24 examples/s]
Generating train split:  93%|█████████▎| 607420/650000 [00:19<00:01, 32743.04 examples/s]
Generating train split:  91%|█████████ | 591739/650000 [00:19<00:01, 31008.19 examples/s] [repeated 78x across cluster]
Generating test split:   0%|          | 0/50000 [00:00<?, ? examples/s]                  
Generating test split:   7%|▋         | 3484/50000 [00:00<00:01, 34823.93 examples/s]
Generating test split:  14%|█▍        | 6967/50000 [00:00<00:01, 34325.75 examples/s]
Generating test split:  21%|██        | 10401/50000 [00:00<00:01, 31644.46 examples/s]
Generating test split:  27%|██▋       | 13717/50000 [00:00<00:01, 32207.62 examples/s]
Generating test split:  34%|███▍      | 17075/50000 [00:00<00:01, 32684.78 examples/s]
Generating test split:  94%|█████████▍| 46891/50000 [00:01<00:00, 32385.22 examples/s]
(RayTrainWorker pid=301, ip=10.0.0.19) Dataset yelp_review_full downloaded and prepared to /home/ray/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/13c31a618ba62568ec8572a222a283dfc29a6517776a3ac5945fb508877dde43. Subsequent calls will reuse this data.
(RayTrainWorker pid=192, ip=10.0.0.78) Downloading and preparing dataset yelp_review_full/yelp_review_full (download: 187.06 MiB, generated: 496.94 MiB, post-processed: Unknown size, total: 684.00 MiB) to /home/ray/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/13c31a618ba62568ec8572a222a283dfc29a6517776a3ac5945fb508877dde43...
100%|██████████| 2/2 [00:00<00:00, 602.20it/s]                                        
Downloading: 100%|██████████| 29.0/29.0 [00:00<00:00, 113kB/s]
Downloading: 100%|██████████| 570/570 [00:00<00:00, 4.35MB/s]
Downloading: 100%|██████████| 208k/208k [00:00<00:00, 14.8MB/s]
Downloading:   0%|          | 0.00/208k [00:00<?, ?B/s]
(RayTrainWorker pid=301, ip=10.0.0.19) [23:59:14] WARNING  Parameter 'function'=<function            fingerprint.py:328
(RayTrainWorker pid=301, ip=10.0.0.19)                     train_func.<locals>.tokenize_function at                    
(RayTrainWorker pid=301, ip=10.0.0.19)                     0x7f0e012c7940> of the transform                            
(RayTrainWorker pid=301, ip=10.0.0.19)                     datasets.arrow_dataset.Dataset._map_singl                   
(RayTrainWorker pid=301, ip=10.0.0.19)                     e couldn't be hashed properly, a random                     
(RayTrainWorker pid=301, ip=10.0.0.19)                     hash was used instead. Make sure your                       
(RayTrainWorker pid=301, ip=10.0.0.19)                     transforms and parameters are                               
(RayTrainWorker pid=301, ip=10.0.0.19)                     serializable with pickle or dill for the                    
(RayTrainWorker pid=301, ip=10.0.0.19)                     dataset fingerprinting and caching to                       
(RayTrainWorker pid=301, ip=10.0.0.19)                     work. If you reuse this transform, the                      
(RayTrainWorker pid=301, ip=10.0.0.19)                     caching mechanism will consider it to be                    
(RayTrainWorker pid=301, ip=10.0.0.19)                     different from the previous calls and                       
(RayTrainWorker pid=301, ip=10.0.0.19)                     recompute everything. This warning is                       
(RayTrainWorker pid=301, ip=10.0.0.19)                     only showed once. Subsequent hashing                        
(RayTrainWorker pid=301, ip=10.0.0.19)                     failures won't be showed.                                   
  0%|          | 0/1 [00:00<?, ?ba/s]) 
Downloading: 100%|██████████| 208k/208k [00:00<00:00, 13.6MB/s]
100%|██████████| 1/1 [00:00<00:00,  2.71ba/s]
100%|██████████| 1/1 [00:00<00:00,  4.06ba/s]
Generating train split: 100%|█████████▉| 647512/650000 [00:21<00:00, 30351.26 examples/s] [repeated 29x across cluster]
Downloading:   0%|          | 728k/416M [00:00<00:58, 7.45MB/s]
Downloading:   2%|▏         | 6.50M/416M [00:00<00:11, 38.8MB/s]
Downloading:   3%|▎         | 14.5M/416M [00:00<00:07, 59.2MB/s]
Downloading:   6%|▌         | 23.0M/416M [00:00<00:05, 71.0MB/s]
Downloading:   8%|▊         | 31.4M/416M [00:00<00:05, 77.3MB/s]
Downloading:  10%|▉         | 39.9M/416M [00:00<00:04, 81.3MB/s]
Generating test split:   0%|          | 0/50000 [00:00<?, ? examples/s]                  
Generating test split:  88%|████████▊ | 43773/50000 [00:01<00:00, 31757.86 examples/s] [repeated 21x across cluster]
Generating test split:  94%|█████████▍| 47166/50000 [00:01<00:00, 32384.76 examples/s]
100%|██████████| 2/2 [00:00<00:00, 494.79it/s]                                        
Downloading: 100%|██████████| 426k/426k [00:00<00:00, 13.8MB/s] [repeated 4x across cluster]
Downloading:   0%|          | 0.00/416M [00:00<?, ?B/s] [repeated 2x across cluster]
  0%|          | 0/1 [00:00<?, ?ba/s] [repeated 3x across cluster]
100%|██████████| 1/1 [00:00<00:00,  4.03ba/s] [repeated 2x across cluster]
Downloading:  92%|█████████▏| 384M/416M [00:04<00:00, 86.6MB/s]
Downloading:  94%|█████████▍| 393M/416M [00:04<00:00, 87.0MB/s]
Downloading:  96%|█████████▋| 401M/416M [00:04<00:00, 86.7MB/s]
Downloading: 100%|██████████| 416M/416M [00:05<00:00, 85.3MB/s]
Downloading:  23%|██▎       | 97.1M/416M [00:04<00:11, 28.4MB/s] [repeated 63x across cluster]
(RayTrainWorker pid=301, ip=10.0.0.19) Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight']
(RayTrainWorker pid=301, ip=10.0.0.19) - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=301, ip=10.0.0.19) - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(RayTrainWorker pid=301, ip=10.0.0.19) Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
(RayTrainWorker pid=301, ip=10.0.0.19) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 12.8MB/s]
(RayTrainWorker pid=301, ip=10.0.0.19) The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
(RayTrainWorker pid=301, ip=10.0.0.19) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
(RayTrainWorker pid=301, ip=10.0.0.19)   warnings.warn(
(RayTrainWorker pid=301, ip=10.0.0.19) smiral-2:301:351 [0] NCCL INFO Bootstrap : Using ens3:10.0.0.19<0>
(RayTrainWorker pid=301, ip=10.0.0.19) smiral-2:301:351 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
(RayTrainWorker pid=301, ip=10.0.0.19) smiral-2:301:351 [0] NCCL INFO cudaDriverVersion 12030
(RayTrainWorker pid=301, ip=10.0.0.19) NCCL version 2.14.3+cuda11.8
(RayTrainWorker pid=192, ip=10.0.0.78) Dataset yelp_review_full downloaded and prepared to /home/ray/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/13c31a618ba62568ec8572a222a283dfc29a6517776a3ac5945fb508877dde43. Subsequent calls will reuse this data.
(RayTrainWorker pid=192, ip=10.0.0.78) [23:59:14] WARNING  Parameter 'function'=<function            fingerprint.py:328
(RayTrainWorker pid=192, ip=10.0.0.78)                     train_func.<locals>.tokenize_function at                    
(RayTrainWorker pid=192, ip=10.0.0.78)                     0x7f027d6c0040> of the transform                            
(RayTrainWorker pid=192, ip=10.0.0.78)                     datasets.arrow_dataset.Dataset._map_singl                   
(RayTrainWorker pid=192, ip=10.0.0.78)                     e couldn't be hashed properly, a random                     
(RayTrainWorker pid=192, ip=10.0.0.78)                     hash was used instead. Make sure your                       
(RayTrainWorker pid=192, ip=10.0.0.78)                     transforms and parameters are                               
(RayTrainWorker pid=192, ip=10.0.0.78)                     serializable with pickle or dill for the                    
(RayTrainWorker pid=192, ip=10.0.0.78)                     dataset fingerprinting and caching to                       
(RayTrainWorker pid=192, ip=10.0.0.78)                     work. If you reuse this transform, the                      
(RayTrainWorker pid=192, ip=10.0.0.78)                     caching mechanism will consider it to be                    
(RayTrainWorker pid=192, ip=10.0.0.78)                     different from the previous calls and                       
(RayTrainWorker pid=192, ip=10.0.0.78)                     recompute everything. This warning is                       
(RayTrainWorker pid=192, ip=10.0.0.78)                     only showed once. Subsequent hashing                        
(RayTrainWorker pid=192, ip=10.0.0.78)                     failures won't be showed.                                   
Downloading:  56%|█████▌    | 231M/416M [00:09<00:07, 25.9MB/s] [repeated 28x across cluster]
Downloading:  90%|█████████ | 374M/416M [00:14<00:02, 16.4MB/s] [repeated 29x across cluster]
Downloading:  92%|█████████▏| 383M/416M [00:15<00:02, 13.5MB/s]
Downloading:  93%|█████████▎| 385M/416M [00:15<00:02, 13.2MB/s]
Downloading:  94%|█████████▍| 391M/416M [00:16<00:01, 16.0MB/s]
Downloading:  94%|█████████▍| 392M/416M [00:16<00:01, 13.5MB/s]
Downloading:  96%|█████████▌| 398M/416M [00:16<00:01, 13.3MB/s]
Downloading:  96%|█████████▋| 400M/416M [00:16<00:01, 13.5MB/s]
Downloading:  98%|█████████▊| 406M/416M [00:17<00:00, 19.6MB/s]
Downloading:  98%|█████████▊| 409M/416M [00:17<00:00, 18.0MB/s]
Downloading: 100%|██████████| 416M/416M [00:17<00:00, 25.0MB/s]
(RayTrainWorker pid=192, ip=10.0.0.78) Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
(RayTrainWorker pid=192, ip=10.0.0.78) - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=192, ip=10.0.0.78) - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(RayTrainWorker pid=192, ip=10.0.0.78) Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
(RayTrainWorker pid=192, ip=10.0.0.78) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 12.0MB/s]
(RayTrainWorker pid=192, ip=10.0.0.78) The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
(RayTrainWorker pid=192, ip=10.0.0.78) /home/ray/anaconda3/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
(RayTrainWorker pid=192, ip=10.0.0.78)   warnings.warn(
2023-10-27 23:59:36,277 ERROR tune_controller.py:1502 -- Trial task failed for trial TorchTrainer_63c79_00000
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 2547, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DistBackendError): ray::_Inner.train() (pid=249, ip=10.0.0.19, actor_id=84a1163f1b5ae7a10e32c1e201000000, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 400, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(DistBackendError): ray::_RayTrainWorker__execute.get_next() (pid=301, ip=10.0.0.19, actor_id=a7efbd6c8cc674d17445260101000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f0f66c040a0>)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/home/ray/hf.py", line 67, in train_func
  File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1317, in train
    return inner_training_loop(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1402, in _inner_training_loop
    model = self._wrap_model(self.model_wrapped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1230, in _wrap_model
    model = nn.parallel.DistributedDataParallel(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 0 failed (Connect)

Training errored after 0 iterations at 2023-10-27 23:59:36. Total running time: 1min 10s
Error file: /home/ray/ray_results/TorchTrainer_2023-10-27_23-58-25/TorchTrainer_63c79_00000_0_2023-10-27_23-58-25/error.txt

2023-10-27 23:59:36,290 ERROR tune.py:1139 -- Trials did not complete: [TorchTrainer_63c79_00000]
RayTaskError(DistBackendError): [36mray::_Inner.train()[39m (pid=249, ip=10.0.0.19, actor_id=84a1163f1b5ae7a10e32c1e201000000, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 400, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(DistBackendError): [36mray::_RayTrainWorker__execute.get_next()[39m (pid=301, ip=10.0.0.19, actor_id=a7efbd6c8cc674d17445260101000000, 
repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f0f66c040a0>)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/home/ray/hf.py", line 67, in train_func
  File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1317, in train
    return inner_training_loop(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1402, in _inner_training_loop
    model = self._wrap_model(self.model_wrapped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1230, in _wrap_model
    model = nn.parallel.DistributedDataParallel(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 0 failed (Connect)

The above exception was the direct cause of the following exception:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ray/hf.py:74 in <module>                                                                   │
│                                                                                                  │
│   71 ray_trainer = TorchTrainer(                                                                 │
│   72 │   train_func, scaling_config=ScalingConfig(num_workers=2, use_gpu=True)                   │
│   73 )                                                                                           │
│ ❱ 74 ray_trainer.fit()                                                                           │
│                                                                                                  │
│ /home/ray/anaconda3/lib/python3.9/site-packages/ray/train/base_trainer.py:668 in fit             │
│                                                                                                  │
│   665 │   │   if result.error:                                                                   │
│   666 │   │   │   # Raise trainable errors to the user with a message to restore                 │
│   667 │   │   │   # or configure `FailureConfig` in a new run.                                   │
│ ❱ 668 │   │   │   raise TrainingFailedError(                                                     │
│   669 │   │   │   │   "\n".join([restore_msg, TrainingFailedError._FAILURE_CONFIG_MSG])          │
│   670 │   │   │   ) from result.error                                                            │
│   671 │   │   return result                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TrainingFailedError: The Ray Train run failed. Please inspect the previous error messages for a cause. After fixing the issue (assuming that the error is not caused by your own application 
logic, but rather an error such as OOM), you can restart the run from scratch or continue this run.
To continue this run, you can use: `trainer = TorchTrainer.restore("/home/ray/ray_results/TorchTrainer_2023-10-27_23-58-25")`.
To start a new run that will retry on training failures, set `train.RunConfig(failure_config=train.FailureConfig(max_failures))` in the Trainer's `run_config` with `max_failures > 0`, or 
`max_failures = -1` for unlimited retries.
Downloading:  91%|█████████ | 377M/416M [00:15<00:02, 15.7MB/s]
smiraldr commented 11 months ago

Why does ens3 fail here when it is implicitly inferred by NCCL? When I set ens3 explicitly, as you suggested, it seemed to work.

smiraldr commented 11 months ago

I have a follow-up question: if I am clustering machines that are not from the same provider, or that have different configurations, how can I know which interface works for each node? I wouldn't know which NCCL_SOCKET_IFNAME would work by default here. Can you suggest how to find that?
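
One thing I could do to probe this myself - a minimal sketch, assuming the script runs from a node that is already part of the cluster and the hosts are Linux machines with iproute2 available - is to fire a few Ray tasks that dump the interfaces each node sees:

import ray
import socket
import subprocess

ray.init(address="auto")

@ray.remote
def list_interfaces():
    # Runs on whichever node Ray schedules it on; returns that node's hostname
    # plus the IPv4 interfaces it can see.
    out = subprocess.run(
        ["ip", "-o", "-4", "addr", "show"], capture_output=True, text=True
    ).stdout
    return socket.gethostname(), out

# Fire a handful of tasks; they are not guaranteed to land on every node,
# but for a small cluster this is usually enough to see what each machine exposes.
for host, ifaces in ray.get([list_interfaces.remote() for _ in range(4)]):
    print(host)
    print(ifaces)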

smiraldr commented 11 months ago

ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:4b:d8:29 brd ff:ff:ff:ff:ff:ff
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:bd:78:6c:54 brd ff:ff:ff:ff:ff:ff
17: vethb29f745@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
    link/ether 52:55:92:03:b3:07 brd ff:ff:ff:ff:ff:ff link-netnsid 3
19: vethd6c7a39@if18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
    link/ether a2:f2:5c:2e:64:f1 brd ff:ff:ff:ff:ff:ff link-netnsid 5
32: netmaker: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/none

I used `import ray; ray.init(runtime_env={'env_vars': {'NCCL_SOCKET_IFNAME': 'ens3'}})` and it works.

Also, what if I had another VM with the following config:

ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:f0:7d:ae brd ff:ff:ff:ff:ff:ff
    altname enp0s3
    altname ens3
3: br-996f4fc0cad0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether 02:42:c9:f2:61:e4 brd ff:ff:ff:ff:ff:ff
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether 02:42:28:f4:0d:44 brd ff:ff:ff:ff:ff:ff

  1. How would I set different nodes to use different network interfaces for NCCL?
  2. How would I know, as a user, which NCCL_SOCKET_IFNAME to use if the infrastructure is abstracted away? (One possible approach is sketched below.)
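
For what it's worth, a possible approach to both - a minimal sketch that relies on NCCL's documented NCCL_SOCKET_IFNAME semantics rather than anything Ray-specific: the variable accepts a comma-separated list of interface-name prefixes, and a leading ^ turns it into an exclude list, so a single value can often cover nodes whose NICs are named differently (ens3 on one VM, eth0 on another):

import ray

# Set the variable through the runtime environment before building the trainer,
# so every Ray Train worker inherits it.
ray.init(
    address="auto",
    runtime_env={
        # Match any interface whose name starts with "ens" or "eth" ...
        "env_vars": {"NCCL_SOCKET_IFNAME": "ens,eth"},
        # ... or instead exclude the virtual interfaces:
        # "env_vars": {"NCCL_SOCKET_IFNAME": "^docker,veth,lo"},
    },
)
# ... then construct TorchTrainer(train_func, scaling_config=...) exactly as in hf.py.
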
smiraldr commented 11 months ago

2023-10-28 01:50:44.876886: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-28 01:50:45.785911: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-10-28 01:50:45.786080: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-10-28 01:50:45.786095: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
comet_ml is installed but `COMET_API_KEY` is not set.
--------------------------------------------------------------------------
                 Aim collects anonymous usage analytics.                 
                        Read how to opt-out here:                         
    https://aimstack.readthedocs.io/en/latest/community/telemetry.html    
--------------------------------------------------------------------------
2023-10-28 01:50:49,176 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.0.0.117:6379...
2023-10-28 01:50:49,191 INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at 10.0.0.117:8265 
2023-10-28 01:50:49,277 INFO tune.py:654 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949

View detailed results here: /home/ray/ray_results/TorchTrainer_2023-10-28_01-50-49
To visualize your results with TensorBoard, run: `tensorboard --logdir /home/ray/ray_results/TorchTrainer_2023-10-28_01-50-49`
2023-10-28 01:51:13,998 ERROR tune_controller.py:1502 -- Trial task failed for trial TorchTrainer_179ad_00000
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 2549, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::_Inner.__init__() (pid=600, ip=100.114.112.156, actor_id=5086820ed9f9668a13d6a31503000000, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 185, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/util.py", line 304, in setup
    setup_kwargs[k] = parameter_registry.get(prefix + k)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/registry.py", line 301, in get
    return ray.get(self.references[k])
ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff0300000001e1f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*03000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 10.0.0.117) for more information about the Python worker failure.

Training errored after 0 iterations at 2023-10-28 01:51:14. Total running time: 24s
Error file: /home/ray/ray_results/TorchTrainer_2023-10-28_01-50-49/TorchTrainer_179ad_00000_0_2023-10-28_01-50-49/error.txt

2023-10-28 01:51:14,004 ERROR tune.py:1139 -- Trials did not complete: [TorchTrainer_179ad_00000]
2023-10-28 01:51:14,007 WARNING experiment_analysis.py:205 -- Failed to fetch metrics for 1 trial(s):
- TorchTrainer_179ad_00000: FileNotFoundError('Could not fetch metrics for TorchTrainer_179ad_00000: both result.json and progress.csv were not found at /home/ray/ray_results/TorchTrainer_2023-10-28_01-50-49/TorchTrainer_179ad_00000_0_2023-10-28_01-50-49')
(TrainTrainable pid=600, ip=10.0.0.196) Local object store memory usage:
(TrainTrainable pid=600, ip=10.0.0.196) 
(TrainTrainable pid=600, ip=10.0.0.196) (global lru) capacity: 18768886579
(TrainTrainable pid=600, ip=10.0.0.196) (global lru) used: 0%
(TrainTrainable pid=600, ip=10.0.0.196) (global lru) num objects: 0
(TrainTrainable pid=600, ip=10.0.0.196) (global lru) num evictions: 0
(TrainTrainable pid=600, ip=10.0.0.196) (global lru) bytes evicted: 0
(TrainTrainable pid=600, ip=10.0.0.196) 
(TrainTrainable pid=600, ip=10.0.0.196) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::_Inner.__init__() (pid=600, ip=100.114.112.156, actor_id=5086820ed9f9668a13d6a31503000000, repr=TorchTrainer)
(TrainTrainable pid=600, ip=10.0.0.196)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 185, in __init__
(TrainTrainable pid=600, ip=10.0.0.196)     self.setup(copy.deepcopy(self.config))
(TrainTrainable pid=600, ip=10.0.0.196)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/util.py", line 304, in setup
(TrainTrainable pid=600, ip=10.0.0.196)     setup_kwargs[k] = parameter_registry.get(prefix + k)
(TrainTrainable pid=600, ip=10.0.0.196)   File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/registry.py", line 301, in get
(TrainTrainable pid=600, ip=10.0.0.196)     return ray.get(self.references[k])
TuneError: Failure # 1 (occurred at 2023-10-28_01-51-14)
The actor died because of an error raised in its creation task, [36mray::_Inner.__init__()[39m (pid=600, ip=100.114.112.156, actor_id=5086820ed9f9668a13d6a31503000000, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 185, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/util.py", line 304, in setup
    setup_kwargs[k] = parameter_registry.get(prefix + k)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/registry.py", line 301, in get
    return ray.get(self.references[k])
ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff0300000001e1f505. To see information about where this ObjectRef was created in Python, set the 
environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs 
(`/tmp/ray/session_latest/logs/*03000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 10.0.0.117) for more information about the Python worker failure.

The above exception was the direct cause of the following exception:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ray/hf.py:81 in <module>                                                                   │
│                                                                                                  │
│   78 ray_trainer = TorchTrainer(                                                                 │
│   79 │   train_func, scaling_config=ScalingConfig(num_workers=4, use_gpu=True)                   │
│   80 )                                                                                           │
│ ❱ 81 ray_trainer.fit()                                                                           │
│                                                                                                  │
│ /home/ray/anaconda3/lib/python3.9/site-packages/ray/train/base_trainer.py:668 in fit             │
│                                                                                                  │
│   665 │   │   if result.error:                                                                   │
│   666 │   │   │   # Raise trainable errors to the user with a message to restore                 │
│   667 │   │   │   # or configure `FailureConfig` in a new run.                                   │
│ ❱ 668 │   │   │   raise TrainingFailedError(                                                     │
│   669 │   │   │   │   "\n".join([restore_msg, TrainingFailedError._FAILURE_CONFIG_MSG])          │
│   670 │   │   │   ) from result.error                                                            │
│   671 │   │   return result                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TrainingFailedError: The Ray Train run failed. Please inspect the previous error messages for a cause. After fixing the issue (assuming that the error is not caused by your own application 
logic, but rather an error such as OOM), you can restart the run from scratch or continue this run.
To continue this run, you can use: `trainer = TorchTrainer.restore("/home/ray/ray_results/TorchTrainer_2023-10-28_01-50-49")`.
To start a new run that will retry on training failures, set `train.RunConfig(failure_config=train.FailureConfig(max_failures))` in the Trainer's `run_config` with `max_failures > 0`, or 
`max_failures = -1` for unlimited retries.
(TrainTrainable pid=600, ip=10.0.0.196) ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff0300000001e1f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
(TrainTrainable pid=600, ip=10.0.0.196) 
(TrainTrainable pid=600, ip=10.0.0.196) The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*03000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 10.0.0.117) for more information about the Python worker failure.

Is there a way to make Ray itself use a specific network interface as well? I'm not sure what happened here.

matthewdeng commented 11 months ago

For the NCCL_SOCKET_IFNAME questions, I'm not too sure - these questions are better suited for NCCL.

For the Ray issue, this is unrelated to NCCL. I've seen this happen before if the workers cannot properly connect to the head node. Is the issue reproducible?
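
If you need Ray itself to bind to a specific interface, one thing worth trying (I haven't verified it on this setup) is passing that interface's address explicitly when starting each node, e.g. `ray start --node-ip-address=<ip of the desired interface> ...`, so Ray advertises that address instead of auto-detecting one.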

smiraldr commented 11 months ago

Thanks for the help on NCCL, I'll be able to figure this out. @matthewdeng I was trying to connect 2 VMs that are not in the same geographical region - what are the limits and best practices for creating a Ray cluster here? I've also tried connecting them over a single private network but hit this issue. Remote Ray tasks seem to work fine; this only happens when I launch a Ray Train job.

smiraldr commented 11 months ago

Hopefully this helps?

/tmp/ray/session_latest/logs/03000000ffffffffffffffffffffffffffffffffffffffffffffffff Output:

[2023-10-30 08:48:51,644 I 7500 7500] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 7500
[2023-10-30 08:48:51,646 I 7500 7500] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-10-30 08:48:51,652 I 7500 7500] grpc_server.cc:129: driver server started, listening on port 10006.
[2023-10-30 08:48:51,659 I 7500 7500] core_worker.cc:227: Initializing worker at address: 192.168.16.197:10006, worker ID 06000000ffffffffffffffffffffffffffffffffffffffffffffffff, raylet 0e6da528aae9f90cc09fb22babb40360918121dddd614df018b77792
[2023-10-30 08:48:51,660 I 7500 7500] task_event_buffer.cc:190: Reporting task events to GCS every 1000ms.
[2023-10-30 08:48:51,661 I 7500 7597] core_worker.cc:570: Event stats:

Global stats: 7 total (4 active)
Queueing time: mean = 21.646 us, max = 100.401 us, min = 21.257 us, total = 151.520 us
Execution time:  mean = 31.595 us, total = 221.165 us
Event stats:
    PeriodicalRunner.RunFnPeriodically - 2 total (1 active, 1 running), CPU time: mean = 9.509 us, total = 19.018 us
    UNKNOWN - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 167.147 us, total = 167.147 us
    WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 35.000 us, total = 35.000 us

-----------------
Task Event stats:

IO Service Stats:

Global stats: 2 total (1 active)
Queueing time: mean = 4.523 us, max = 9.046 us, min = 9.046 us, total = 9.046 us
Execution time:  mean = 6.109 us, total = 12.219 us
Event stats:
    CoreWorker.deadline_timer.flush_task_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    PeriodicalRunner.RunFnPeriodically - 1 total (0 active), CPU time: mean = 12.219 us, total = 12.219 us
Other Stats:
    grpc_in_progress:0
    current number of task events in buffer: 1
    total task events sent: 0 MiB
    total number of task events sent: 0
    num status task events dropped: 0
    num profile task events dropped: 0

[2023-10-30 08:48:51,661 I 7500 7597] accessor.cc:611: Received notification for node id = 75144c5dda31c02a95c6136be651b52e49c1082967eb32e2ef0abe59, IsAlive = 1
[2023-10-30 08:48:51,661 I 7500 7597] core_worker.cc:4262: Number of alive nodes:1
[2023-10-30 08:48:51,661 I 7500 7597] accessor.cc:611: Received notification for node id = 7fb2a236e768fc2ea3f19adf867b211e10ca0af53406a58d024f6312, IsAlive = 1
[2023-10-30 08:48:51,661 I 7500 7597] accessor.cc:611: Received notification for node id = 0e6da528aae9f90cc09fb22babb40360918121dddd614df018b77792, IsAlive = 1
[2023-10-30 08:48:51,661 I 7500 7597] accessor.cc:611: Received notification for node id = 209e8c99aa57aca3e213ac1a924d87c369511a4bf7193d9bc03a6afd, IsAlive = 0
[2023-10-30 08:48:51,661 I 7500 7597] core_worker.cc:306: Node failure from 209e8c99aa57aca3e213ac1a924d87c369511a4bf7193d9bc03a6afd. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2023-10-30 08:48:51,662 I 7500 7500] event.cc:234: Set ray event level to warning
[2023-10-30 08:48:51,662 I 7500 7500] event.cc:342: Ray Event initialized for CORE_WORKER
[2023-10-30 08:48:52,365 I 7500 7500] core_worker.cc:2131: Submitting Placement Group creation to GCS: 6b1e37d945742c16a5b63039cb9906000000
[2023-10-30 08:48:52,680 I 7500 7597] direct_task_transport.cc:289: Connecting to raylet 7fb2a236e768fc2ea3f19adf867b211e10ca0af53406a58d024f6312
[2023-10-30 08:48:54,405 I 7500 7500] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 5a3c8a5d9e25e895f3fa826a06000000
[2023-10-30 08:48:54,409 I 7500 7597] actor_manager.cc:214: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 5a3c8a5d9e25e895f3fa826a06000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2023-10-30 08:48:54,410 I 7500 7597] actor_manager.cc:214: received notification on actor, state: DEPENDENCIES_UNREADY, actor_id: 5a3c8a5d9e25e895f3fa826a06000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2023-10-30 08:48:54,410 I 7500 7597] actor_manager.cc:214: received notification on actor, state: PENDING_CREATION, actor_id: 5a3c8a5d9e25e895f3fa826a06000000, ip address: , port: 0, worker_id: NIL_ID, raylet_id: NIL_ID, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2023-10-30 08:49:17,327 I 7500 7597] direct_task_transport.cc:55: Actor creation failed and we will not be retrying the creation task, actor id = 5a3c8a5d9e25e895f3fa826a06000000, task id = ffffffffffffffff5a3c8a5d9e25e895f3fa826a06000000
[2023-10-30 08:49:17,327 I 7500 7597] actor_manager.cc:214: received notification on actor, state: ALIVE, actor_id: 5a3c8a5d9e25e895f3fa826a06000000, ip address: 100.121.244.176, port: 10006, worker_id: 085b533d0b5e049ca3e62b56977c27a2db28a7098460cb17c7a6469c, raylet_id: 75144c5dda31c02a95c6136be651b52e49c1082967eb32e2ef0abe59, num_restarts: 0, death context type=CONTEXT_NOT_SET
[2023-10-30 08:49:17,328 I 7500 7597] direct_actor_task_submitter.cc:237: Connecting to actor 5a3c8a5d9e25e895f3fa826a06000000 at worker 085b533d0b5e049ca3e62b56977c27a2db28a7098460cb17c7a6469c
[2023-10-30 08:49:17,462 I 7500 7597] actor_manager.cc:214: received notification on actor, state: DEAD, actor_id: 5a3c8a5d9e25e895f3fa826a06000000, ip address: 100.121.244.176, port: 10006, worker_id: 085b533d0b5e049ca3e62b56977c27a2db28a7098460cb17c7a6469c, raylet_id: 75144c5dda31c02a95c6136be651b52e49c1082967eb32e2ef0abe59, num_restarts: 0, death context type=CreationTaskFailureContext
[2023-10-30 08:49:17,462 I 7500 7597] direct_actor_task_submitter.cc:287: Failing pending tasks for actor 5a3c8a5d9e25e895f3fa826a06000000 because the actor is already dead.
[2023-10-30 08:49:17,463 I 7500 7597] task_manager.cc:829: task 4083cef8db49d2ff5a3c8a5d9e25e895f3fa826a06000000 retries left: 0, oom retries left: 0, task failed due to oom: 0
[2023-10-30 08:49:17,463 I 7500 7597] task_manager.cc:845: No retries left for task 4083cef8db49d2ff5a3c8a5d9e25e895f3fa826a06000000, not going to resubmit.
[2023-10-30 08:49:17,463 I 7500 7597] task_manager.cc:903: Task failed: IOError: Fail all inflight tasks due to actor state change.: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.tune.trainable.util, class_name=with_parameters.<locals>._Inner, function_name=__ray_ready__, function_hash=}, task_id=4083cef8db49d2ff5a3c8a5d9e25e895f3fa826a06000000, task_name=_Inner.__ray_ready__, job_id=06000000, num_args=0, num_returns=1, depth=1, attempt_number=0, actor_task_spec={actor_id=5a3c8a5d9e25e895f3fa826a06000000, actor_caller_id=ffffffffffffffffffffffffffffffffffffffff06000000, actor_counter=0}, runtime_env_hash=-898807589, eager_install=1, setup_timeout_seconds=600
[2023-10-30 08:49:18,111 I 7500 7500] core_worker.cc:718: Disconnecting to the raylet.
[2023-10-30 08:49:18,111 I 7500 7500] raylet_client.cc:163: RayletClient::Disconnect, exit_type=INTENDED_USER_EXIT, exit_detail=Shutdown by ray.shutdown()., has creation_task_exception_pb_bytes=0
[2023-10-30 08:49:18,111 I 7500 7500] core_worker.cc:641: Shutting down a core worker.
[2023-10-30 08:49:18,111 I 7500 7500] task_event_buffer.cc:201: Shutting down TaskEventBuffer.
[2023-10-30 08:49:18,111 I 7500 7604] task_event_buffer.cc:183: Task event buffer io service stopped.
[2023-10-30 08:49:18,111 I 7500 7500] core_worker.cc:667: Disconnecting a GCS client.
[2023-10-30 08:49:18,111 I 7500 7597] core_worker.cc:884: Core worker main io service stopped.
[2023-10-30 08:49:18,112 I 7500 7500] core_worker.cc:671: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2023-10-30 08:49:18,118 I 7500 7500] core_worker.cc:684: Core worker ready to be deallocated.
[2023-10-30 08:49:18,118 I 7500 7500] core_worker.cc:632: Core worker is destructed
[2023-10-30 08:49:18,118 I 7500 7500] task_event_buffer.cc:201: Shutting down TaskEventBuffer.
[2023-10-30 08:49:18,119 W 7500 7607] server_call.h:324: [1] Not sending reply because executor stopped.
[2023-10-30 08:49:18,120 I 7500 7500] core_worker_process.cc:148: Destructing CoreWorkerProcessImpl. pid: 7500
[2023-10-30 08:49:18,120 I 7500 7500] io_service_pool.cc:47: IOServicePool is stopped.
[2023-10-30 08:49:18,165 I 7500 7500] stats.h:128: Stats module has shutdown.
matthewdeng commented 11 months ago

@architkulkarni do you have any ideas on the issue above? To summarize:

Head Node Start command:

ray start --head --port=6379 --num-cpus=0 --num-gpus=0 --include-dashboard=true --dashboard-host=0.0.0.0 --node-manager-port=1915 --object-manager-port=1916 --dashboard-agent-grpc-port=1917 --dashboard-agent-listen-port=1918 --disable-usage-stats --block

Worker Node Start command:

ray start --address={public_head_ip}:6379 --node-manager-port=1915 --object-manager-port=1916 --dashboard-agent-grpc-port=1917 --dashboard-agent-listen-port=1918 --disable-usage-stats --block

Ray Train fails with the following:

  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/registry.py", line 301, in get
    return ray.get(self.references[k])
ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff0300000001e1f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*03000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 10.0.0.117) for more information about the Python worker failure.

Here's a related thread where the worker node was not actually connecting to the cluster.

architkulkarni commented 11 months ago

Not sure... I think you can use `ray status`, `ray.nodes()`, or `ray.cluster_resources()` to determine whether all the nodes are actually connected to the cluster. If a node got disconnected during the run, I would expect there to be some mention of that in the logs.
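
A minimal sketch of that check (assuming the script runs on a node that is already part of the cluster):

import ray

ray.init(address="auto")

# Every node you expect should show up as alive, with the GPUs it is supposed to have.
for node in ray.nodes():
    print(node["NodeManagerAddress"], "alive:", node["Alive"], "resources:", node["Resources"])

print("cluster total:", ray.cluster_resources())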

smiraldr commented 11 months ago

ray status shows all resources as expected. Even after running the training job and it failing with the object-owner-exited error, ray status still works and shows all resources. None of the worker nodes get evicted or die. Running a lightweight remote task, like one that returns the IPs of the nodes, also works (see the sketch below); this error only seems to pop up when I run a Train job.
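
For reference, this is roughly the kind of lightweight task that still succeeds (a rough sketch):

import ray

ray.init(address="auto")

@ray.remote
def node_ip():
    # Returns the IP of whichever node this task ran on.
    return ray.util.get_node_ip_address()

print(set(ray.get([node_ip.remote() for _ in range(8)])))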

smiraldr commented 11 months ago

Had another variation of this for the same Ray Train code:

(base) root@smiral-0:~# python3 hf.py
2023-10-31 10:09:34.713036: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-31 10:09:35.485444: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-10-31 10:09:35.485523: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-10-31 10:09:35.485529: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
comet_ml is installed but `COMET_API_KEY` is not set.
2023-10-31 10:09:38,678 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.0.0.30:6379...
2023-10-31 10:09:38,685 INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at 10.0.0.30:8265 
2023-10-31 10:09:38,734 INFO tune.py:228 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Trainer(...)`.
2023-10-31 10:09:38,737 INFO tune.py:654 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949

View detailed results here: /home/ray/ray_results/TorchTrainer_2023-10-31_10-09-38
To visualize your results with TensorBoard, run: `tensorboard --logdir /home/ray/ray_results/TorchTrainer_2023-10-31_10-09-38`

(raylet, ip=10.0.0.19) [2023-10-31 10:10:39,130 E 295 295] (raylet) worker_pool.cc:548: Some workers of the worker process(419) have not registered within the timeout. The process is dead, probably it crashed during start.
(raylet, ip=10.0.0.19) [2023-10-31 10:10:39,132 E 295 295] (raylet) worker_pool.cc:548: Some workers of the worker process(420) have not registered within the timeout. The process is dead, probably it crashed during start.
(raylet, ip=10.0.0.19) [2023-10-31 10:10:39,130 E 295 295] (raylet) worker_pool.cc:548: Some workers of the worker process(419) have not registered within the timeout. The process is dead, probably it crashed during start.
(raylet, ip=10.0.0.19) [2023-10-31 10:10:39,132 E 295 295] (raylet) worker_pool.cc:548: Some workers of the worker process(420) have not registered within the timeout. The process is dead, probably it crashed during start.
(raylet, ip=10.0.0.78) [2023-10-31 10:11:22,088 E 293 293] (raylet) runtime_env_agent_client.cc:256: Runtime Env Agent timed out as NotFound in 30000ms. Status: NotFound: on_connect Connection timed out, address: 100.106.52.141, port: 63107, Suiciding...
(raylet, ip=10.0.0.78) [2023-10-31 10:11:22,088 E 293 293] (raylet) runtime_env_agent_client.cc:216: The raylet exited immediately because the runtime env agent timed out when Raylet try to connect to it. This can happen because the runtime env agent was never started, or is listening to the wrong port. Read the log `cat /tmp/ray/session_latest/logs/runtime_env_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet, ip=10.0.0.78) 
(raylet, ip=10.0.0.78) [2023-10-31 10:11:22,088 E 293 293] (raylet) runtime_env_agent_client.cc:256: Runtime Env Agent timed out as NotFound in 30000ms. Status: NotFound: on_connect Connection timed out, address: 100.106.52.141, port: 63107, Suiciding...
(raylet, ip=10.0.0.78) [2023-10-31 10:11:22,088 E 293 293] (raylet) runtime_env_agent_client.cc:216: The raylet exited immediately because the runtime env agent timed out when Raylet try to connect to it. This can happen because the runtime env agent was never started, or is listening to the wrong port. Read the log `cat /tmp/ray/session_latest/logs/runtime_env_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet, ip=10.0.0.78) 
(raylet, ip=10.0.0.78) [2023-10-31 10:11:22,088 E 293 293] (raylet) runtime_env_agent_client.cc:256: Runtime Env Agent timed out as NotFound in 30000ms. Status: NotFound: on_connect Connection timed out, address: 100.106.52.141, port: 63107, Suiciding...
(raylet, ip=10.0.0.78) [2023-10-31 10:11:22,088 E 293 293] (raylet) runtime_env_agent_client.cc:216: The raylet exited immediately because the runtime env agent timed out when Raylet try to connect to it. This can happen because the runtime env agent was never started, or is listening to the wrong port. Read the log `cat /tmp/ray/session_latest/logs/runtime_env_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet, ip=10.0.0.78) 
(raylet, ip=10.0.0.78) [2023-10-31 10:11:22,088 E 293 293] (raylet) runtime_env_agent_client.cc:256: Runtime Env Agent timed out as NotFound in 30000ms. Status: NotFound: on_connect Connection timed out, address: 100.106.52.141, port: 63107, Suiciding...
(raylet, ip=10.0.0.78) [2023-10-31 10:11:22,088 E 293 293] (raylet) runtime_env_agent_client.cc:216: The raylet exited immediately because the runtime env agent timed out when Raylet try to connect to it. This can happen because the runtime env agent was never started, or is listening to the wrong port. Read the log `cat /tmp/ray/session_latest/logs/runtime_env_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet, ip=10.0.0.78) 
(raylet, ip=10.0.0.19) [2023-10-31 10:11:22,229 E 295 295] (raylet) runtime_env_agent_client.cc:256: Runtime Env Agent timed out as NotFound in 30000ms. Status: NotFound: on_connect Connection timed out, address: 100.98.192.60, port: 46287, Suiciding...
(raylet, ip=10.0.0.19) [2023-10-31 10:11:22,229 E 295 295] (raylet) runtime_env_agent_client.cc:216: The raylet exited immediately because the runtime env agent timed out when Raylet try to connect to it. This can happen because the runtime env agent was never started, or is listening to the wrong port. Read the log `cat /tmp/ray/session_latest/logs/runtime_env_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet, ip=10.0.0.19) 
(raylet, ip=10.0.0.19) [2023-10-31 10:11:22,230 E 295 295] (raylet) runtime_env_agent_client.cc:256: Runtime Env Agent timed out as NotFound in 30000ms. Status: NotFound: on_connect Connection timed out, address: 100.98.192.60, port: 46287, Suiciding...
(raylet, ip=10.0.0.19) [2023-10-31 10:11:22,230 E 295 295] (raylet) runtime_env_agent_client.cc:216: The raylet exited immediately because the runtime env agent timed out when Raylet try to connect to it. This can happen because the runtime env agent was never started, or is listening to the wrong port. Read the log `cat /tmp/ray/session_latest/logs/runtime_env_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet, ip=10.0.0.19) 
(raylet, ip=10.0.0.19) [2023-10-31 10:11:22,229 E 295 295] (raylet) runtime_env_agent_client.cc:256: Runtime Env Agent timed out as NotFound in 30000ms. Status: NotFound: on_connect Connection timed out, address: 100.98.192.60, port: 46287, Suiciding...
(raylet, ip=10.0.0.19) [2023-10-31 10:11:22,229 E 295 295] (raylet) runtime_env_agent_client.cc:216: The raylet exited immediately because the runtime env agent timed out when Raylet try to connect to it. This can happen because the runtime env agent was never started, or is listening to the wrong port. Read the log `cat /tmp/ray/session_latest/logs/runtime_env_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet, ip=10.0.0.19) 
(raylet, ip=10.0.0.19) [2023-10-31 10:11:22,230 E 295 295] (raylet) runtime_env_agent_client.cc:256: Runtime Env Agent timed out as NotFound in 30000ms. Status: NotFound: on_connect Connection timed out, address: 100.98.192.60, port: 46287, Suiciding...
(raylet, ip=10.0.0.19) [2023-10-31 10:11:22,230 E 295 295] (raylet) runtime_env_agent_client.cc:216: The raylet exited immediately because the runtime env agent timed out when Raylet try to connect to it. This can happen because the runtime env agent was never started, or is listening to the wrong port. Read the log `cat /tmp/ray/session_latest/logs/runtime_env_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet, ip=10.0.0.19)