ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Tune] Ray Tune Trials TorchTrainer failing #39672

Closed f2010126 closed 1 year ago

f2010126 commented 1 year ago

What happened + What you expected to happen

I'm using Ray Tune 3.0.0.dev0 with TorchTrainer and PyTorch Lightning to optimise a BERT model. Frequently, a trial fails with a CUDA error.

.../lib/python3.10/site-packages/torch/cuda/__init__.py:546: UserWarning: Can't initialize NVML
(RayTrainWorker pid=7253, ip=10.5.166.183)   warnings.warn("Can't initialize NVML")
....
raise AssertionError("Invalid device id")
AssertionError: Invalid device id

It has happened for 5 of the 15 trials I ran. I am running on a Slurm cluster and launch my jobs with the scripts here.
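
To narrow down what the failing workers actually see, I can add a small diagnostic to the training function (a hypothetical helper, not part of my original script; it only uses standard `os`/`torch` calls):

```python
# Hypothetical diagnostic helper: print the CUDA devices each Ray Train
# worker sees before Lightning tries to select one.
import os

import torch


def log_cuda_visibility(prefix: str = "") -> None:
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")
    count = torch.cuda.device_count()
    print(f"{prefix} CUDA_VISIBLE_DEVICES={visible}, device_count={count}")
    for i in range(count):
        try:
            print(f"{prefix} device {i}: {torch.cuda.get_device_name(i)}")
        except (AssertionError, RuntimeError) as exc:
            # Same code path that raises "Invalid device id" in the traceback above.
            print(f"{prefix} device {i}: query failed with {exc!r}")
```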

Versions / Dependencies

torch 2.0.1
ray 3.0.0.dev0
pytorch-lightning 2.0.8
transformers 4.32.0
Linux: Ubuntu 22.04.2 LTS
CUDA Version: 12.1
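
If it helps, the same information can be collected on each node with a quick snippet along these lines (illustrative only, not how the list above was originally gathered):

```python
# Illustrative: print package versions and the CUDA version torch was built with.
import platform
from importlib.metadata import version

import torch

for pkg in ("torch", "ray", "pytorch-lightning", "transformers"):
    print(pkg, version(pkg))
print("OS:", platform.platform())
print("CUDA (torch build):", torch.version.cuda)
```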

Reproduction script

import os

import pytorch_lightning as pl
import ray
from ray import tune
from ray.train.lightning import (
    RayDDPStrategy,
    RayLightningEnvironment,
    RayTrainReportCallback,
    prepare_trainer,
)
from ray.train.torch import TorchTrainer
from ray.tune.schedulers import ASHAScheduler

# get_datamodule and AshaTransformer are project-specific helpers (not shown here).


def objective_torch_trainer(config, data_dir=os.path.join(os.getcwd(), "testing_data")):
    dm = get_datamodule(task_name="sentilex", model_name_or_path=config['model_name_or_path'],
                        max_seq_length=config['max_seq_length'],
                        train_batch_size=config['per_device_train_batch_size'],
                        eval_batch_size=config['per_device_eval_batch_size'], data_dir=data_dir)
    dm.setup("fit")
    model = AshaTransformer(config=config, num_labels=dm.task_metadata['num_labels'])
    ckpt_report_callback = RayTrainReportCallback()
    trainer = pl.Trainer(
        max_epochs=config['num_epochs'],
        # Let Lightning use whatever devices Ray assigns to this worker.
        devices='auto',
        accelerator='auto',
        enable_progress_bar=True,
        max_time="00:12:00:00",  # give each run a time limit
        val_check_interval=0.5,  # check the validation set twice per training epoch
        strategy=RayDDPStrategy(),
        plugins=[RayLightningEnvironment()],
        callbacks=[ckpt_report_callback])

    # Validate your Lightning trainer configuration
    trainer = prepare_trainer(trainer)
    trainer.fit(model, datamodule=dm)

def tune_func_torch_trainer(num_samples=10, num_epochs=10, exp_name="torch_transform"):
    scheduler = ASHAScheduler(max_t=num_epochs,
                              grace_period=1,
                              reduction_factor=2)

    train_fn_with_parameters = tune.with_parameters(objective_torch_trainer, data_dir=os.path.join(os.getcwd(), "testing_data"))
    scaling_config = ray.train.ScalingConfig(
        # 8 training workers per trial, each assigned 1 GPU and 2 CPUs
        num_workers=8, use_gpu=True, resources_per_worker={"CPU": 2, "GPU": 1}
    )

    ray_trainer = TorchTrainer(
        train_fn_with_parameters,
        scaling_config=scaling_config,
    )

    tuner = tune.Tuner(ray_trainer,
                       tune_config=tune.TuneConfig(
                           metric="ptl/val_accuracy",
                           mode="max",
                           scheduler=scheduler,
                           num_samples=num_samples,
                       ),
                       run_config=ray.train.RunConfig(
                           name=exp_name,
                           verbose=2,
                           storage_path=result_dir,
                           log_to_file=True,
                           checkpoint_config=ray.train.CheckpointConfig(
                               num_to_keep=3,
                               checkpoint_score_attribute="ptl/val_accuracy",
                               checkpoint_score_order="max",
                           ),
                       ),
                       param_space={"train_loop_config": hpo_config},
                       )
    results = tuner.fit()

    print("Best hyperparameters found were: ", results.get_best_result().config)

Issue Severity

High: It blocks me from completing my task.

krfricke commented 1 year ago

Is this on one machine with 8 GPUs, or 8 machines with 1 GPU (or something else)?

Which kind of GPUs are you using?
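
If you're not sure what Ray itself sees, something like this run from the driver node should show the per-node resources (just a quick sketch):

```python
# Quick sketch: print the resources the connected Ray cluster reports.
import ray

ray.init(address="auto")
print(ray.cluster_resources())  # aggregate CPUs/GPUs across the cluster
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Resources"])  # per-node breakdown
```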

f2010126 commented 1 year ago

I'm using SLURM nodes. Each node had 8 GPUs and 64 CPUs available to it; I had 2 nodes. The GPUs are RTX 2080 Ti.
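
Since each TorchTrainer trial asks for 8 workers with 1 GPU and 2 CPUs each, that allocation can only run two trials at a time; a quick sanity check of the arithmetic (illustrative only):

```python
# Illustrative resource math for this allocation (2 nodes x 8 GPUs / 64 CPUs)
# against ScalingConfig(num_workers=8, resources_per_worker={"CPU": 2, "GPU": 1}).
nodes = 2
gpus_per_node = 8
cpus_per_node = 64

workers_per_trial = 8
gpus_per_worker = 1
cpus_per_worker = 2

max_by_gpu = (nodes * gpus_per_node) // (workers_per_trial * gpus_per_worker)
max_by_cpu = (nodes * cpus_per_node) // (workers_per_trial * cpus_per_worker)
print(min(max_by_gpu, max_by_cpu))  # -> 2 concurrent trials
```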

Output from the Slurm error log:

```bash
[2023-09-15 00:42:21,354 I 2007 2007] global_state_accessor.cc:368: This node has an IP address of 10.5.166.180, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
2023-09-15 00:42:33,629 INFO worker.py:1465 -- Connecting to existing Ray cluster at address: 10.5.166.177:6379...
DEBUG:filelock:Attempting to acquire lock 140380820045936 on /tmp/ray/session_2023-09-15_00-41-50_138801_2959/node_ip_address.json.lock
DEBUG:filelock:Lock 140380820045936 acquired on /tmp/ray/session_2023-09-15_00-41-50_138801_2959/node_ip_address.json.lock
DEBUG:filelock:Lock 140380820046272 acquired on /tmp/ray/session_2023-09-15_00-41-50_138801_2959/ports_by_node.json.lock
DEBUG:filelock:Attempting to release lock 140380820046272 on /tmp/ray/session_2023-09-15_00-41-50_138801_2959/ports_by_node.json.lock
DEBUG:filelock:Lock 140380820046272 released on /tmp/ray/session_2023-09-15_00-41-50_138801_2959/ports_by_node.json.lock
2023-09-15 00:42:33,640 INFO worker.py:1640 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265
2023-09-15 00:42:39,067 INFO tuner_internal.py:466 -- A `RunConfig` was passed to both the `Tuner` and the `TorchTrainer`. The run config passed to the `Tuner` is the one that will be used.
2023-09-15 00:42:39,149 INFO tune.py:654 -- [output] This will use the new output engine with verbosity 2. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949
DEBUG:fsspec.local:open file: /XXXXXXXXXXX/ray_results/10DataLocSentiLex/tuner.pkl
DEBUG:fsspec.local:open file: /XXXXXXXXXXX/ray_results/10DataLocSentiLex/basic-variant-state-2023-09-15_00-42-39.json
DEBUG:fsspec.local:open file: /XXXXXXXXXXX/ray_results/10DataLocSentiLex/experiment_state-2023-09-15_00-42-39.json
(TorchTrainer pid=7841) Starting distributed worker processes: ['8006 (10.5.166.177)', '8007 (10.5.166.177)', '8008 (10.5.166.177)', '8009 (10.5.166.177)', '8010 (10.5.166.177)', '8011 (10.5.166.177)', '8012 (10.5.166.177)', '8013 (10.5.166.177)']
(RayTrainWorker pid=8006) Setting up process group for: env:// [rank=0, world_size=8]
(TorchTrainer pid=2376, ip=10.5.166.180) Starting distributed worker processes: ['2501 (10.5.166.180)', '2502 (10.5.166.180)', '2503 (10.5.166.180)', '2504 (10.5.166.180)', '2505 (10.5.166.180)', '2506 (10.5.166.180)', '2507 (10.5.166.180)', '2508 (10.5.166.180)']
(RayTrainWorker pid=2504, ip=10.5.166.180) [rank: 3] Global seed set to 1234
(RayTrainWorker pid=2501, ip=10.5.166.180) Setting up process group for: env:// [rank=0, world_size=8]
(RayTrainWorker pid=2504, ip=10.5.166.180) LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
DEBUG:fsspec.local:open file: /XXXXXXXXXXX/ray_results/10DataLocSentiLex/TorchTrainer_02133_00000_Bys2a/Tokenised.lock
DEBUG:fsspec.local:open file: /XXXXXXXXXXX/ray_results/10DataLocSentiLex/TorchTrainer_02133_00000_Bys2a/params.pkl
DEBUG:fsspec.local:open file: /XXXXXXXXXXX/ray_results/10DataLocSentiLex/TorchTrainer_02133_00000_Bys2a/params.json
DEBUG:fsspec.local:open file: /XXXXXXXXXXX/ray_results/10DataLocSentiLex/TorchTrainer_02133_00000_Bys2a/events.out.tfevents.1694731368.dlcgpu17
DEBUG:fsspec.local:open file: /XXXXXXXXXXX/ray_results/10DataLocSentiLex/TorchTrainer_02133_00000_Bys2a/ray_results_log/torch_trainer_logs/csv_torch_trainer_logs/metrics.csv
DEBUG:fsspec.local:open file: /XXXXXXXXXXX/ray_results/10DataLocSentiLex/TorchTrainer_02133_00000_Bys2a/ray_results_log/torch_trainer_logs/csv_torch_trainer_logs/checkpoints/epoch=2-step=672.ckpt
DEBUG:fsspec.local:open file: /XXXXXXXXXXX/ray_results/10DataLocSentiLex/TorchTrainer_02133_00000_Bys2a/progress.csv
(RayTrainWorker pid=9232) Setting up process group for: env:// [rank=0, world_size=8]
(RayTrainWorker pid=9237) LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7] [repeated 7x across cluster]
(RayTrainWorker pid=9232) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/work/dlclarge1/XXXXXXXXXXX/ray_cluster_test/ray_results_result/10DataLocSentiLex/TorchTrainer_02133_00002_gBsbc/checkpoint_000000) [repeated 7x across cluster]
(TorchTrainer pid=10349) Starting distributed worker processes: ['10477 (10.5.166.177)', '10478 (10.5.166.177)', '10479 (10.5.166.177)', '10480 (10.5.166.177)', '10481 (10.5.166.177)', '10482 (10.5.166.177)', '10483 (10.5.166.177)', '10484 (10.5.166.177)']
(RayTrainWorker pid=10477) Setting up process group for: env:// [rank=0, world_size=8]
Map: 0%| | 0/3974 [00:00)
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/torch/cuda/__init__.py", line 145, in _check_capability
    capability = get_device_capability(d)
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/torch/cuda/__init__.py", line 381, in get_device_capability
    prop = get_device_properties(device)
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/torch/cuda/__init__.py", line 399, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError: CUDA error: unknown error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
The above exception was the direct cause of the following exception:
ray::_RayTrainWorker__execute.get_next() (pid=10481, ip=10.5.166.177, actor_id=051b2da48eb0dbfbdbc6f0a701000000, repr=)
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
    prop = get_device_properties(device)
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/torch/cuda/__init__.py", line 395, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/torch/cuda/__init__.py", line 264, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: CUDA error: unknown error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
CUDA call was originally invoked at: [' File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/ray/_private/workers/default_worker.py", line 278, in \n worker.main_loop()\n', ' File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/ray/_private/worker.py", line 783, in main_loop\n self.core_worker.run_task_loop()\n', ' File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/ray/_private/worker.py", line 729, in deserialize_objects\n return context.deserialize_objects(data_metadata_pairs, object_refs)\n', ' File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/ray/_private/serialization.py", line 404, in deserialize_objects\n obj = self._deserialize_object(data, metadata, object_ref)\n', ' File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/ray/_private/serialization.py", line 270, in _deserialize_object\n return self._deserialize_msgpack_data(data, metadata_fields)\n', ' File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/ray/_private/serialization.py", line 225, in _deserialize_msgpack_data\n python_objects = self._deserialize_pickle5_data(pickle5_data)\n', ' File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/ray/_private/serialization.py", line 215, in _deserialize_pickle5_data\n obj = pickle.loads(in_band)\n', ' File "", line 1027, in _find_and_load\n', ' File "", line 992, in _find_and_load_unlocked\n', ' File "", line 241, in _call_with_frames_removed\n', ' File "", line 1027, in _find_and_load\n', ' File "", line 1006, in _find_and_load_unlocked\n', ' File "", line 688, in _load_unlocked\n', ' File "", line 883, in exec_module\n', ' File "", line 241, in _call_with_frames_removed\n', ' File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/ray/train/torch/__init__.py", line 3, in \n import torch # noqa: F401\n', ' File "", line 1027, in _find_and_load\n', ' File "", line 1006, in _find_and_load_unlocked\n', ' File "", line 688, in _load_unlocked\n', ' File "", line 883, in exec_module\n', ' File "", line 241, in _call_with_frames_removed\n', ' File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/torch/__init__.py", line 1146, in \n _C._initExtension(manager_path())\n', ' File "", line 1027, in _find_and_load\n', ' File "", line 1006, in _find_and_load_unlocked\n', ' File "", line 688, in _load_unlocked\n', ' File "", line 883, in exec_module\n', ' File "", line 241, in _call_with_frames_removed\n', ' File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/torch/cuda/__init__.py", line 197, in \n _lazy_call(_check_capability)\n', ' File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/torch/cuda/__init__.py", line 195, in _lazy_call\n _queued_calls.append((callable, traceback.format_stack()))\n']
(TorchTrainer pid=11391) Starting distributed worker processes: ['11499 (10.5.166.177)', '11500 (10.5.166.177)', '11501 (10.5.166.177)', '11502 (10.5.166.177)', '11503 (10.5.166.177)', '11504 (10.5.166.177)', '11505 (10.5.166.177)', '11506 (10.5.166.177)']
2023-09-15 00:49:58,967 ERROR tune_controller.py:1502 -- Trial task failed for trial TorchTrainer_02133_00004_TyQxl
Traceback (most recent call last):
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/pytorch_lightning/accelerators/cuda.py", line 44, in setup_device
    _check_cuda_matmul_precision(device)
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/lightning_fabric/accelerators/cuda.py", line 349, in _check_cuda_matmul_precision
    major, _ = torch.cuda.get_device_capability(device)
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/torch/cuda/__init__.py", line 381, in get_device_capability
    prop = get_device_properties(device)
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/torch/cuda/__init__.py", line 398, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id
(TorchTrainer pid=12403) Starting distributed worker processes: ['12536 (10.5.166.177)', '12537 (10.5.166.177)', '12538 (10.5.166.177)', '12539 (10.5.166.177)', '12540 (10.5.166.177)', '12541 (10.5.166.177)', '12542 (10.5.166.177)', '12543 (10.5.166.177)']
(RayTrainWorker pid=11500) [rank: 1] Global seed set to 42 [repeated 4x across cluster]
(RayTrainWorker pid=12536) Setting up process group for: env:// [rank=0, world_size=8]
(RayTrainWorker pid=12542) /XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/torch/cuda/__init__.py:546: UserWarning: Can't initialize NVML
(RayTrainWorker pid=12542) warnings.warn("Can't initialize NVML")
(RayTrainWorker pid=12542) [rank: 6] Global seed set to 1234
(RayTrainWorker pid=12541) /XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/torch/cuda/__init__.py:546: UserWarning: Can't initialize NVML [repeated 7x across cluster]
(RayTrainWorker pid=12541) warnings.warn("Can't initialize NVML") [repeated 7x across cluster]
(RayTrainWorker pid=12536) GPU available: True (cuda), used: True
(RayTrainWorker pid=12536) TPU available: False, using: 0 TPU cores
(RayTrainWorker pid=12536) IPU available: False, using: 0 IPUs
(RayTrainWorker pid=12536) HPU available: False, using: 0 HPUs
2023-09-15 00:50:27,045 ERROR tune_controller.py:1502 -- Trial task failed for trial TorchTrainer_02133_00005_WqX4q
Traceback (most recent call last):
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/ray/_private/worker.py", line 2554, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::_Inner.train() (pid=12403, ip=10.5.166.177, actor_id=8973239945bd93cc9cfd39cb01000000, repr=TorchTrainer)
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 400, in train
    raise skipped from exception_cause(skipped)
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/ray/train/_internal/utils.py", line 54, in check_for_failure
    ray.get(object_ref)
    prop = get_device_properties(device)
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/torch/cuda/__init__.py", line 398, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id
(TorchTrainer pid=13442) Starting distributed worker processes: ['13554 (10.5.166.177)', '13555 (10.5.166.177)', '13556 (10.5.166.177)', '13557 (10.5.166.177)', '13558 (10.5.166.177)', '13559 (10.5.166.177)', '13560 (10.5.166.177)', '13561 (10.5.166.177)']
2023-09-15 00:50:53,563 ERROR tune_controller.py:1502 -- Trial task failed for trial TorchTrainer_02133_00006_ctZ2i
Traceback (most recent call last):
  File "/XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
(TorchTrainer pid=14458) Starting distributed worker processes: ['14588 (10.5.166.177)', '14589 (10.5.166.177)', '14590 (10.5.166.177)', '14591 (10.5.166.177)', '14592 (10.5.166.177)', '14593 (10.5.166.177)', '14594 (10.5.166.177)', '14595 (10.5.166.177)']
(RayTrainWorker pid=13559) Some weights of BertForSequenceClassification were not initialized from the model checkpoint at deepset/bert-base-german-cased-oldvocab and are newly initialized: ['classifier.bias', 'classifier.weight'] [repeated 2x across cluster]
(RayTrainWorker pid=13559) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. [repeated 7x across cluster]
(RayTrainWorker pid=13555) Some weights of BertForSequenceClassification were not initialized from the model checkpoint at deepset/bert-base-german-cased-oldvocab and are newly initialized: ['classifier.weight', 'classifier.bias'] [repeated 4x across cluster]
(RayTrainWorker pid=13561) [rank: 7] Global seed set to 1234 [repeated 6x across cluster]
(RayTrainWorker pid=14588) Setting up process group for: env:// [rank=0, world_size=8]
(RayTrainWorker pid=14588) /XXXXXXXXXXX/XXXXXXXXXXX/lib/python3.10/site-packages/torch/cuda/__init__.py:546: UserWarning: Can't initialize NVML
(RayTrainWorker pid=14588) warnings.warn("Can't initialize NVML")
Map: 0%| | 0/3974 [00:00
```
f2010126 commented 1 year ago

This issue shows up intermittently. The only workaround I have is to restart the experiment.
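
For anyone else hitting this: restarting does not have to lose progress, since the Tune experiment can be restored from its existing directory. A rough sketch (the path is a placeholder, and `rebuild_trainer()` is a stand-in for re-creating the same TorchTrainer as in the reproduction script above):

```python
# Rough sketch: restore the interrupted Tune experiment instead of starting over.
from ray import tune


def rebuild_trainer():
    # Re-create the TorchTrainer exactly as in tune_func_torch_trainer() above.
    raise NotImplementedError


experiment_path = "/path/to/ray_results/torch_transform"  # placeholder

if tune.Tuner.can_restore(experiment_path):
    tuner = tune.Tuner.restore(
        experiment_path,
        trainable=rebuild_trainer(),
        resume_errored=True,  # re-run the trials that died with the CUDA error
    )
    results = tuner.fit()
```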