ray-project / ray_lightning

Pytorch Lightning Distributed Accelerators using Ray
Apache License 2.0

TypeError in a SLURM environment due to internal API break #235

Closed (dcfidalgo closed this issue 1 year ago)

dcfidalgo commented 1 year ago

Using the master branch of ray-lightning with pytorch-lightning v1.6 in a SLURM environment leads to the following exception:

ray.exceptions.RayTaskError(TypeError): ray::ImplicitFunc.train() (pid=117539, ip=10.181.76.37, repr=train)
  File ".../lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 367, in train
    raise skipped from exception_cause(skipped)
  File ".../lib/python3.9/site-packages/ray/tune/trainable/function_trainable.py", line 335, in entrypoint
    return self._trainable_func(
  File ".../lib/python3.9/site-packages/ray/tune/trainable/function_trainable.py", line 652, in _trainable_func
    output = fn()
  File ".../random_search.py", line 122, in train
    trainer = Trainer(
  File ".../lib/python3.9/site-packages/pytorch_lightning/utilities/argparse.py", line 339, in insert_env_defaults
    return fn(self, **kwargs)
  File ".../lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 485, in __init__
    self._accelerator_connector = AcceleratorConnector(
  File ".../lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 204, in __init__
    self.cluster_environment: ClusterEnvironment = self._choose_and_init_cluster_environment()
  File ".../lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 549, in _choose_and_init_cluster_environment
    if self._is_slurm_managing_tasks():
  File ".../lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 562, in _is_slurm_managing_tasks
    total_requested_devices = len(self._parallel_devices) * self._num_nodes_flag
TypeError: object of type 'NoneType' has no len()

The _GPUAccelerator.get_parallel_devices method breaks the internal PyTorch Lightning API by returning None in some cases. Is this intentional? Returning an empty list instead of None fixes my issue, but I don't know whether None is required in other ray-lightning use cases.
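To illustrate the failure and the proposed fix, here is a minimal sketch that needs neither ray-lightning nor pytorch-lightning. All function names are hypothetical stand-ins, not the real code. The report does not say when None is returned, so "zero devices" is used purely as a placeholder condition. Only the `len(...) * num_nodes` arithmetic is taken from the traceback above.

```python
from typing import List, Optional


def get_parallel_devices_or_none(num_devices: int) -> Optional[List[int]]:
    """Hypothetical stand-in for the reported behaviour.

    Returns None in "some cases"; the zero-device condition here is an
    assumption made only for illustration.
    """
    if num_devices == 0:
        return None
    return list(range(num_devices))


def get_parallel_devices(num_devices: int) -> List[int]:
    """Hypothetical stand-in for the proposed fix: always return a list."""
    return list(range(num_devices))  # an empty list when num_devices == 0


def total_requested_devices(parallel_devices, num_nodes: int) -> int:
    # Same arithmetic as the failing line in _is_slurm_managing_tasks.
    return len(parallel_devices) * num_nodes


if __name__ == "__main__":
    # With the fix, the computation succeeds and yields zero devices.
    print(total_requested_devices(get_parallel_devices(0), num_nodes=2))

    # Without it, len(None) raises the TypeError from the traceback.
    try:
        total_requested_devices(get_parallel_devices_or_none(0), num_nodes=2)
    except TypeError as err:
        print(f"TypeError: {err}")
```

An empty list keeps `len()` valid for every caller, which is why it looks like the safer return value unless some other code path explicitly checks for None.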

I would be more than happy to provide a PR if you think the fix is fine.

Thank you for this very convenient package and keep up the fantastic work!

richardliaw commented 1 year ago

Creating a PR would be much appreciated!

dcfidalgo commented 1 year ago

@amogkam Thanks a lot for the PR! I was about to tackle this ... :)