Using the master branch of `ray-lightning` with pytorch-lightning v1.6 in a SLURM environment leads to the following exception:
```
ray.exceptions.RayTaskError(TypeError): ray::ImplicitFunc.train() (pid=117539, ip=10.181.76.37, repr=train)
  File ".../lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 367, in train
    raise skipped from exception_cause(skipped)
  File ".../lib/python3.9/site-packages/ray/tune/trainable/function_trainable.py", line 335, in entrypoint
    return self._trainable_func(
  File ".../lib/python3.9/site-packages/ray/tune/trainable/function_trainable.py", line 652, in _trainable_func
    output = fn()
  File ".../random_search.py", line 122, in train
    trainer = Trainer(
  File ".../lib/python3.9/site-packages/pytorch_lightning/utilities/argparse.py", line 339, in insert_env_defaults
    return fn(self, **kwargs)
  File ".../lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 485, in __init__
    self._accelerator_connector = AcceleratorConnector(
  File ".../lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 204, in __init__
    self.cluster_environment: ClusterEnvironment = self._choose_and_init_cluster_environment()
  File ".../lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 549, in _choose_and_init_cluster_environment
    if self._is_slurm_managing_tasks():
  File ".../lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 562, in _is_slurm_managing_tasks
    total_requested_devices = len(self._parallel_devices) * self._num_nodes_flag
TypeError: object of type 'NoneType' has no len()
```
The `_GPUAccelerator.get_parallel_devices` method breaks the internal PyTorch Lightning API by returning `None` in some cases. Is this intentional? Returning an empty list instead of `None` fixes my issue, but I don't know whether `None` is required in other ray-lightning use cases.
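To make the failure mode concrete, here is a minimal, self-contained sketch (the function name and signature are simplified stand-ins, not the real PyTorch Lightning internals) of the check that blows up, and why an empty list keeps it well-defined:

```python
def is_slurm_managing_tasks(parallel_devices, num_nodes):
    """Simplified stand-in for pytorch_lightning's _is_slurm_managing_tasks:
    it multiplies the number of parallel devices by the node count."""
    total_requested_devices = len(parallel_devices) * num_nodes
    return total_requested_devices > 0

# If get_parallel_devices returns None, len(None) raises:
#   TypeError: object of type 'NoneType' has no len()
try:
    is_slurm_managing_tasks(None, 1)
except TypeError as exc:
    print(exc)  # object of type 'NoneType' has no len()

# Returning an empty list instead keeps the computation valid:
print(is_slurm_managing_tasks([], 1))  # False: zero devices requested
```

This is why swapping `None` for `[]` in `get_parallel_devices` is enough to get past the `AcceleratorConnector` check, assuming nothing downstream in ray-lightning distinguishes `None` from an empty device list.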
I would be more than happy to provide a PR if you think the fix is fine.
Thank you for this very convenient package and keep up the fantastic work!