intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Errors in dlrover after pip-installing the dlrover package #1260

Closed Desperadoze closed 4 weeks ago

Desperadoze commented 2 months ago

I encounter the following errors in my training script when using flash checkpoint (flash ckpt) with dependencies installed via 'pip install dlrover[torch] -U'.

Desperadoze commented 2 months ago

Error executing job with overrides: []
Traceback (most recent call last):
  File "/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py", line 43, in main
    trainer.fit(model)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 970, in _run
    _log_hyperparams(self)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loggers/utilities.py", line 93, in _log_hyperparams
    logger.log_hyperparams(hparams_initial)
  File "/usr/local/lib/python3.10/dist-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loggers/tensorboard.py", line 181, in log_hyperparams
    return super().log_hyperparams(params=params, metrics=metrics)
  File "/usr/local/lib/python3.10/dist-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning_fabric/loggers/tensorboard.py", line 256, in log_hyperparams
    exp, ssi, sei = hparams(params, metrics)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/tensorboard/summary.py", line 246, in hparams
    ssi.hparams[k].number_value = v
  File "/usr/local/lib/python3.10/dist-packages/google/protobuf/internal/containers.py", line 70, in __getitem__
    return self._values[key]
TypeError: list indices must be integers or slices, not str

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

majieyue commented 1 month ago

Please submit issues in English in the future. I've edited this one. Thank you for using DLRover.

BalaBalaYi commented 4 weeks ago

This is likely a compatibility issue between Python 3.10 and the corresponding protobuf and gRPC versions. Please use Python 3.7 to 3.9 for now. DLRover will resolve the Python 3.10 compatibility issue in the next version (0.4.0).
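
For anyone hitting the same traceback, a quick way to see whether an environment falls outside the suggested Python range is to print the interpreter version alongside the installed protobuf and gRPC packages. The following is only a minimal sketch, not part of DLRover: the helper names `in_suggested_range` and `installed_version` are hypothetical, the PyPI names `protobuf` and `grpcio` are assumed to be the relevant distributions, and `importlib.metadata` requires Python 3.8 or newer.

```python
import sys
from importlib import metadata  # standard library in Python 3.8+


def in_suggested_range(version_info=sys.version_info):
    """Return True when the interpreter is Python 3.7 to 3.9 (the range suggested above)."""
    return (3, 7) <= tuple(version_info[:2]) <= (3, 9)


def installed_version(package):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Report the interpreter version and whether it is in the suggested range.
    print("Python", ".".join(map(str, sys.version_info[:3])),
          "in suggested range:", in_suggested_range())
    # Distribution names here are assumptions about what is installed.
    for name in ("dlrover", "protobuf", "grpcio"):
        print(name, installed_version(name))
```

Including this output in a bug report makes it easier to tell whether the Python 3.10 situation described above applies.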