sp-uhh / sgmse

Score-based Generative Models (Diffusion Models) for Speech Enhancement and Dereverberation
MIT License

Can this code be trained without the ddp strategy? #44

Closed Lieber0402 closed 2 months ago

Lieber0402 commented 5 months ago

Training with the default configuration produces the following error:

```
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
E:\anaconda3\envs\sgmse\lib\site-packages\pytorch_lightning\trainer\connectors\logger_connector\logger_connector.py:67: Starting from v1.9.0, tensorboardX has been removed as a dependency of the pytorch_lightning package, due to potential conflicts with other packages in the ML ecosystem. For this reason, logger=True will use CSVLogger as the default logger, unless the tensorboard or tensorboardX packages are found. Please pip install lightning[extra] or one of them to enable TensorBoard support by default
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
```

```
Traceback (most recent call last):
  File "train.py", line 123, in <module>
    trainer.fit(model, ckpt_path=args.ckpt)
  File "E:\anaconda3\envs\sgmse\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "E:\anaconda3\envs\sgmse\lib\site-packages\pytorch_lightning\trainer\call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "E:\anaconda3\envs\sgmse\lib\site-packages\pytorch_lightning\strategies\launchers\subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "E:\anaconda3\envs\sgmse\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "E:\anaconda3\envs\sgmse\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 947, in _run
    self.strategy.setup_environment()
  File "E:\anaconda3\envs\sgmse\lib\site-packages\pytorch_lightning\strategies\ddp.py", line 148, in setup_environment
    self.setup_distributed()
  File "E:\anaconda3\envs\sgmse\lib\site-packages\pytorch_lightning\strategies\ddp.py", line 199, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "E:\anaconda3\envs\sgmse\lib\site-packages\lightning_fabric\utilities\distributed.py", line 290, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "E:\anaconda3\envs\sgmse\lib\site-packages\torch\distributed\c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "E:\anaconda3\envs\sgmse\lib\site-packages\torch\distributed\distributed_c10d.py", line 1177, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "E:\anaconda3\envs\sgmse\lib\site-packages\torch\distributed\rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
  File "E:\anaconda3\envs\sgmse\lib\site-packages\torch\distributed\rendezvous.py", line 174, in _create_c10d_store
    return TCPStore(
torch.distributed.DistNetworkError: Unknown error
```

However, by changing the train.py code to this:

```python
trainer = pl.Trainer(
    max_epochs=5,
    **vars(arg_groups['Trainer']),
    logger=logger,
    log_every_n_steps=10,
    num_sanity_val_steps=0,
    callbacks=callbacks
)
```

i.e., removing the strategy argument so that the code trains on only the single GPU in the machine, I get this error instead:

```
Traceback (most recent call last):
  File "train.py", line 123, in <module>
    trainer.fit(model, ckpt_path=args.ckpt)
  File "E:\anaconda3\envs\sgmse\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "E:\anaconda3\envs\sgmse\lib\site-packages\pytorch_lightning\trainer\call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "E:\anaconda3\envs\sgmse\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "E:\anaconda3\envs\sgmse\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 989, in _run
    results = self._run_stage()
  File "E:\anaconda3\envs\sgmse\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "E:\anaconda3\envs\sgmse\lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 197, in run
    self.reset()
  File "E:\anaconda3\envs\sgmse\lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 303, in reset
    self.trainer.model.train()
TypeError: train() missing 1 required positional argument: 'mode'
```

Any ideas on how to solve this?

julius-richter commented 2 months ago

We don't encounter problems with the ddp strategy.

For your specific error message, this is what ChatGPT suggests:

The errors you are encountering arise from two different issues: one in the distributed training setup and one in single-GPU training with PyTorch Lightning.

Let's tackle each problem one by one and provide potential solutions.

Error 1: Distributed Training Setup

The error you're seeing in the first case relates to setting up the distributed environment:

```plaintext
torch.distributed.DistNetworkError: Unknown error
```

This kind of error usually happens when the trainer tries to initialize a distributed process group but fails to connect to the specified address. Possible causes include:

  • Incorrect or missing environment variables like MASTER_ADDR, MASTER_PORT, etc.
  • Network connectivity issues between the processes.
  • Firewall settings blocking the connection.

Ensure that your environment is properly configured for distributed training. For a single-machine multi-GPU setup, you can set the distributed backend explicitly or use the ddp_spawn strategy, which may offer more stability in some environments (a sketch combining these points follows the list below):

  1. Check Environment Variables: Ensure that environment variables for distributed training are set correctly.

```bash
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=29500
```

  2. Adjust Trainer Configuration: Use the ddp_spawn strategy in your PyTorch Lightning trainer:

```python
from pytorch_lightning import Trainer

trainer = Trainer(
    strategy="ddp_spawn",
    max_epochs=5,
    **vars(arg_groups['Trainer']),
    logger=logger,
    log_every_n_steps=10,
    num_sanity_val_steps=0,
    callbacks=callbacks
)
```
  3. Check Firewall Settings: Ensure that ports are open and there are no firewall rules blocking the connection.
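Since the paths in your traceback point to a Windows Anaconda environment, the `export` syntax above does not apply and the NCCL backend is unavailable there. Below is a minimal sketch that sets the rendezvous variables from Python and picks the gloo backend explicitly via DDPStrategy; it assumes `arg_groups['Trainer']`, `logger`, and `callbacks` are the objects already built in train.py and that the parsed Trainer group does not set `strategy` itself:

```python
import os

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# Point 1: set the rendezvous address/port before the Trainer is created
# (portable equivalent of the `export` lines above).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Point 2: choose the process-group backend explicitly; NCCL is not available
# on Windows, so "gloo" is the usual choice there (assumption about the setup).
trainer = Trainer(
    strategy=DDPStrategy(process_group_backend="gloo"),
    max_epochs=5,
    **vars(arg_groups['Trainer']),  # assumed not to contain a 'strategy' entry
    logger=logger,
    log_every_n_steps=10,
    num_sanity_val_steps=0,
    callbacks=callbacks
)
```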

Error 2: Single-GPU Training

The error you're encountering when switching to single-GPU training is:

```plaintext
self.trainer.model.train()
TypeError: train() missing 1 required positional argument: 'mode'
```

This error suggests that the train method is not being called correctly on your model. It usually means that the train method has been overridden in your model class with a signature that does not match torch.nn.Module.train.
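As a quick sanity check, you can print the signature of the model's train attribute; here `model` is simply a placeholder for the LightningModule instance you pass to trainer.fit:

```python
import inspect

# 'model' is a placeholder for the LightningModule instance passed to trainer.fit
print(inspect.signature(model.train))
# An unmodified nn.Module / LightningModule reports a 'mode' parameter with a
# default of True. If 'mode' has no default, or 'train' has been rebound to
# something else, Lightning's internal call self.trainer.model.train() raises
# exactly the TypeError shown above.
```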

Here's how you can resolve this:

  1. Check the Model Class: Ensure that your model class deriving from LightningModule does not have a custom train method, or if it does, it should match the signature of the base torch.nn.Module class:

```python
import torch
from pytorch_lightning import LightningModule

class YourModel(LightningModule):
    def __init__(self, *args, **kwargs):  # replace with your actual model arguments
        super().__init__()
        # your model components initialization here

    def forward(self, x):
        # your forward pass here
        return x

    def training_step(self, batch, batch_idx):
        # training step implementation
        pass

    def train(self, mode=True):  # ensure this signature matches torch.nn.Module.train
        super().train(mode)  # call the base class method
        # add any custom behavior if needed
        return self  # nn.Module.train returns the module itself
```

If you need custom behavior during switching between train/eval modes, make sure to call the superclass method to avoid this error.
  2. Reinitialize Trainer Without Strategy: Initialize your trainer without specifying the strategy explicitly, which will default to the correct settings for a single GPU (an explicit single-GPU variant is sketched after the snippet below):

```python
from pytorch_lightning import Trainer

trainer = Trainer(
    max_epochs=5,
    **vars(arg_groups['Trainer']),
    logger=logger,
    log_every_n_steps=10,
    num_sanity_val_steps=0,
    callbacks=callbacks
)
```
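For completeness, a variant of the snippet above that pins the run explicitly to one GPU rather than relying on the defaults; again only a sketch, assuming `arg_groups['Trainer']` does not already set `accelerator` or `devices`:

```python
from pytorch_lightning import Trainer

# Explicit single-GPU setup: no 'strategy' argument, one CUDA device.
trainer = Trainer(
    accelerator="gpu",
    devices=1,
    max_epochs=5,
    **vars(arg_groups['Trainer']),  # assumed not to set accelerator/devices
    logger=logger,
    log_every_n_steps=10,
    num_sanity_val_steps=0,
    callbacks=callbacks
)
```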

By following these steps, you should be able to resolve the issues with both distributed and single-GPU training in PyTorch Lightning. If problems persist, make sure your PyTorch Lightning and PyTorch versions are compatible, and consider consulting the official documentation or community forums for further troubleshooting.