nejox opened 8 months ago
Faced the same issue. Found that distributed Lightning inference requires a `BasePredictionWriter`:
https://lightning.ai/docs/pytorch/stable/deploy/production_basic.html#enable-distributed-inference
@BohdanBilonoh, would you mind sharing your use of the `BasePredictionWriter` to help other users who face the same issue? It would also be interesting to see in case it is something that we can add to Darts :)
@dennisbader sure. As a hotfix I changed:

`TorchForecastingModel.predict`

Original:

```python
return predictions[0] if called_with_single_series else predictions
```

New code:

```python
if predictions:
    return predictions[0] if called_with_single_series else predictions
else:
    return None
```

`TorchForecastingModel.predict_from_dataset`

Original:

```python
return [ts for batch in predictions for ts in batch]
```

New code:

```python
if predictions:
    return [ts for batch in predictions for ts in batch]
else:
    return None
```
After that, something like this can be added to the trainer callbacks:

```python
import os

import torch
from pytorch_lightning.callbacks import BasePredictionWriter


# or you can set `write_interval="batch"` and override `write_on_batch_end`
# to save predictions at batch level
class CustomWriter(BasePredictionWriter):
    def __init__(self, output_dir, write_interval):
        super().__init__(write_interval)
        self.output_dir = output_dir

    def write_on_epoch_end(self, trainer, pl_module, predictions, batch_indices):
        # this will create N (num processes) files in `output_dir`, each containing
        # the predictions of its respective rank
        torch.save(predictions, os.path.join(self.output_dir, f"predictions_{trainer.global_rank}.pt"))
        # optionally, you can also save `batch_indices` to get the information
        # about the data index from your prediction data
        torch.save(batch_indices, os.path.join(self.output_dir, f"batch_indices_{trainer.global_rank}.pt"))
```
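For context, here is a minimal sketch of how such a writer could be attached through Darts' `pl_trainer_kwargs` (the `output_dir`, model choice, and chunk lengths are illustrative, not from the original post):

```python
from darts.models import TFTModel

pred_writer = CustomWriter(output_dir="./predictions", write_interval="epoch")

model = TFTModel(
    input_chunk_length=24,
    output_chunk_length=12,
    pl_trainer_kwargs={
        "accelerator": "gpu",
        "devices": -1,  # use all available GPUs
        "strategy": "ddp_spawn",
        "callbacks": [pred_writer],  # attach the prediction writer
    },
)
```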
Again, this is a very quick hotfix. As a Darts feature it could be added with logic along these lines.
The following code also shows how to run inference on a single GPU with a model that was trained with DDP:

```python
trained_model.trainer_params["accelerator"] = "gpu"
trained_model.trainer_params["devices"] = 0  # or any other single index (not a list)
trained_model.trainer_params["strategy"] = "auto"
```

After that, the `predict` method will return predictions as usual.
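Putting it together, a minimal sketch of loading a DDP-trained model and predicting on a single GPU (the checkpoint name and series are placeholders, and this assumes the model was trained with `save_checkpoints=True`):

```python
from darts.models import TFTModel

# hypothetical checkpoint name; adjust to your own run
trained_model = TFTModel.load_from_checkpoint(model_name="my_tft", best=True)

trained_model.trainer_params["accelerator"] = "gpu"
trained_model.trainer_params["devices"] = 0  # single device index, not a list
trained_model.trainer_params["strategy"] = "auto"

# `train_series` is a placeholder for whatever series the model was fit on
forecast = trained_model.predict(n=12, series=train_series)
```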
I wanted to chime in to add a +1 on this issue. I was having similar issues and have recreated the error and fix with my scripts as well. I'll add a caveat here that I am also running darts as part of a Kedro pipeline, where training and testing occur in two separate nodes. There is always the possibility that Kedro is interfering with the handling of multiprocessing (e.g. `torch.multiprocessing.freeze_support()`), but I am fairly confident I have isolated this to darts.
My error occurs when running `TFTModel.historical_forecasts` and is the exact `TypeError` described above. I have verified that:

- `historical_forecasts` with `accelerator=gpu, devices=auto, strategy=ddp` causes the `TypeError`.
- `historical_forecasts` with `accelerator=gpu, devices=0, strategy=auto` does not cause an error (i.e. switching from multi-GPU training to single-GPU testing works).

However, the above hotfix does not work when using `TFTModel.historical_forecasts`. I still end up with a `TypeError` in `darts.utils.historical_forecasts._optimized_historical_forecasts` at line 131. I think that, again, the main process is trying to run the method with `predictions=None` because the results from the distributed processes are not returned/processed. I tried adding similar logic to the above (`if predictions: ... else: return None`) to `_optimized_historical_forecasts`, but then I ended up with a PyTorch assertion error: `assert self.num_samples >= 1 or self.total_size == 0`. Is anyone else able to verify whether the above fix works with `historical_forecasts`?
Also, I'm confused why the predictions need to be written to file with the `BasePredictionWriter` (maybe I should read the Lightning docs more closely). Is it possible for them to just be unified in memory?
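In the meantime, once the spawned processes have exited, the per-rank files written by the `CustomWriter` above can at least be loaded back into memory in the main process. A minimal sketch, assuming `output_dir="./predictions"` as above:

```python
import glob

import torch

# gather the per-rank prediction files written by CustomWriter
predictions = []
for path in sorted(glob.glob("./predictions/predictions_*.pt")):
    predictions.extend(torch.load(path))
```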
> but then I ended up with a PyTorch assertion error: `assert self.num_samples >= 1 or self.total_size == 0`. Is anyone else able to verify if the above fix works with `historical_forecasts`?
Same here. Could not figure it out yet...
**Describe the bug**
After training a TFT with the `ddp_spawn` strategy on multiple GPUs in Amazon SageMaker, the returned prediction of the trainer is `None`, leading to a `TypeError: 'NoneType' object is not iterable` in `torch_forecasting_model`.

**To Reproduce**
Code to reproduce the problem. I tried it locally with `"accelerator": "cpu"` as well (don't know if this is even a valid approach) and end up with the same error. => leads to: `TypeError: 'NoneType' object is not iterable`
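Not the author's original code, but a minimal sketch of the kind of setup described (dataset and model parameters are placeholders):

```python
from darts.datasets import AirPassengersDataset
from darts.models import TFTModel

series = AirPassengersDataset().load().astype("float32")

model = TFTModel(
    input_chunk_length=24,
    output_chunk_length=12,
    add_relative_index=True,  # avoids needing future covariates
    n_epochs=1,
    pl_trainer_kwargs={
        "accelerator": "gpu",
        "devices": -1,
        "strategy": "ddp_spawn",
    },
)
model.fit(series)
preds = model.predict(n=12)  # with ddp_spawn, predictions come back as None -> TypeError
```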
**Expected behavior**
Successfully predict values; it does not need to be multi-GPU prediction, as I only need the speed-up for training.
**System:**
**Additional context**
The Lightning documentation recommends `ddp` and advises against `ddp_spawn`, but darts only supports `ddp_spawn`, right? I couldn't get `ddp` running, as I had problems with multiple executions of my script due to one process per GPU and thus multiple checkpoints being created.