sktime / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

Trouble training with 2 GPUs #342

Open dempseyryan opened 3 years ago

dempseyryan commented 3 years ago

Similarities

I notice this is similar to #103 and #215, which were seemingly resolved (?).

I should also mention that I'm not sure whether this is perhaps a PyTorch Lightning (PL) issue rather than a pytorch-forecasting one.

Expected behavior

I would like to train a model across 2 GPUs in order to speed up training. I simply set the gpus argument of PyTorch Lightning's Trainer constructor to 2: gpus=2.

Accelerator

With accelerator='ddp', the kernel is perpetually busy but training never begins, while accelerator='dp' raises one error and accelerator='ddp_spawn' raises a different one. Since DDP is not possible from a Jupyter notebook, I think the accelerator I need for my case is ddp_spawn.
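
For reference, the three configurations boil down to the following Trainer settings (a minimal sketch using the PL API of the version in this issue; the model and dataloaders are omitted):

import pytorch_lightning as pl

# the three multi-GPU settings tried above, side by side (PL API of this era)
trainer_ddp = pl.Trainer(gpus=2, accelerator='ddp')          # one process per GPU; works from scripts only
trainer_dp = pl.Trainer(gpus=2, accelerator='dp')            # single process, each batch split across GPUs
trainer_spawn = pl.Trainer(gpus=2, accelerator='ddp_spawn')  # spawned worker processes; everything must be picklable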

Actual behavior

With ddp_spawn, the following error occurs: TypeError: can't pickle torch._C.Generator objects

With ddp, the kernel is perpetually busy and training doesn't start (presumably because I'm using a notebook).

With dp, the following error occurs: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

I suspect this one means that, given how my GPUs are set up, I shouldn't be using dp, but I'm really not sure. I figured I'd try all my options.

Is ddp the only one that works at this time? Is there any way of using 2 GPUs in a notebook?

Thanks for your help.

My understanding is that the sampler doesn't actually matter, because PL overwrites it with a DistributedSampler when the Trainer is instantiated with gpus > 1, but I might be wrong, so I've included it below anyway.
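
For context, here is a rough sketch of what I understand that replacement to amount to (not Lightning's actual internals; num_replicas and rank are illustrative values for a 2-GPU, rank-0 process, and training/BATCH_SIZE are the objects defined in the code below):

import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# each ddp process would get a sampler that only sees its own shard of the dataset,
# replacing whatever sampler/batch_sampler the dataloader was originally built with
dist_sampler = DistributedSampler(training, num_replicas=2, rank=0, shuffle=True)
sharded_loader = DataLoader(
    training,
    batch_size=BATCH_SIZE,
    sampler=dist_sampler,
    collate_fn=training._collate_fn, # the dataset's own (private) collate function
)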

Code to reproduce the problem

### fit network
trainer.fit(
    tft,
    train_dataloader=train_dataloader,
    val_dataloaders=val_dataloader,
)

This call is where all three issues occur. Please see the dataset, dataloader, and trainer initialization below:

### Split up dataset appropriately

max_prediction_length = VALIDATION_LENGTH + TEST_LENGTH # Number of hours NOT shown to network

training_cutoff = data["time_idx"].max() - max_prediction_length # which hour value to stop training
parameters = data.columns[:3].tolist()

## Create training dataset
training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="occupancy",
#     group_ids=parameters,
    group_ids=[parameters[0]],
    min_encoder_length=MIN_ENCODER_LENGTH,
#     min_encoder_length=VALIDATION_LENGTH,
    max_encoder_length=MAX_ENCODER_LENGTH,
    min_prediction_length=MIN_PREDICTION_LENGTH,
    max_prediction_length=max_prediction_length,
#     static_categoricals=parameters,
    static_categoricals=[parameters[0]],
#     static_reals=[],
    static_reals=parameters[1:3],
    time_varying_known_categoricals=["day_wk", "time_day"],
    variable_groups={},
    time_varying_known_reals=["time_idx"],
    time_varying_unknown_categoricals=[],
    time_varying_unknown_reals=["occupancy"],
    target_normalizer=GroupNormalizer(transformation=None, center=False),
    randomize_length=True, # randomize time-length of samples between MIN and MAX encoder length (should configure a specific beta distribution at some point)
    add_relative_time_idx=True,
    add_target_scales=True,
    add_encoder_length=True, # add the (randomized) encoder length as a feature
)

# create validation set (predict=True) which means to predict the last max_prediction_length points in time
# for each series
val_cutoff = data["time_idx"].max() - TEST_LENGTH

validation = TimeSeriesDataSet.from_dataset(
    training,
    data[lambda x: x.time_idx <= val_cutoff],
    predict=True,
    min_prediction_idx=training_cutoff + 1, # since max pred len of train > val len, set min val index
    max_prediction_length=VALIDATION_LENGTH,
    min_prediction_length=1,
    stop_randomization=True,
)

## Configure sampling
sampler = torch.utils.data.RandomSampler(
    training,
    replacement=True,
    num_samples=2500, # randomly sample 2500 mini-timeseries from various channels; length determined by encoder length
    generator=torch.Generator(),
)

batch_sampler = torch.utils.data.BatchSampler(sampler, batch_size=BATCH_SIZE // 2, drop_last=False)
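# note: the torch.Generator above is a torch._C.Generator under the hood and is not picklable,
# which is likely what the ddp_spawn TypeError is complaining about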

# create dataloaders for model
train_dataloader = training.to_dataloader(
    train=True,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    batch_sampler=batch_sampler, # PL should overwrite this with a DistributedSampler when gpus > 1
)

# train_dataloader = torch.utils.data.DataLoader(training, batch_size=BATCH_SIZE)

val_dataloader = validation.to_dataloader(
    train=False,
    batch_size=BATCH_SIZE * 8,
    num_workers=NUM_WORKERS,
)

### configure network and trainer
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-9, patience=10, verbose=True, mode="min")
lr_logger = LearningRateMonitor()  # log the learning rate
logger = TensorBoardLogger("lightning_logs")  # logging results to a tensorboard
print_updates = PrintMetrics(10) # permanently print losses every 10 batches

trainer = pl.Trainer(
    max_epochs=100,
#     max_epochs=1, # test
    gpus=2, # use all GPUs
#     accelerator='ddp_spawn', # can't use ddp in jupyter
    accelerator='dp',
    weights_summary='top',
#     gradient_clip_val=0.01,
    limit_train_batches=0.1,
#     shuffle=True,
#     fast_dev_run=True,  # comment in to check that network or dataset has no serious bugs
    callbacks=[lr_logger, early_stop_callback, print_updates],
    logger=logger,
)

tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.002,
    lstm_layers=2,
    hidden_size=128,
    attention_head_size=4,
    dropout=0.3,
    hidden_continuous_size=128,
    output_size=7,  # 7 quantiles by default
    loss=QuantileLoss(),
    log_interval=10,  # log every 10 batches; leave commented out when running the learning rate finder
    reduce_on_plateau_patience=5,
)

P.S. Loving this package. Getting some good results!

Edit: I had accidentally swapped two of the error cases; they are corrected above.

jdb78 commented 3 years ago

ddp is the preferred solution but does not work in notebooks (see #215). The ddp_spawn bug probably needs some investigation into how certain objects are pickled; in general, ddp_spawn is also not recommended for speed reasons. See https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html#distributed-data-parallel-spawn.

Would it be possible for you to run a script instead? ddp seems to be the best solution.
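
For anyone hitting this from a notebook, that basically means moving the setup into a plain script, roughly like this (a sketch only; train_tft.py is a hypothetical file name, and the dataset/model construction from the original post is assumed to be pasted in):

# train_tft.py -- run with `python train_tft.py` instead of from a notebook cell
import pytorch_lightning as pl

if __name__ == "__main__":
    # ... build the TimeSeriesDataSets, dataloaders and tft exactly as in the original post ...
    trainer = pl.Trainer(gpus=2, accelerator='ddp', max_epochs=100)
    trainer.fit(tft, train_dataloader=train_dataloader, val_dataloaders=val_dataloader)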

dempseyryan commented 3 years ago

I tried training with a script (identical code, just no longer in a notebook) with accelerator='ddp', and I get the following error about encoder lengths:

Traceback (most recent call last):
  File "Beginning_pipeline.py", line 407, in <module>
    val_dataloaders=val_dataloader,
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 473, in fit
    results = self.accelerator_backend.train()
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 152, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 305, in ddp_train
    results = self.train_or_test()
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 69, in train_or_test
    results = self.trainer.train()
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 524, in train
    self.train_loop.run_training_epoch()
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 564, in run_training_epoch
    for batch_idx, (batch, is_last_batch) in train_dataloader:
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_lightning/profiler/profilers.py", line 83, in profile_iterable
    value = next(iterator)
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 46, in _with_is_last
    last = next(it)
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_forecasting/data/timeseries.py", line 1543, in _collate_fn
    encoder_lengths = torch.tensor([batch[0]["encoder_length"] for batch in batches], dtype=torch.long)
  File "/home/ubuntu/crcenv/lib/python3.7/site-packages/pytorch_forecasting/data/timeseries.py", line 1543, in <listcomp>
    encoder_lengths = torch.tensor([batch[0]["encoder_length"] for batch in batches], dtype=torch.long)
KeyError: 0

I don't know enough about this stuff to speculate whether this is a) an issue with PL's fit() method or b) an issue with PF's TFT class, but I figured I would post this here in case anyone has ideas.
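
For reference, this is the quick sanity check I can run outside the Trainer to see what the dataloader actually yields (just a diagnostic sketch based on what _collate_fn in the traceback seems to expect, i.e. (x_dict, y) tuples with an "encoder_length" entry):

# diagnostic sketch: inspect one raw sample and one collated batch outside the Trainer
sample = training[0]                  # one raw sample from the TimeSeriesDataSet
print(type(sample), type(sample[0]))  # expecting a tuple whose first element is a dict
print("encoder_length" in sample[0])  # the key the collate function indexes

x, y = next(iter(train_dataloader))   # one collated batch, before PL swaps in its own sampler
print(list(x.keys()))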

nicocheh commented 2 years ago

@dempseyryan any news on this? I tried as well and had the same problem with DeepAR.

dempseyryan commented 2 years ago

@nicocheh unfortunately I never wound up resolving it. I'm no longer working on the project, but my "workaround" (if you can even call it that) was to train multiple networks simultaneously, each on its own GPU. That way I could still speed up hyperparameter tuning and experimentation. Of course, once everything is tuned and you want one final training run, it would be nice to get multi-GPU training working properly...
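
For what it's worth, the workaround just amounts to pinning each run to a single device and launching the runs as separate processes, roughly like this (a sketch, using the gpus-as-list form of the old PL API):

import pytorch_lightning as pl

# one-GPU-per-experiment workaround: each hyperparameter configuration gets its own
# Trainer pinned to a single device and is launched as its own script/process
trainer_a = pl.Trainer(gpus=[0], max_epochs=100)  # experiment A on cuda:0
trainer_b = pl.Trainer(gpus=[1], max_epochs=100)  # experiment B on cuda:1
# trainer_a.fit(model_a, ...) and trainer_b.fit(model_b, ...) then run concurrently in separate processes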