unit8co / darts

A python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/

GPU Optimization with Num_Workers not working #2354

Open Laenita opened 5 months ago

Laenita commented 5 months ago

I am not very experienced, but I love this package. However, training only seems to utilize about 1% of my GPU. Increasing the batch size made my predictions far less accurate. I read that increasing num_loader_workers should help, but then I get a log message telling me to set persistent_workers=True in the val_dataloader, which I know Darts does not expose directly, and the model runs 5 times longer. Can you please assist? I just got a better GPU to speed up my training, but I can't get the model to use more of it. Here is my model for reference:

    NHiTS_Model = NHiTSModel(
        model_name="Nhits_run",
        input_chunk_length=input_length_chunk,
        output_chunk_length=forecasting_horizon,
        num_stacks=number_stacks,
        num_blocks=number_blocks,
        num_layers=number_layers,
        layer_widths=lay_widths,
        n_epochs=number_epochs,
        nr_epochs_val_period=number_epochs_val_period,
        batch_size=batch_size,
        dropout=dropout_rate,
        force_reset=True,
        save_checkpoints=True,
        optimizer_cls=torch.optim.AdamW,
        loss_fn=torch.nn.HuberLoss(),
        random_state=rand_state,
        pl_trainer_kwargs={
            "accelerator": "gpu",
            "devices": [0],
        },
    )
    NHiTS_Model.fit(
        series=train,
        past_covariates=train_cov,
        verbose=True,
        val_series=val,
        val_past_covariates=val_cov,
        num_loader_workers=1,
    )
Laenita commented 5 months ago

Oh, and the newer GPU and the much weaker one train for the same length of time, so there is a bottleneck somewhere.

madtoinou commented 5 months ago

Hi @Laenita,

Would you mind sharing the values of the hyperparameters, so that we can get an idea of the number of parameters / the size of the model?

Is GPU utilization stuck at 1% for both the old and the new device?

The pl_trainer_kwargs argument looks good; this is what PyTorch Lightning expects in order to enable GPU acceleration. I would recommend looking up their documentation, as this is what Darts relies on for the deep learning models.
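
A minimal sketch of how extra PyTorch Lightning Trainer options can be forwarded through pl_trainer_kwargs (the parameter values here are placeholders, and the "16-mixed" precision string assumes PyTorch Lightning >= 2.0):

    from darts.models import NHiTSModel

    # Darts forwards this dict to the pytorch_lightning.Trainer constructor,
    # so any Trainer keyword argument can be passed here.
    model = NHiTSModel(
        input_chunk_length=20,
        output_chunk_length=3,
        pl_trainer_kwargs={
            "accelerator": "gpu",
            "devices": [0],
            "precision": "16-mixed",  # mixed precision can raise GPU throughput
        },
    )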

Laenita commented 5 months ago

Hi @madtoinou

Of course, here are the parameters for my model; I hope this helps:

    input_length_chunk = 20
    forecasting_horizon = 3
    number_stacks = 4
    number_blocks = 5
    number_layers = 5
    batch_size = 64
    dropout_rate = 0.1
    number_epochs = 180
    number_epochs_val_period = 1

And yes, both the old and the newer (and much faster) GPU show only 1% utilisation and train for the same amount of time on the same model, which indicates that something is wrong and the GPU is heavily under-utilised.

Also, num_loader_workers=1 is not working at all for me; training takes more than an hour with num_loader_workers > 0.

Thanks for your assistance!

igorrivin commented 5 months ago

Yes, I have the same problem: I am told that num_loader_workers is not a legit parameter.

madtoinou commented 5 months ago

Hi @igorrivin & @Laenita,

As mentioned in another thread, PR #2295 adds support for those arguments. Maybe try installing that branch / copying the changes and see if it solves the bottleneck?

Laenita commented 4 months ago

Hi @madtoinou

I have copied the changes from PR https://github.com/unit8co/darts/pull/2295. But now whenever I add persistent_workers=True and num_loader_workers=16 (or even just 1), it gets stuck on sanity checking. Did I maybe miss anything? Thank you for your assistance!

madtoinou commented 4 months ago

Which sanity checking are you referring to?

Laenita commented 4 months ago

Hi @madtoinou, the best explanation I can give is this screenshot, where the model first goes into a sanity-checking phase before starting training: [screenshot: "Sanity Checking"]

madtoinou commented 1 month ago

Hi @Laenita,

Is the problem still occurring?

The sanity check is a mechanism implemented by PyTorch Lightning (see here); you could try to disable it by passing pl_trainer_kwargs={"num_sanity_val_steps": 0}.
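
For example, a sketch of the model constructor with this option added (the other parameter values are placeholders):

    from darts.models import NHiTSModel

    # Sketch: skip Lightning's pre-training validation sanity check.
    model = NHiTSModel(
        input_chunk_length=20,
        output_chunk_length=3,
        pl_trainer_kwargs={
            "accelerator": "gpu",
            "devices": [0],
            "num_sanity_val_steps": 0,  # disables the "Sanity Checking" phase
        },
    )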

Since #2295 has been merged, you could try again with dataloader_kwargs={"persistent_workers": True, "num_workers": 1} in fit() (the keys follow torch.utils.data.DataLoader's argument names).
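
For example, a sketch of the fit() call with these options, using the series names from your original snippet (the worker count is a placeholder):

    # Sketch: the dataloader_kwargs dict is forwarded to the torch DataLoader.
    NHiTS_Model.fit(
        series=train,
        past_covariates=train_cov,
        val_series=val,
        val_past_covariates=val_cov,
        verbose=True,
        dataloader_kwargs={
            "num_workers": 4,            # number of data-loading worker processes
            "persistent_workers": True,  # keep workers alive between epochs
        },
    )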

Also, does the GPU utilization increase if you increase the size of the model? And when you change the size of the batch? Or is it always 1%?