Hi,
Sorry for the delayed response, I was on annual leave. I just pushed a minor update that should also fix some incompatibilities from using newer dependencies.
For 2., you can try running the test_train_dgmr test in tests/test_model.py and set it to use GPUs; that should give a quicker feedback loop on using the model. When I ran it on my CPU there were no errors, so I think you might be right that it's an issue with moving the batches to the GPU for some reason. That also gets rid of a lot of the extra code that might be masking the issue.
You should be able to run that from the folder it is in, I believe, but moving it to the root should also work.
It was tensorflow>2.0; newer versions should work as well.
The CUDA version is 11.8. The model currently doesn't run on my local machine, so I can try taking a look later on our ML machine.
@jacobbieker Thank you for your help. I have made some progress.
Your updates did make the code runnable. I obtained a server with a GPU that has 40 GB of VRAM, and it allowed me to run test_train_dgmr.
For train/run.py I had to change num_workers from 6 to 0 for the trainer and validator dataloaders in order to overcome CUDA error: unspecified launch failure:

```python
dataloader = DataLoader(TFDataset(split="train"), batch_size=1, num_workers=0)
```
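For completeness, the same change applied to both loaders; a minimal sketch (TFDataset is the dataset class defined in train/run.py, and the "validation" split name is my assumption):

```python
from torch.utils.data import DataLoader
from run import TFDataset  # assumption: run next to train/run.py, where TFDataset is defined

# num_workers=0 keeps data loading in the main process, which is what avoided the
# "CUDA error: unspecified launch failure" in my case.
train_dataloader = DataLoader(TFDataset(split="train"), batch_size=1, num_workers=0)
val_dataloader = DataLoader(TFDataset(split="validation"), batch_size=1, num_workers=0)  # split name assumed
```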
I set batch_size=1 for both dataloaders while trying to overcome this error, but I did not succeed:
File "/home/arutkovskii_umass_edu/.conda/envs/dgmr-venv/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 12.06 MiB is free. Including non-PyTorch memory, this process has 39.38 GiB memory in use. Of the allocated memory 1.73 GiB is allocated by PyTorch, and 113.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
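Before reaching for a bigger GPU, the fragmentation hint in the message itself can also be tried; a minimal sketch (the 512 MB split size is just an example value, not something from this repo):

```python
import os

# Must be set before the first CUDA allocation (or exported in the shell:
#   export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512).
# This caps the size of cached memory blocks to reduce fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
```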
I think I might get it solved by getting a server with larger VRAM (80 GB instead of 40 GB). Coming from that, I have a couple of questions (listed further below).
Thanks!
Following your advice, I managed to secure a server with a GPU with 80 GB of VRAM, but unfortunately the error persists. My trials continue, but it seems that VRAM size alone might not be the root of the problem. Any further insights would be greatly appreciated!
File "/home/arutkovskii_umass_edu/.conda/envs/dgmr-venv/lib/python3.9/site-packages/torch/nn/utils/parametrizations.py", line 470, in forward
return weight / sigma
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total capacty of 79.15 GiB of which 10.88 MiB is free. Including non-PyTorch memory, this process has 79.14 GiB memory in use. Of the allocated memory 1.83 GiB is allocated by PyTorch, and 24.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Decreasing generation_steps from 6 to 2 (i.e., changing the default to generation_steps: int = 2,) allowed the 80 GB GPU to run the training.
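For anyone else hitting the same OOM, the change is just the generation_steps argument of the model; a minimal sketch (the import follows the repo README, and any other constructor arguments stay as in train/run.py):

```python
from dgmr import DGMR

# generation_steps is the number of generator samples drawn per training step;
# lowering it from the default 6 to 2 is what let the 80 GB A100 fit the model.
model = DGMR(generation_steps=2)
```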
@jacobbieker

GPU and Memory Questions:
- How large are the GPU(s) on the machine on which the training of DGMR was done?
- Did you keep generation_steps: int = 6, when you ran the training on the ML machine?

Usage Questions:
- Has anyone got the training working on multiple GPUs?

Does anybody have answers to these questions? Thank you in advance!
Please update me if you have managed to solve this issue; I am also encountering it.
Hi, we used an A6000, which has 48GB of GPU memory, to do the initial training of DGMR. We haven't tried it with multiple GPUs recently; last time I did, the spectral norm didn't work on multiple GPUs, but that might have been fixed now.
@jacobbieker Was 48GB of GPU memory enough to do training with 6 generation steps?
And what was the issue with spectral norm? Was it making the model metrics become nan?
@Javelin1991 I used an A100 with 80GB of memory to run a few training epochs with 2 generation steps; 3 or more led to out-of-memory errors. Therefore I am looking into trying multiple GPUs to have more memory for 6 generation steps.
Hi, the 48GB was enough for a few steps, though I'm not sure if it was all 6; it's been a while, unfortunately. There were a few issues: originally it was because of nan values (see https://github.com/openclimatefix/skillful_nowcasting/issues/10), but then there were issues when trying multi-GPU with the spectral norm, with errors about modifying in-place (see https://github.com/openclimatefix/skillful_nowcasting/issues/47, although I noticed you commented there too!). I might be able to try training it again in a couple of weeks and dig into it more, but unfortunately I don't have much time for that soon.
Thank you for the information. I will soon try to train on multiple GPUs.
Leaving the solution and an explanation of the nan issue here for people in case they encounter the same problem:
Explanation:
I resolved the nan issue on my side. Indeed, the DGMR code was fine; the problem was with my data.

Issue:
The data is read with from netCDF4 import Dataset, which by default creates masked numpy arrays. Doing calculations on the masked array produces nan.

Solution:
```python
import numpy as np

def __check_rr_data(self, rr_data):
    # netCDF4 returns MaskedArrays by default; fill masked cells with 0,
    # otherwise later calculations can propagate nan into training.
    if isinstance(rr_data, np.ma.MaskedArray):
        rr_data = np.ma.filled(rr_data, 0)  # filled() returns a new array, so reassign
    # Check for abnormal values (anything at or above 65535) and zero them out
    if rr_data.max() >= 65535:
        rr_data[rr_data >= 65535] = 0
    return rr_data
```
More about that issue in "5b Write out mean, variance" here: https://towardsdatascience.com/debugging-a-machine-learning-model-written-in-tensorflow-and-keras-f514008ce736
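For context, a minimal sketch of how the masked read happens and how the fix above is applied before the data reaches PyTorch (the file name "radar.nc" and the variable name "precipitation" are placeholders for my dataset):

```python
import numpy as np
import torch
from netCDF4 import Dataset

# "radar.nc" and "precipitation" are placeholders for the actual file/variable names.
with Dataset("radar.nc") as nc:
    rr_data = nc.variables["precipitation"][:]  # netCDF4 auto-masking returns a MaskedArray

print(isinstance(rr_data, np.ma.MaskedArray))   # True by default

# Fill masked cells and zero out abnormal values before converting to a tensor,
# so no invalid entries propagate through training as nan.
rr_data = np.ma.filled(rr_data, 0)
rr_data[rr_data >= 65535] = 0
frames = torch.from_numpy(rr_data.astype(np.float32))
```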
@jacobbieker
Hi Jacob,
I continued working with your implementation of DGMR and can now train it with our data in the full configuration (6 generation steps, batch size 16, precision 32) on NVIDIA A100-80GB, whether on 1 GPU or on 8 GPUs running via DDP.
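For anyone reproducing the multi-GPU run, a minimal sketch of the Lightning Trainer settings involved (argument names follow pytorch-lightning 2.x; model and datamodule construction stay as in train/run.py):

```python
import pytorch_lightning as pl

# 8 x A100-80GB via DDP at full precision, matching the configuration described above.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,          # or devices=1 for the single-GPU run
    strategy="ddp",
    precision=32,
)
# trainer.fit(model, datamodule)  # model built with generation_steps=6, dataloaders with batch_size=16
```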
I created a pull request with a few additions to help other users get started with limited resources. Please take a look: https://github.com/openclimatefix/skillful_nowcasting/pull/77
Describe the bug
I am trying to run ./train/run.py and I have several issues:
1. Incompatibilities due to having a newer version of PyTorch than the one the project was originally developed on.
2. I keep encountering the error below, which I suspect comes from a mismatch between the versions of the dependencies DGMR was developed with. I assume there is an issue with transferring a batch to the GPU device (I am not sure). Let me know if you have any suggestions on how I can verify it.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run.py", line 239, in <module>
trainer.fit(model, datamodule)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
call._call_and_handle_interrupt(
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 67, in _call_and_handle_interrupt
trainer._teardown()
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1003, in _teardown
self.strategy.teardown()
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 498, in teardown
self.lightning_module.cpu()
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 79, in cpu
return super().cpu()
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 967, in cpu
return self._apply(lambda t: t.cpu())
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
[Previous line repeated 3 more times]
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 857, in _apply
self._buffers[key] = fn(buf)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 967, in
return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: 🚀 View run glowing-salad-28 at: https://wandb.ai/nowcasting-research/dgmr/runs/edi3updl
wandb: ️⚡ View job at https://wandb.ai/nowcasting-research/dgmr/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjEwMzkwMzkwNA==/version_details/v6
wandb: Synced 5 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20231008_213620-edi3updl/logs
(venv) ➜ skillful_nowcasting git:(main) ✗
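On the question above of how to verify whether moving a batch to the GPU is itself the problem, one check is to bypass Lightning and copy a few batches manually with synchronous kernel launches; a minimal sketch (importing TFDataset from train/run.py and assuming each batch is a tuple of tensors):

```python
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # surface CUDA errors at the failing call

import torch
from torch.utils.data import DataLoader
from run import TFDataset  # assumption: run from the train/ folder where run.py lives

loader = DataLoader(TFDataset(split="train"), batch_size=1, num_workers=0)
for i, batch in enumerate(loader):
    # Assuming the batch is a tuple/list of tensors; move each one to the GPU.
    batch = [x.cuda() if torch.is_tensor(x) else x for x in batch]
    torch.cuda.synchronize()  # force any asynchronous transfer error to show up here
    if i == 9:
        break
print("Moved 10 batches to the GPU without errors")
```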
Dependency versions:
torch==2.1.0
antialiased-cnns==0.3
pytorch-msssim==1.0.0
numpy==1.24.3
torchvision==0.16.0
pytorch-lightning==2.0.9.post0
einops==0.7.0
huggingface-hub==0.17.3
tensorflow==2.13.1