Hi,
Sorry for the delayed response, I was on annual leave. I just pushed a minor update that should also fix some incompatibilities from using newer dependencies.
For 2., you can try running the test_train_dgmr test in tests/test_model.py and set it to use GPUs; that should give a quicker feedback loop on using the model. When I ran it on my CPU there were no errors, so I think you might be right that it's an issue with moving the batches to the GPU for some reason. That also gets rid of a lot of the extra code that might be masking the issue.
You should be able to run that from the folder it is in, I believe, but moving it to the root should also work.
It was tensorflow>2.0; newer versions should work as well.
The CUDA version is 11.8. The model currently doesn't run on my local machine, so I can try taking a look later on our ML machine.
@jacobbieker Thank you for your help. I have made some progress.
Your updates did make the code runnable. I obtained a server with a GPU that has 40 GB of VRAM, and it allowed me to run test_train_dgmr.
For train/run.py I had to change num_workers from 6 to 0 for the trainer and validator dataloaders in order to overcome CUDA error: unspecified launch failure:

```python
dataloader = DataLoader(TFDataset(split="train"), batch_size=1, num_workers=0)
```
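For completeness, the same change applied to both loaders; a minimal sketch (TFDataset is the dataset class defined in train/run.py, and the "validation" split name is my assumption):

```python
from torch.utils.data import DataLoader
from run import TFDataset  # assumption: run next to train/run.py, where TFDataset is defined

# num_workers=0 keeps data loading in the main process, which is what avoided the
# "CUDA error: unspecified launch failure" in my case.
train_dataloader = DataLoader(TFDataset(split="train"), batch_size=1, num_workers=0)
val_dataloader = DataLoader(TFDataset(split="validation"), batch_size=1, num_workers=0)  # split name assumed
```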
I set batch_size=1 for both dataloaders while trying to overcome this error, but I did not succeed:
File "/home/arutkovskii_umass_edu/.conda/envs/dgmr-venv/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 12.06 MiB is free. Including non-PyTorch memory, this process has 39.38 GiB memory in use. Of the allocated memory 1.73 GiB is allocated by PyTorch, and 113.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
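Before reaching for a bigger GPU, the fragmentation hint in the message itself can also be tried; a minimal sketch (the 512 MB split size is just an example value, not something from this repo):

```python
import os

# Must be set before the first CUDA allocation (or exported in the shell:
#   export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512).
# This caps the size of cached memory blocks to reduce fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
```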
I think I might get it solved by getting a server with larger VRAM (80 GB instead of 40 GB). Coming from that, I have a couple of questions (listed further below).
Thanks!
Following your advice, I managed to secure a server with a GPU with 80 GB of VRAM, but unfortunately the error persists. My trials continue, but it seems that VRAM size alone might not be the root of the problem. Any further insights would be greatly appreciated!
File "/home/arutkovskii_umass_edu/.conda/envs/dgmr-venv/lib/python3.9/site-packages/torch/nn/utils/parametrizations.py", line 470, in forward
return weight / sigma
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total capacty of 79.15 GiB of which 10.88 MiB is free. Including non-PyTorch memory, this process has 79.14 GiB memory in use. Of the allocated memory 1.83 GiB is allocated by PyTorch, and 24.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Decreasing generation_steps from 6 to 2 (i.e., changing the default to generation_steps: int = 2,) allowed the 80 GB GPU to run the training.
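For anyone else hitting the same OOM, the change is just the generation_steps argument of the model; a minimal sketch (the import follows the repo README, and any other constructor arguments stay as in train/run.py):

```python
from dgmr import DGMR

# generation_steps is the number of generator samples drawn per training step;
# lowering it from the default 6 to 2 is what let the 80 GB A100 fit the model.
model = DGMR(generation_steps=2)
```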
@jacobbieker

GPU and Memory Questions:
- How large are the GPU(s) on the machine on which the training of DGMR was done?
- Did you keep generation_steps: int = 6, when you ran the training on the ML machine?

Usage Questions:
- Has anyone got the training working on multiple GPUs?

Does anybody have answers to these questions? Thank you in advance!
Please update me if you have managed to solve this issue; I am also encountering it.
Hi, we used an A6000, which has 48GB of GPU memory, to do the initial training of DGMR. We haven't tried it with multiple GPUs recently; last time I did, the spectral norm didn't work on multiple GPUs, but that might have been fixed now.
@jacobbieker Was 48GB of GPU memory enough to do training with 6 generation steps?
And what was the issue with spectral norm? Was it making the model metrics become nan?
@Javelin1991 I used an A100 with 80GB of memory to run a few training epochs with 2 generation steps; 3 or more led to out-of-memory errors. Therefore I am looking into trying multiple GPUs to have more memory for 6 generation steps.
Hi, the 48GB was enough for a few steps, though I'm not sure if it was all 6; it's been a while, unfortunately. There were a few issues: originally it was because of nan values (see https://github.com/openclimatefix/skillful_nowcasting/issues/10), but then there were issues when trying multi-GPU with the spectral norm, with errors about modifying in-place (see https://github.com/openclimatefix/skillful_nowcasting/issues/47, although I noticed you commented there too!). I might be able to try training it again in a couple of weeks and dig into it more, but unfortunately I don't have much time for that soon.
Thank you for the information. I will soon try to train on multiple GPUs.
Leaving the solution and an explanation of the nan issue here for people in case they encounter the same problem:
Explanation:
I resolved the nan issue on my side. Indeed, the DGMR code was fine; the problem was with my data.

Issue:
The data is read with from netCDF4 import Dataset, which by default creates masked numpy arrays. Doing calculations on the masked array produces nan.

Solution:
```python
import numpy as np

def __check_rr_data(self, rr_data):
    # netCDF4 returns MaskedArrays by default; fill masked cells with 0,
    # otherwise later calculations can propagate nan into training.
    if isinstance(rr_data, np.ma.MaskedArray):
        rr_data = np.ma.filled(rr_data, 0)  # filled() returns a new array, so reassign
    # Check for abnormal values (anything at or above 65535) and zero them out
    if rr_data.max() >= 65535:
        rr_data[rr_data >= 65535] = 0
    return rr_data
```
More about that issue in "5b Write out mean, variance" here: https://towardsdatascience.com/debugging-a-machine-learning-model-written-in-tensorflow-and-keras-f514008ce736
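For context, a minimal sketch of how the masked read happens and how the fix above is applied before the data reaches PyTorch (the file name "radar.nc" and the variable name "precipitation" are placeholders for my dataset):

```python
import numpy as np
import torch
from netCDF4 import Dataset

# "radar.nc" and "precipitation" are placeholders for the actual file/variable names.
with Dataset("radar.nc") as nc:
    rr_data = nc.variables["precipitation"][:]  # netCDF4 auto-masking returns a MaskedArray

print(isinstance(rr_data, np.ma.MaskedArray))   # True by default

# Fill masked cells and zero out abnormal values before converting to a tensor,
# so no invalid entries propagate through training as nan.
rr_data = np.ma.filled(rr_data, 0)
rr_data[rr_data >= 65535] = 0
frames = torch.from_numpy(rr_data.astype(np.float32))
```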
@jacobbieker
Hi Jacob,
I continued working with your implementation of DGMR and can now train it with our data in the full configuration (6 generation steps, batch size 16, precision 32) on NVIDIA A100-80GB, whether on 1 GPU or on 8 GPUs running via DDP.
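For anyone reproducing the multi-GPU run, a minimal sketch of the Lightning Trainer settings involved (argument names follow pytorch-lightning 2.x; model and datamodule construction stay as in train/run.py):

```python
import pytorch_lightning as pl

# 8 x A100-80GB via DDP at full precision, matching the configuration described above.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,          # or devices=1 for the single-GPU run
    strategy="ddp",
    precision=32,
)
# trainer.fit(model, datamodule)  # model built with generation_steps=6, dataloaders with batch_size=16
```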
I created a pull request with a few additions to help other users get started with limited resources. Please take a look: https://github.com/openclimatefix/skillful_nowcasting/pull/77
Describe the bug
I am trying to run ./train/run.py and I have several issues:
1. Incompatibilities due to having a newer version of PyTorch than the one the project was originally developed on.
2. I keep encountering the error below, which I suspect comes from a mismatch between the versions of the dependencies DGMR was developed with. I assume there is an issue with transferring a batch to the GPU device (I am not sure). Let me know if you have any suggestions on how I can verify it.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run.py", line 239, in <module>
trainer.fit(model, datamodule)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
call._call_and_handle_interrupt(
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 67, in _call_and_handle_interrupt
trainer._teardown()
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1003, in _teardown
self.strategy.teardown()
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 498, in teardown
self.lightning_module.cpu()
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 79, in cpu
return super().cpu()
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 967, in cpu
return self._apply(lambda t: t.cpu())
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
[Previous line repeated 3 more times]
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 857, in _apply
self._buffers[key] = fn(buf)
File "/home/arutkovskii_umass_edu/skillful_nowcasting/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 967, in
return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: 🚀 View run glowing-salad-28 at: https://wandb.ai/nowcasting-research/dgmr/runs/edi3updl
wandb: ️⚡ View job at https://wandb.ai/nowcasting-research/dgmr/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjEwMzkwMzkwNA==/version_details/v6
wandb: Synced 5 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20231008_213620-edi3updl/logs
(venv) ➜ skillful_nowcasting git:(main) ✗
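On the question above of how to verify whether moving a batch to the GPU is itself the problem, one check is to bypass Lightning and copy a few batches manually with synchronous kernel launches; a minimal sketch (importing TFDataset from train/run.py and assuming each batch is a tuple of tensors):

```python
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # surface CUDA errors at the failing call

import torch
from torch.utils.data import DataLoader
from run import TFDataset  # assumption: run from the train/ folder where run.py lives

loader = DataLoader(TFDataset(split="train"), batch_size=1, num_workers=0)
for i, batch in enumerate(loader):
    # Assuming the batch is a tuple/list of tensors; move each one to the GPU.
    batch = [x.cuda() if torch.is_tensor(x) else x for x in batch]
    torch.cuda.synchronize()  # force any asynchronous transfer error to show up here
    if i == 9:
        break
print("Moved 10 batches to the GPU without errors")
```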
Dependency versions:
torch==2.1.0
antialiased-cnns==0.3
pytorch-msssim==1.0.0
numpy==1.24.3
torchvision==0.16.0
pytorch-lightning==2.0.9.post0
einops==0.7.0
huggingface-hub==0.17.3
tensorflow==2.13.1