openclimatefix / skillful_nowcasting

Implementation of DeepMind's Deep Generative Model of Radar (DGMR) https://arxiv.org/abs/2104.00954
MIT License

Trying to execute run.py in train folder renders an error #45

Open bhardwaj-garvit opened 1 year ago

bhardwaj-garvit commented 1 year ago

Hi, really great and helpful code! I was trying to run train.py on the nimrod-uk-1km test data and encountered the following error: "RuntimeError: Serialization of parametrized modules is only supported through state_dict()." I found a related PyTorch issue and downgraded torch to v1.12.0, but the error did not go away. PyTorch issue: https://github.com/pytorch/pytorch/issues/69413

Can you help debug this issue? I am planning to use this on another dataset.

[Screenshot of the terminal traceback ending in the RuntimeError above, 2023-02-01]

To Reproduce
Steps to reproduce the behavior:

  1. Install dependencies.
  2. Execute train/run.py; the error above appears in the terminal.
jacobbieker commented 1 year ago

Hi, are you using multiple GPUs? By default run.py tries to use 6 GPUs, though it should be changed to 1. As far as I have been able to tell, the spectrally normalized layers in PyTorch don't work in a multi-GPU setting. If you change it to 1 GPU, training should start.
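
For illustration, a minimal sketch of that change, assuming run.py constructs a PyTorch Lightning `Trainer` (the exact argument names depend on the Lightning version; older releases use `gpus=1` instead of `accelerator`/`devices`):

```python
import pytorch_lightning as pl

# Sketch (not the repo's actual code): restricting training to a single
# device sidesteps the multi-GPU serialization error with spectral norm.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,  # run.py's default is 6; older Lightning versions use gpus=1
)
# trainer.fit(model, datamodule=datamodule)  # model/datamodule as set up in run.py
```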

bhardwaj-garvit commented 1 year ago

I was using CPUs earlier; to sort out the issue I switched to 1 GPU, but training fills virtual memory up to 200 GB (my system's limit) and the dataloader worker is killed. Can you suggest a way to bypass this?
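
(Not from the thread, but one common first mitigation is to cap dataloader parallelism so fewer batches are buffered in RAM at once. A sketch, with a placeholder dataset standing in for the repo's TFDataset:)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; substitute the TFDataset used in this repo.
dataset = TensorDataset(torch.zeros(8, 1))

# Sketch: fewer workers and a small prefetch cap bound how many
# batches are held in memory at any one time while loading.
loader = DataLoader(
    dataset,
    batch_size=1,
    num_workers=2,      # each worker buffers its own prefetched batches
    prefetch_factor=2,  # batches prefetched per worker (requires num_workers > 0)
    pin_memory=False,
)
```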

Chevolier commented 4 months ago

I hit the same issue: memory keeps increasing to 256 GB during data loading until the process is killed by the system. Is there any solution?

Chevolier commented 4 months ago


Update: my problem was solved by setting streaming=True in load_dataset inside TFDataset for my own dataset, as follows. This way, the data are not loaded into memory up front.

```python
import torch
from datasets import load_dataset


class TFDataset(torch.utils.data.dataset.Dataset):
    def __init__(self, data_path, split):
        super().__init__()
        # self.reader = load_dataset(
        #     "openclimatefix/nimrod-uk-1km", "sample", split=split, streaming=True
        # )
        # streaming=True makes load_dataset return an iterable dataset that is
        # read lazily, so the data are not all loaded into memory first.
        self.reader = load_dataset(data_path, split=split, streaming=True)
```
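
For reference, a minimal standalone sketch of what streaming=True does in the Hugging Face `datasets` API: examples are fetched lazily on iteration instead of the full dataset being loaded first. (The dataset name and "sample" config come from the snippet above; split="train" is an assumption.)

```python
from datasets import load_dataset

# Sketch: with streaming=True, load_dataset returns an iterable dataset
# whose examples are streamed on demand rather than loaded up front.
ds = load_dataset(
    "openclimatefix/nimrod-uk-1km", "sample",
    split="train",  # assumed split name for illustration
    streaming=True,
)
first = next(iter(ds))  # only this single example is materialized in memory
```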