openclimatefix / skillful_nowcasting

Implementation of DeepMind's Deep Generative Model of Radar (DGMR) https://arxiv.org/abs/2104.00954
MIT License

TypeError: __init__() missing 2 required positional arguments: 'node_def' and 'op' #32

Closed: J-shel closed this issue 2 years ago

J-shel commented 2 years ago

**Describe the bug**
Hi, thank you for sharing your implementation of DGMR. I'm new to deep learning, but I'm very interested in it and am learning to use it in atmospheric science. When I run `run.py` under the `train` directory, I get the following message:

```
... ...
98.3 M     Trainable params
0          Non-trainable params
98.3 M     Total params
393.086    Total estimated model params size (MB)
```

```
Sanity Checking: 0it [00:00, ?it/s]
2022-07-21 01:24:47.641350: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
[the warning above repeats several more times with later timestamps]
Traceback (most recent call last):
  File "run.py", line 205, in <module>
    trainer.fit(model, datamodule)
  File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1345, in _run_train
    self._run_sanity_check()
  File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1413, in _run_sanity_check
    val_loop.run()
  File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 112, in advance
    batch = next(data_fetcher)
  File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 259, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 273, in _fetch_next_batch
    batch = next(iterator)
  File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 652, in __next__
    data = self._next_data()
  File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1347, in _next_data
    return self._process_data(data)
  File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1373, in _process_data
    data.reraise()
  File "/work2/04310/jshel/stampede2/usr/local/miniconda3/envs/dgmropenclimatefix/lib/python3.8/site-packages/torch/_utils.py", line 454, in reraise
    raise self.exc_type(message=msg)
TypeError: __init__() missing 2 required positional arguments: 'node_def' and 'op'
... ...
```

Please see the attached run.log file for the full log message.

**To Reproduce**
Steps to reproduce the behavior:

1. Go to the `train` directory.

2. Edit `run.py`. Since I'm using CPU, I changed the accelerator to "cpu":

   ```python
   trainer = Trainer(
       max_epochs=1000,
       logger=wandb_logger,
       callbacks=[model_checkpoint],
       gpus=6,
       precision=32,
       accelerator="cpu",
   )
   ```

3. Run `python run.py`.
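One thing worth noting about the config in step 2: `gpus=6` and `accelerator="cpu"` contradict each other, so for a CPU run the `gpus` argument should be dropped entirely. A hypothetical helper (purely illustrative, not part of the repo) that makes the inconsistency concrete:

```python
def sanitize_trainer_kwargs(**kwargs):
    """Drop GPU-related arguments when training on CPU.

    Hypothetical helper for illustration only: passing gpus=6 together
    with accelerator="cpu" is contradictory, so 'gpus' is removed here
    before the kwargs would be handed to pytorch_lightning's Trainer.
    """
    if kwargs.get("accelerator") == "cpu":
        kwargs.pop("gpus", None)
    return kwargs


cfg = sanitize_trainer_kwargs(
    max_epochs=1000, precision=32, gpus=6, accelerator="cpu"
)
print(sorted(cfg))  # 'gpus' is gone: ['accelerator', 'max_epochs', 'precision']
```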

**Expected behavior**
I'm not sure whether I trained the model on the radar data from the paper the right way, or how to use multiple CPUs. The README.md makes it very clear how to install the model and run it in a simple way. It would be nice to ship a very small sample of train/val/test radar data with the code, or to provide a link for downloading the train/val/test data manually, since that would make it much easier to see what the data really looks like and to understand the model.

**Additional context**
I attached the entire log file "run.log" and the list of packages I used, just in case. run.log pip_list.txt

jacobbieker commented 2 years ago

Hi,

Glad you like the repo! There is a small train/validation/test set located at "gs://dm-nowcasting-example-data/datasets/nowcasting_open_source_osgb/nimrod_osgb_1000m_yearly_splits/radar/20200718" in GCP. This issue seems to be caused by being unable to access that sample dataset. The run script uses this HuggingFace dataset script https://huggingface.co/datasets/openclimatefix/nimrod-uk-1km/blob/main/nimrod-uk-1km.py to load and process the data into the format DGMR expects. It shouldn't need any credentials, I think, since it's a public GCP bucket, but you might have to supply something?

J-shel commented 2 years ago

Hi, I tried downloading the data from GCP simply with "gsutil cp -R gs://dm-nowcasting-example-data ." and it succeeded; it didn't ask for any credentials. Now it's kind of confusing. Could you please take a look at the screenshot I attached? As you said, the run script uses nimrod-uk-1km.py to load and process the data, but I didn't find nimrod-uk-1km.py in my directory. Am I missing something? screenshot_1

jacobbieker commented 2 years ago

Yeah, the nimrod-uk-1km script is downloaded to the HuggingFace cache, usually somewhere under ~/.cache/huggingface/, and is loaded on the fly from HuggingFace, so it's not included in the repo.
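For anyone else hunting for the cached script, the default cache root can be computed like this (a sketch; the exact subdirectory layout below this path varies between versions of the datasets library):

```python
import os

# Default HuggingFace cache root (illustrative): the HF_HOME environment
# variable overrides the default of ~/.cache/huggingface, and the datasets
# library nests downloaded dataset scripts a few directories below it.
cache_root = os.environ.get(
    "HF_HOME", os.path.join(os.path.expanduser("~"), ".cache", "huggingface")
)
print(cache_root)
```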

J-shel commented 2 years ago

Yes, I found it! I have no idea why that happened, but when I moved to a GPU machine, I didn't get that error any more. However, I ran into a new error, shown below. screentshot2

jacobbieker commented 2 years ago

Yeah, sorry. I've been trying to get it to run on multiple GPUs, but there seems to be an issue with parametrized modules that currently prevents that. So if you change gpus to 1 it should work; you'll probably have to reduce the batch size as well.
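Concretely, the change would look something like this (a sketch against run.py's Trainer call; the `gpus` argument name follows the pytorch_lightning version shown in the traceback above, which still accepts it):

```python
from pytorch_lightning import Trainer

# Single-GPU configuration (illustrative): multi-GPU training currently
# fails because of the parametrized-modules issue, so pin training to
# one GPU and shrink the batch size if you run out of memory.
trainer = Trainer(
    max_epochs=1000,
    precision=32,
    gpus=1,
)
```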

J-shel commented 2 years ago

Got it! Thank you very very much! O(∩_∩)O