pdebench / PDEBench

PDEBench: An Extensive Benchmark for Scientific Machine Learning

Darcy Flow Config Issues #48

Open arthurfeeney opened 1 year ago

arthurfeeney commented 1 year ago

I have run into two issues with darcy flow's config files:

  1. There are two config files: config_darcy.yaml and args/config_Darcy.yaml. The documentation points to config_Darcy.yaml (capital D), but config_darcy.yaml (lowercase d) seems newer and more correct. Should the newer file fully replace the old one?

  2. config_darcy.yaml works with FNO but fails with Unet. By default, config_darcy.yaml sets initial_step=1 and t_train=1. I believe this is an error: the AR loops (here and here) run from initial_step to t_train, so with these values the range is empty and the loop body never executes. The resulting error is confusing. The loss is initialized as a Python int, and since the empty loop never adds anything to it, it is still an int when loss.item() is called:

Unet
Epochs = 500, learning rate = 0.001, scheduler step = 100, scheduler gamma = 0.5
Spatial Dimension 2
Total parameters = 7762465
start training...
Error executing job with overrides: []
Traceback (most recent call last):
  File "/dfs6/pub/afeeney/opensource/PDEBench/pdebench/models/train_models_forward.py", line 199, in main
    run_training_Unet(
  File "/data/homezvol2/afeeney/.conda/envs/pdebench/lib/python3.10/site-packages/pdebench/models/unet/train.py", line 414, in run_training
    train_l2_step += loss.item()
AttributeError: 'int' object has no attribute 'item'

I was able to get it running by setting t_train=2. I don't fully follow how the Darcy setup works, so I'm not sure whether that is a correct fix.
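
The failure mode described in this comment can be reproduced without PDEBench or PyTorch. The following is a minimal sketch, not the PDEBench training code: `FakeLoss` and `accumulate_loss` are hypothetical names, and `FakeLoss` only stands in for a scalar tensor that supports addition and `.item()`. It assumes, as the comment states, that the loss starts as a Python int and is accumulated inside a loop over `range(initial_step, t_train)`.

```python
class FakeLoss:
    """Hypothetical stand-in for a scalar tensor: supports + and .item()."""

    def __init__(self, value):
        self.value = float(value)

    def __add__(self, other):
        other_value = other.value if isinstance(other, FakeLoss) else other
        return FakeLoss(self.value + other_value)

    # Lets `0 + FakeLoss(...)` work, as it does for `0 + tensor`.
    __radd__ = __add__

    def item(self):
        return self.value


def accumulate_loss(initial_step, t_train):
    """Accumulate one loss term per autoregressive step (illustrative only)."""
    loss = 0  # starts as a plain int, as described in the issue
    for _ in range(initial_step, t_train):
        loss += FakeLoss(1.0)  # after the first step, loss is a FakeLoss
    return loss


if __name__ == "__main__":
    # t_train=2 gives one loop iteration, so .item() is available.
    print(accumulate_loss(initial_step=1, t_train=2).item())

    # initial_step == t_train gives an empty range; loss is still an int.
    try:
        accumulate_loss(initial_step=1, t_train=1).item()
    except AttributeError as err:
        print(err)
```

With `initial_step=1` and `t_train=2` the loop runs exactly once, which is consistent with the workaround above making the error disappear. Whether a single autoregressive step is the intended behavior for the Darcy dataset is a separate question that this sketch does not answer.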

qwerfdsadad commented 6 months ago

Have you managed to fix this bug? I can't run either FNO or Unet; both hit the same error you reported. The dataset comes from the data_download folder.

FNO

FNO
Epochs = 30, learning rate = 0.001, scheduler step = 100, scheduler gamma = 0.5
FNODatasetSingle
/home/dp/miniconda3/envs/pdebench/lib/python3.9/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /home/builder/cbouss/pytorch/croot/pytorch_1685629640362/work/aten/src/ATen/native/TensorShape.cpp:3190.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Spatial Dimension 2
Total parameters = 465557
Error executing job with overrides: ['+args=config_Darcy.yaml', '++args.filename=2D_DarcyFlow_beta10.0_Train.hdf5', '++args.model_name=FNO']
Traceback (most recent call last):
  File "/home/dp/PDEBench/pdebench/models/train_models_forward.py", line 166, in main
    run_training_FNO(
  File "/home/dp/miniconda3/envs/pdebench/lib/python3.9/site-packages/pdebench/models/fno/train.py", line 227, in run_training
    train_l2_step += loss.item()
AttributeError: 'int' object has no attribute 'item'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Unet

Unet
Epochs = 30, learning rate = 0.001, scheduler step = 100, scheduler gamma = 0.5
Spatial Dimension 2
Total parameters = 7765057
start training...
Error executing job with overrides: ['+args=config_Darcy.yaml', '++args.filename=2D_DarcyFlow_beta10.0_Train.hdf5', '++args.model_name=Unet']
Traceback (most recent call last):
  File "/home/dp/PDEBench/pdebench/models/train_models_forward.py", line 200, in main
    run_training_Unet(
  File "/home/dp/miniconda3/envs/pdebench/lib/python3.9/site-packages/pdebench/models/unet/train.py", line 414, in run_training
    train_l2_step += loss.item()
AttributeError: 'int' object has no attribute 'item'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
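
If the tracebacks above share the cause proposed in the original report (an empty range between initial_step and t_train), a pre-flight check on those two values would surface the problem with a clearer message before training starts. The sketch below is an assumption-laden illustration: `check_ar_window` is a hypothetical helper, not part of PDEBench, and the values loaded by config_Darcy.yaml are not confirmed in this thread.

```python
def check_ar_window(initial_step: int, t_train: int) -> int:
    """Return the number of autoregressive steps, or raise if there are none.

    Hypothetical helper: assumes the training loop iterates over
    range(initial_step, t_train), as described in the original report.
    """
    n_steps = len(range(initial_step, t_train))
    if n_steps == 0:
        raise ValueError(
            f"No autoregressive steps: initial_step={initial_step}, "
            f"t_train={t_train}. The loop over range(initial_step, t_train) "
            "is empty, so no loss would be accumulated."
        )
    return n_steps


if __name__ == "__main__":
    print(check_ar_window(initial_step=1, t_train=2))
    try:
        check_ar_window(initial_step=1, t_train=1)
    except ValueError as err:
        print(err)
```

Calling such a check with the values from the loaded config would distinguish a configuration problem from a genuine bug in the model code.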