state-spaces / s4

Structured state space sequence models
Apache License 2.0

ValueError when running on Pathfinder #33

Closed andrewliu2001 closed 2 years ago

andrewliu2001 commented 2 years ago

Hi, I am getting the following error when trying to train S4 on the Pathfinder dataset. Any help would be greatly appreciated.

Traceback (most recent call last):
  File "/data/al451/state-spaces/train.py", line 553, in main
    train(config)
  File "/data/al451/state-spaces/train.py", line 498, in train
    trainer.fit(model)
  File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1172, in _run
    self._call_setup_hook()  # allow user to setup lightning_module in accelerator environment
  File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1492, in _call_setup_hook
    self._call_lightning_module_hook("setup", stage=fn)
  File "/home/al451/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1593, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/data/al451/state-spaces/train.py", line 56, in setup
    self.dataset.setup()
  File "/data/al451/state-spaces/src/dataloaders/datasets.py", line 1234, in setup
    dataset = PathFinderDataset(self.data_dir, transform=self.default_transforms())
  File "/data/al451/state-spaces/src/dataloaders/datasets.py", line 1130, in __init__
    path_list = sorted(
  File "/data/al451/state-spaces/src/dataloaders/datasets.py", line 1132, in <lambda>
    key=lambda path: int(path.stem),
ValueError: invalid literal for int() with base 10: '._142'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
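
For reference, the failing frame is the one that sorts the Pathfinder files by interpreting each file stem as an integer. A minimal sketch of that pattern (hypothetical paths; only the int(path.stem) sort key comes from the traceback) shows why a stray file such as ._142.npy breaks it:

    from pathlib import Path

    # Hypothetical file listing; the real code globs the dataset directory.
    paths = [Path("metadata/0.npy"), Path("metadata/142.npy"), Path("metadata/._142.npy")]

    # Same sort key as in src/dataloaders/datasets.py: every stem must parse as an integer.
    path_list = sorted(paths, key=lambda path: int(path.stem))
    # -> ValueError: invalid literal for int() with base 10: '._142'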

albertfgu commented 2 years ago

Are you running it using the command in the README? What are your torch and pytorch-lightning versions?

andrewliu2001 commented 2 years ago

Hi, yes I ran CUDA_VISIBLE_DEVICES=0,5,6,7 python -m train wandb=null experiment=s4-lra-pathx. I am using torch 1.11.0+cu113 and pytorch-lightning 1.6.3.

albertfgu commented 2 years ago

Could you try pytorch-lightning==1.5.10? We've had issues with 1.6 and later.

andrewliu2001 commented 2 years ago

Hi, I tried using pytorch-lightning 1.5.10 and I still get the same issue.

albertfgu commented 2 years ago

It seems like your data might not be set up correctly. If you look at the line throwing the error, it expects the data to look like data/pathfinder/pathfinder32/curv_contour_length_14/metadata/{0,1,2,...}.npy. The error suggests it instead found files of the form ._142.npy. Can you check your data structure and re-download the data if necessary?
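
One possible cleanup, in case the stray files are macOS "._*" AppleDouble metadata (a common artifact of archives or copies that passed through macOS): list them under the data root and delete them before re-running. This is only a sketch; data_root is an assumed location and should be adjusted to wherever the Pathfinder data actually lives.

    from pathlib import Path

    # Assumed data root; adjust to your local setup.
    data_root = Path("data/pathfinder/pathfinder32")

    # "._*" files are macOS AppleDouble metadata, not part of the dataset.
    stray = sorted(data_root.rglob("._*"))
    print(f"Found {len(stray)} stray '._*' files")
    for p in stray:
        p.unlink()

If the directory contents still do not match the expected {0,1,2,...}.npy layout after that, re-downloading the data as suggested above is the safer route.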