Closed: phtu-cs closed this issue 3 years ago.
Hey @tph9608, I re-cloned the code from GitHub to check whether I had pushed a bug by mistake, but everything seems to work as expected. Could you share more information about what changes you made to the code before you started the training process?
Thank you for your reply. I modified run.py lines 93 and 99, changing "log_save_interval" and "early_stop_callback" to "log_every_n_steps" and "callbacks", because "log_save_interval" and "early_stop_callback" are unexpected keyword arguments on my machine. This is probably due to a different pytorch_lightning version. I also used CUDA_VISIBLE_DEVICES=0 instead of 1.
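For reference, the edit just renames the Trainer keyword arguments. A minimal sketch, assuming run.py constructs its Trainer roughly like this (the concrete values are illustrative, not the ones from the repository):

```python
from pytorch_lightning import Trainer

# pytorch_lightning 0.9.x (what run.py was written against):
# trainer = Trainer(log_save_interval=100, early_stop_callback=False)

# Newer releases dropped those keywords, so the equivalent call becomes:
trainer = Trainer(
    log_every_n_steps=100,  # replaces log_save_interval
    callbacks=[],           # replaces early_stop_callback (empty list = no early-stopping callback)
)
```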
OK. Everything works fine if I install pytorch_lightning==0.9.0. But if I install pytorch_lightning with pip without pinning the version, it does not work.
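A quick way to confirm which release an environment actually picked up (a trivial check, not part of the repository):

```python
import pytorch_lightning as pl

# Training only behaves as expected on the pinned release.
print(pl.__version__)  # expect "0.9.0" after installing pytorch_lightning==0.9.0
```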
Hi Pradyumna,
Thank you very much for sharing your code. This will definitely inspire a lot of future work. I ran into an issue when running your code and hope to get your help when you have time.
I ran the training command, but it logs the following:
```
/home/phtu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py:102: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 32 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 0: 100%|████████████| 13/13 [00:00<00:00, 179.52it/s, loss=nan, v_num=110]
Traceback (most recent call last):
  File "run.py", line 103, in <module>
    runner.fit(experiment)
  File "/home/phtu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/home/phtu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/home/phtu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/home/phtu/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/phtu/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/phtu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/home/phtu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
    self.train_loop.run_training_epoch()
  File "/home/phtu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 566, in run_training_epoch
    self.on_train_epoch_end(epoch_output)
  File "/home/phtu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 606, in on_train_epoch_end
    training_epoch_end_output = model.training_epoch_end(processed_epoch_output)
  File "/home/phtu/Research/Meta/Im2Vec/experiment.py", line 115, in training_epoch_end
    avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
RuntimeError: stack expects a non-empty TensorList
```

It seems the length of the var "outputs" is 0. Do you know a possible reason for this issue? Thank you.
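For context, the failing line is the aggregation inside training_epoch_end in experiment.py. A guarded version would look roughly like this; the class name, metric name, and self.log call are illustrative assumptions, not the repository's actual code:

```python
import torch
import pytorch_lightning as pl

class Experiment(pl.LightningModule):
    def training_epoch_end(self, outputs):
        # On newer pytorch_lightning releases `outputs` can arrive empty here,
        # which is exactly what makes torch.stack raise
        # "stack expects a non-empty TensorList".
        if not outputs:
            return  # skip the epoch-level aggregation instead of crashing
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        self.log('avg_train_loss', avg_loss)  # hypothetical metric name
```

This only avoids the crash; the empty list itself suggests the training_step outputs are not being collected the way the 0.9.0 API did.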