uzh-rpg / rpg_event_representation_learning

Repo for learning event representations
MIT License

Validation accuracy lower than expected. #2

Open. Srutarshi opened this issue 4 years ago

Srutarshi commented 4 years ago

Hi,

I trained the model for 100 epochs with the default parameters provided in your GitHub code. I observe that my training loss decreases as expected.

However, the validation loss increases with each epoch, and the validation accuracy plateaus around 0.26, far below the reported 0.817. I used the N-Caltech101 dataset as mentioned in this repo.

Is there something I am doing wrong? Please let me know.

[training/validation loss curves attached]

eugenelyj commented 4 years ago

Same here. My test accuracy is also much lower than expected; something seems wrong.

danielgehrig18 commented 4 years ago

I will have a look and report back once it is fixed. It is probably a bug I introduced recently. Sorry for the wait.

danielgehrig18 commented 4 years ago

I ran the code cloned directly from GitHub, in a fresh virtual environment, with the dataset downloaded from our server. I was not able to reproduce your accuracy curves: in general, the model achieves >50% accuracy already after the first epoch.

Can you try to re-clone, redo the steps in the README.md, and let me know if it is still a problem?

Srutarshi commented 4 years ago

@danielgehrig18: I cloned the repo directly from GitHub and downloaded the N-Caltech101 dataset as described in README.md. I used a conda environment (instead of virtualenv) to install the dependencies in requirements.txt. Sadly, the validation loss still increases with each epoch instead of decreasing. I am not sure if this is a version problem.

Also, I see that the learning rate is 1e-4 in main.py, which differs from the learning rate in the ICCV '19 paper. I suspect lowering the learning rate may help.

danielgehrig18 commented 4 years ago

Can you share the output of pip list with me? Maybe there is a version mismatch somewhere. Alternatively, could you try with virtualenv, or even share the code that you are executing? Sorry, but otherwise I don't know where the bug could be.

Srutarshi commented 4 years ago

@danielgehrig18: I am attaching the output of pip list in the conda environment: pip_list.txt

I tried with virtualenv and it gave me the same results (training loss decreasing and validation loss increasing from epoch 1).

I am also attaching the files I am running: files_zip.zip

It seems to be an over-fitting problem. I think the only difference is that I am using a smaller batch_size of 8 or 12 (that is what my GPU memory allows), whereas the paper used a batch size of 60.

I tried learning rates of 1e-5 and 1e-6, a different optimizer (SGD instead of Adam), and a different feature extractor (ResNet-18 instead of ResNet-34), but I still see the same overfitting problem.

I am not sure what is going on.

danielgehrig18 commented 4 years ago

Thanks for the code. I was able to run it and noticed some differences from the reference code in the GitHub repo:

  1. the learning rate was 1e-5 instead of 1e-4 (in the repo)
  2. the lr decay was 0.8 every 2 epochs instead of 0.5 every 10 epochs (in the repo)

Once I changed these values back to the ones in the GitHub repo, I was able to achieve validation accuracy >50% after the first epoch. Can you verify that these changes also help in your case? Note that the reference training command uses a batch size of 4, which takes approx. 10 GB of memory. Maybe the lower learning rate and faster decay slowed learning, which might explain the stagnation in the validation loss. Is it possible that the code you used was older? I remember fixing some issues since uploading the code initially.
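
For reference, the repo defaults discussed here correspond to roughly the following PyTorch setup. This is a minimal sketch, not the repo's actual main.py: the model is a placeholder, and the training loop body is elided.

    import torch
    from torch import nn

    # Reference hyperparameters per this thread: lr = 1e-4, halved every
    # 10 epochs, batch size 4 (approx. 10 GB of GPU memory).
    model = nn.Linear(10, 101)  # placeholder; the repo builds an EST + ResNet-34 classifier

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

    for epoch in range(30):
        # ... run one training epoch (batch size 4) here ...
        scheduler.step()  # decays the learning rate by 0.5 every 10 epochs

By contrast, the settings in the shared code (lr=1e-5, gamma=0.8 every 2 epochs) shrink the learning rate much faster, which matches the stagnation described above.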

Srutarshi commented 4 years ago

@danielgehrig18: I changed the lr and lr-decay parameters to limit the over-fitting, which seemed to happen with the original values in the repo. I used the latest code a couple of times (cloning from git) but still got overfitting on the validation data. Since you published this in ICCV '19, would it be possible for you to share your EST + EV-FlowNet implementation?

juanmed commented 4 years ago
  1. the learning rate was 1e-5 instead of 1e-4 (in the repo)
  2. the lr decay was 0.8 every 2 epochs instead of 0.5 every 10 epochs (in the repo)

In any case, the reference training command uses batch size 4, which takes approx. 10 GB of memory.

@Srutarshi I tested the code with the changes mentioned above and, after 60 epochs with all other parameters at their default values, got Validation Loss 0.6318, Accuracy 0.8381. It took about 17 GB on a Titan RTX GPU. On the evaluation set I got Test Loss 0.6673, Test Accuracy 0.8360. This seems similar to the results shown in the repository's README.

SoikatHasanAhmed commented 4 years ago

  1. the learning rate was 1e-5 instead of 1e-4 (in the repo)
  2. the lr decay was 0.8 every 2 epochs instead of 0.5 every 10 epochs (in the repo)

I made the changes mentioned here, but it does not help to get a good result; the model is still over-fitting. [training curve screenshot attached] (note: using a Titan Xp, batch_size 8)

shenhaibo123 commented 2 years ago
  1. the learning rate was 1e-5 instead of 1e-4 (in the repo)
  2. the lr decay was 0.8 every 2 epochs instead of 0.5 every 10 epochs (in the repo)

In any case, the reference training command uses batch size 4, which takes approx. 10 GB of memory.

@Srutarshi I tested the code with the changes mentioned above and got Validation Loss 0.6318, Accuracy 0.8381 after 60 epochs, with all other parameters at their default values. It took about 17 GB on a Titan RTX GPU. On the evaluation set I got Test Loss 0.6673, Test Accuracy 0.8360. This seems similar to the results shown in the repository's README.

Hello, thank you for your code. I downloaded the latest code from GitHub, but there is a huge gap between my experimental results and the expected ones. I used the learning rate and update schedule you mentioned; here are my experimental records (the code is the same as on GitHub). What could the problem be? [training log screenshot attached]

Wuziyi616 commented 2 years ago

I also encountered this issue when I used the code. Then I looked into the data and realized that the train/test label mappings are inconsistent. So I modified this line to the following (basically fixing the label mapping to alphabetical order):

self.classes = sorted(listdir(root))

which solves the problem. Now I can get >50% accuracy after 1 epoch with the default params in the code (except that I use batch_size=16), and 85.12% accuracy on the test set after training for 30 epochs.
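
For context on why this one line matters: os.listdir returns entries in arbitrary, filesystem-dependent order, so the train and test splits (built as two separate dataset objects) can assign different integer labels to the same class name. A minimal sketch of the idea, where the EventDataset class and directory layout are illustrative rather than the repo's actual code:

    from os import listdir

    # Assumed layout: root/<class_name>/<event files>, one directory per class.
    class EventDataset:
        def __init__(self, root):
            # listdir() order is filesystem-dependent, so constructing this
            # dataset separately for train/ and test/ can yield two different
            # class -> label mappings. sorted() pins the mapping to
            # alphabetical order, making it identical across splits.
            self.classes = sorted(listdir(root))
            self.class_to_idx = {name: i for i, name in enumerate(self.classes)}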

@danielgehrig18, maybe you want to fix this? I think many people face this issue.

agirbau commented 1 year ago

@Wuziyi616's answer is the fix for the problem (maybe open a quick pull request with the one-line change).