stanfordnlp / mac-network

Implementation for the paper "Compositional Attention Networks for Machine Reasoning" (Hudson and Manning, ICLR 2018)
Apache License 2.0

Interpretability #14

Closed ThierryDeruyttere closed 5 years ago

ThierryDeruyttere commented 5 years ago

Hi there!

First of all thanks for publishing the code! I really enjoy this work :-). I would like to ask if you could maybe share the exact parameters you used to create the network that produces this: https://camo.githubusercontent.com/e9e9464bfc10736d86b150ada2d8f68e74d3afae/68747470733a2f2f63732e7374616e666f72642e6564752f70656f706c652f646f72617261642f6d61632f696d67732f76697375616c2e706e67

Thanks in advance!

dorarad commented 5 years ago

Thanks a lot! :) Since I made the visualization you mention in the paper, the code has gone through refactoring and clean-up, so it's not strictly the exact same code I used then, but the current version should definitely work just as well. The command is:

python main.py --expName "experiment-interpretable" --train --testedNum 10000 --epochs 25 --netLength 4 @configs/args1.txt

args1.txt is the exact configuration described in the paper. 3-6 cells (netLength) should produce the most interpretable attentions, whereas 8-16 give a bit higher overall accuracy but are naturally less interpretable since there are more steps.
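For example, to compare the two regimes you could run the same command with only the cell count changed (the experiment names here are just placeholders):

python main.py --expName "interp-len4" --train --testedNum 10000 --epochs 25 --netLength 4 @configs/args1.txt
python main.py --expName "interp-len12" --train --testedNum 10000 --epochs 25 --netLength 12 @configs/args1.txt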

Hope it helps!

ThierryDeruyttere commented 5 years ago

Hi @dorarad , thanks for the quick answer

I had already tested it with those exact arguments but just wanted to be sure. In my first test the validation accuracy was quite low, so I took a fresh clone of the repo and retrained the network with the command you provided. Currently, at epoch 24, I have the following values:

Training Loss: 0.8928986946372279, Training accuracy: 0.5416270827112998
Training EMA Loss: 0.8324600319729415, Training EMA accuracy: 0.5857
Validation Loss: 1.0158769098195162, Validation accuracy: 0.4699

Are these values OK? The validation accuracy seems quite low compared to what was reported in the paper. Could I maybe ask you to check this too? Thanks in advance!

PS: Could you maybe also reopen this issue? :) thanks!

Edit: Here are the values for epoch 25

Training Loss: 0.8776571237122207, Training accuracy: 0.5537972739571622
Training EMA Loss: 0.8108495558482388, Training EMA accuracy: 0.6003
Validation Loss: 1.0294084859616828, Validation accuracy: 0.4673

ThierryDeruyttere commented 5 years ago

After further thought, maybe reopening this issue was not the best idea. Would you prefer that I create a new issue for this?

dorarad commented 5 years ago

Hi, sorry for closing! The values are definitely not what they are supposed to be; it looks like a bug in one of the settings of args1, and I will look into that! In the meantime, I think it may be worth trying the standard args.txt. I'm pretty sure the attention maps should be good with those settings as well (although I will have to verify).
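E.g., the same command as above, just pointing at the standard config (the experiment name is up to you):

python main.py --expName "experiment-standard" --train --testedNum 10000 --epochs 25 --netLength 4 @configs/args.txt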

ThierryDeruyttere commented 5 years ago

I will give the standard args a try and report back :)

dorarad commented 5 years ago

awesome thanks a lot! :)

ThierryDeruyttere commented 5 years ago

So the network has trained for 6 epochs already, and these are the current scores:

Training Loss: 0.9706893796575977, Training accuracy: 0.4714688373674443
Training EMA Loss: 0.9560210005251947, Training EMA accuracy: 0.485
Validation Loss: 0.962439235051473, Validation accuracy: 0.4775

So I think some issue might have crept in during the refactoring. Could you please check this on your end?

dorarad commented 5 years ago

Alright, thanks a lot for pointing that out! Clearly something got messed up; I will check and get back to you as soon as possible! (It worked for sure with the standard configuration args.txt a few months ago, after the refactoring, and I haven't made many changes since then, so I believe it shouldn't be too hard to find the problem. Will check!)

ThierryDeruyttere commented 5 years ago

I know 😄 I had a working copy but then I had to move servers and lost it. By the way, I have noticed that the lr always stays at 0.0001; in the previous version (that worked), the lr went down to 1.25e-5 by epoch 25, so maybe the error is there.

dorarad commented 5 years ago

Thanks for the info! The lr gets reduced when the val scores don't improve, so that's not the reason, but I'm sure I'll find the problem in no time! :)
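For reference, the decay logic is roughly the following (a minimal sketch for illustration, not the exact code in this repo):

```python
# Minimal sketch of plateau-based lr decay (illustrative only, not the
# repo's exact implementation): halve the lr whenever validation
# accuracy fails to improve, down to some floor.
def maybe_decay_lr(lr, val_acc, best_val_acc, decay=0.5, min_lr=1e-5):
    if val_acc <= best_val_acc:       # no improvement on validation
        lr = max(lr * decay, min_lr)  # reduce lr, but keep it above a floor
    return lr, max(best_val_acc, val_acc)
```

So if the lr never moves from 0.0001, it just means validation kept (nominally) improving under that configuration, which is consistent with the bug being elsewhere.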

ThierryDeruyttere commented 5 years ago

Any update on this? :P

dorarad commented 5 years ago

I really apologize for the delay! Things are a little busy right now, so I haven't yet found time to look into this, but I will update as soon as I resolve it!

ThierryDeruyttere commented 5 years ago

Hi, as it's been a long time now, I was wondering if you had any update on this?

dorarad commented 5 years ago

Hey, sorry, not yet :/ I know it's really been a long time; I definitely plan to look into it but haven't had the time yet, I do apologize! It may take a bit longer, but I will definitely update when I have new info! If you need it in the meantime, maybe a good solution is to simply pull a prior version from the time when it worked for you, e.g. https://github.com/stanfordnlp/mac-network/tree/0085972777113170563f6c247dbdc82f16277799 ?

I went over all the commits since then, and there are only very slight typo fixes in the code on the master branch, so if an earlier version worked for you then, it has to work again; both the data and the model have stayed the same since. Sorry that I don't have a better solution currently!

ThierryDeruyttere commented 5 years ago

Hi!

I'll clone the version you just proposed and try it again. I'll let you know in a couple of minutes! The reason I ask is that we're creating a new dataset and would like to use your model, together with other models, as a baseline, so I just want to make sure that the implementation is correct.

dorarad commented 5 years ago

Alright, sounds good! Let me know how it goes! :) I don't see a reason why it would stop working; maybe something got messed up in the feature files for CLEVR? Also, make sure you train over the full dataset (there's a flag for that).

I haven't used CLEVR for a while, but if you're interested in using the model for another dataset anyway, then there's even less of a problem: I'm currently using this model for other datasets such as VQA/GQA, and it works fine. In particular, there's the GQA branch with a more updated (though a bit less clean) version of the code, which I use for these datasets.

Looking forward to the release of the new dataset!

ThierryDeruyttere commented 5 years ago

I will :). Yeah, I'm using the command you provided earlier, so I suppose that should work, right? And the remark about the feature files might actually be a very good one. I'll let you know soon. I'm almost at the end of epoch 1 and I have 43% accuracy.

dorarad commented 5 years ago

How much time does each epoch take you? Hmm, maybe I know the source of your error: can you run a new experiment with the following flags: --generatedPrefix "newfeatures" --expName "newexp"? This will make sure you generate new feature files for the questions and don't work with temporary files that were generated by previous experiments.
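I.e., something along these lines (combining with the flags from before):

python main.py --expName "newexp" --generatedPrefix "newfeatures" --train --testedNum 10000 --epochs 25 --netLength 4 @configs/args.txt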

ThierryDeruyttere commented 5 years ago

The first epoch took 2347.42 seconds. I will run a new experiment with the proposed flags.

ThierryDeruyttere commented 5 years ago

I'm at epoch 5 and these are my results:

eb 5,10945 (699989 / 699989), t = 0.17 (0.00+0.17), lr 0.0001, l = 0.6771, a = 0.6250, avL = 0.9714, avA = 0.4727, g = 1.2093, emL = 0.9602, emA = 0.4678; newexp

So the average accuracy is still only 47%. I'm going to try to remake the features and report back.

dorarad commented 5 years ago

Alright, let me know how it goes. If the run of the validation feature extraction was incomplete, the model may be getting zeros instead of real features for some or all val images, so it's worth making new ones.
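If you want to verify that directly, something like the following should work (a rough sketch; the file path and the 'features' dataset name are assumptions based on the standard CLEVR feature extraction, so adjust to your setup):

```python
# Rough sanity check for incomplete feature extraction: count all-zero
# feature maps in the extracted validation features. The path and the
# 'features' dataset name are assumptions; adjust to your setup.
import h5py
import numpy as np

with h5py.File("CLEVR_v1/data/val.h5", "r") as f:
    feats = f["features"]  # e.g. shape (num_images, C, H, W)
    zeros = sum(not np.any(feats[i]) for i in range(feats.shape[0]))
    print(f"{zeros} / {feats.shape[0]} images have all-zero features")
```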

ThierryDeruyttere commented 5 years ago

Hi! I got very good news.

took 1722.15 seconds
Training Loss: 0.25686724214361023, Training accuracy: 0.8955540729925756
Training EMA Loss: 0.15401203917713385, Training EMA accuracy: 0.9398
Validation Loss: 0.17199736024297427, Validation accuracy: 0.9282
Training epoch 5...
eb 5,3000 (191928 / 699989), t = 0.22 (0.00+0.18), lr 0.0001, l = 0.0526, a = 0.9844, avL = 0.2241, avA = 0.9094, g = 3.2902, emL = 0.2145, emA = 0.9161; newexpp
saving weights
eb 5,6000 (383709 / 699989), t = 0.22 (0.00+0.19), lr 0.0001, l = 0.4795, a = 0.7969, avL = 0.2223, avA = 0.9101, g = 4.0710, emL = 0.2386, emA = 0.9063; newexpp
saving weights
eb 5,9000 (575549 / 699989), t = 0.19 (0.00+0.16), lr 0.0001, l = 0.2768, a = 0.9375, avL = 0.2216, avA = 0.9104, g = 5.6670, emL = 0.2320, emA = 0.9051; newexpp
saving weights
eb 5,10945 (699989 / 699989), t = 0.14 (0.00+0.14), lr 0.0001, l = 0.1774, a = 0.9531, avL = 0.2199, avA = 0.9111, g = 3.0387, emL = 0.2147, emA = 0.9142; newexpp
Restoring EMA weights
eb 5,164 (10000 / 10000), t = 0.06 (0.00+0.06), lr 0.0001, l = 0.1906, a = 0.9219, avL = 0.1316, avA = 0.9481, g = -1.0000, emL = 0.1325, emA = 0.9472; newexp
eb 5,165 (10000 / 10000), t = 0.05 (0.00+0.05), lr 0.0001, l = 0.0054, a = 1.0000, avL = 0.1587, avA = 0.9390, g = -1.0000, emL = 0.1421, emA = 0.9481; newexp
Restoring standard weights

It finally got fixed! The issue was indeed with the features themselves. I'm glad we sorted this out.

dorarad commented 5 years ago

awesome!! Glad it got fixed! :)