Closed: hongdoki closed this issue 7 years ago
Hi @hongdoki
I corrected the pointers in the README, sorry about that.
The hyper parameters for the setup SVHN -> MNIST are:
The architecture that we used in the paper is svhn_model. Can you please try this one?
The other hyper params are correct, I think. For completeness, here is the list of hyper params that produced the state of the art:
"target_dataset": "mnist3", "walker_weight_envelope_delay": "500", "new_size": 32, "dataset": "svhn", "sup_per_batch": 100, "decay_steps": 9000, "unsup_batch_size": 1000, "sup_per_class": -1, "walker_weight_envelope_steps": 1, "walker_weight_envelope": "linear", "visit_weight_envelope": "linear", "architecture": "svhn_model", "visit_weight": 0.2, "max_steps": "12000"
Cheers, Philip
Thank you for the quick reply!
Now I can reproduce the result of the paper with your hyper parameters. :+1:
Actually, the reason I tried "mnist_model" is that the paper says all experiments used the network architecture below:
C(32; 3) -> C(32; 3) -> P(2) -> C(64; 3) -> C(64; 3) -> P(2) -> C(128; 3) -> C(128; 3) -> P(2) -> FC(128)
and "svhn_model" is
C(32; 3) -> C(32; 3) -> C(32; 3) -> P(2) -> C(64; 3) -> C(64; 3) -> C(64; 3) -> P(2) -> C(128; 3) -> C(128; 3) -> C(128; 3) -> P(2) -> FC(128)
I think the paper needs a correction there, or I misunderstood something.
Anyway, thank you again for sharing your hyper parameters!
Well spotted, thank you. Happy that it is working for you.
Hi @hongdoki and @haeusser
Sorry to be annoying, but per my other issue, I'm still having trouble replicating any of the results reported in the paper.
It's strange: I clone the repo and run the exact parameters mentioned above in @haeusser 's reply (see below for command) yet I only get an accuracy of about 0.89. Is there something I'm missing here? Maybe I'm not running the eval script correctly (second command below)?
Any thoughts, or exact instructions on how to replicate any of the results from the paper, would be greatly appreciated.
Liam
# For SVHN to MNIST
CUDA_VISIBLE_DEVICES=3 python semisup/train.py \
--target_dataset="mnist3" \
--walker_weight_envelope_delay=500 \
--new_size=32 \
--dataset="svhn" \
--sup_per_batch=100 \
--decay_steps=9000 \
--unsup_batch_size=1000 \
--sup_per_class=-1 \
--walker_weight_envelope_steps=1 \
--walker_weight_envelope="linear" \
--visit_weight_envelope="linear" \
--architecture="svhn_model" \
--visit_weight=0.2 \
--max_steps=12000 \
--logdir=./log/svhn_to_mnist/reproduce
CUDA_VISIBLE_DEVICES=3 python semisup/eval.py \
--target_dataset="mnist3" \
--walker_weight_envelope_delay=500 \
--new_size=32 \
--dataset="svhn" \
--sup_per_batch=100 \
--decay_steps=9000 \
--unsup_batch_size=1000 \
--sup_per_class=-1 \
--walker_weight_envelope_steps=1 \
--walker_weight_envelope="linear" \
--visit_weight_envelope="linear" \
--architecture="svhn_model" \
--visit_weight=0.2 \
--max_steps=12000 \
--logdir=./log/svhn_to_mnist/reproduce
Which versions of TensorFlow, CUDA and cuDNN are you using?
... and does the eval job really evaluate the latest checkpoints? Have you tried to run the same experiment a few times? Usually the random initialization should not have a big effect.
I was using TensorFlow 1.2, but also tried 1.1, with CUDA 7.5 and cuDNN 5.0.
Yep I have tried running it a few times, always with the same lackluster results. Maybe I'll try it on an AWS instance...
@nlml Hm, that doesn't sound right. It is weird that it works for everyone else. Can you double check that the data sets are loaded correctly? Does your system have any special settings for floats? Which OS are you using, after all?
Unfortunately I don't have access to the GPU I was using any more. Can't remember exactly which OS, but it was a fairly standard Linux server setup, so I suppose a recent Ubuntu? Wasn't running any special settings for floats.
When I have some time I will try on an AWS instance. If you get a chance could you maybe confirm exact commands/instructions and the results I should get after X iterations for really any of the notable results you reported? Or if what I've posted above is fine, then just confirm that?
This is the set of flags for the run that produced the result in the paper:
{ "target_dataset": "mnist3", "walker_weight_envelope_delay": "500", "max_checkpoints": 5, "new_size": 32, "dataset": "svhn", "sup_per_batch": 100, "decay_steps": 9000, "unsup_batch_size": 1000, "sup_per_class": -1, "walker_weight_envelope_steps": 1, "walker_weight_envelope": "linear", "visit_weight_envelope": "linear", "architecture": "svhn_model", "visit_weight": 0.2, "max_steps": "12000" }
Great, thanks again :+1:
Of course. I hope we can track down the problem!
Hi again @haeusser
So to test this in a different environment, I instantiated this AWS instance image with TensorFlow etc. on a p2.xlarge instance (which has one Tesla K80 GPU).
I then SSH to the instance, clone the repo and alter data dirs:
cd /home/ubuntu;
git clone https://github.com/haeusser/learning_by_association.git;
cd learning_by_association/;
perl -i -pe 's/work\/haeusser\/data/home\/ubuntu\/datasets/g' semisup/tools/data_dirs.py;
Make the required changes to ~/.bashrc:
echo -e "\n\nexport PYTHONPATH=/home/ubuntu/learning_by_association:\$PYTHONPATH" >> ~/.bashrc;
source ~/.bashrc;
Download datasets:
mkdir /home/ubuntu/datasets/;
mkdir /home/ubuntu/datasets/svhn/;
mkdir /home/ubuntu/datasets/mnist/;
wget http://ufldl.stanford.edu/housenumbers/test_32x32.mat -O /home/ubuntu/datasets/svhn/test_32x32.mat;
wget http://ufldl.stanford.edu/housenumbers/train_32x32.mat -O /home/ubuntu/datasets/svhn/train_32x32.mat;
wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz -O /home/ubuntu/datasets/mnist/train-images-idx3-ubyte.gz;
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz -O /home/ubuntu/datasets/mnist/train-labels-idx1-ubyte.gz;
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz -O /home/ubuntu/datasets/mnist/t10k-images-idx3-ubyte.gz;
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz -O /home/ubuntu/datasets/mnist/t10k-labels-idx1-ubyte.gz;
Then run:
python semisup/train.py \
--target_dataset="mnist3" \
--walker_weight_envelope_delay=500 \
--new_size=32 \
--dataset="svhn" \
--sup_per_batch=100 \
--decay_steps=9000 \
--unsup_batch_size=1000 \
--sup_per_class=-1 \
--walker_weight_envelope_steps=1 \
--walker_weight_envelope="linear" \
--visit_weight_envelope="linear" \
--architecture="svhn_model" \
--visit_weight=0.2 \
--max_steps=12000 \
--logdir=./log/svhn_to_mnist/reproduce
And eval script:
python semisup/eval.py \
--target_dataset="mnist3" \
--walker_weight_envelope_delay=500 \
--new_size=32 \
--dataset="svhn" \
--sup_per_batch=100 \
--decay_steps=9000 \
--unsup_batch_size=1000 \
--sup_per_class=-1 \
--walker_weight_envelope_steps=1 \
--walker_weight_envelope="linear" \
--visit_weight_envelope="linear" \
--architecture="svhn_model" \
--visit_weight=0.2 \
--max_steps=12000 \
--logdir=./log/svhn_to_mnist/reproduce
...And so far I'm getting pretty similar results in TensorBoard to before -- accuracy of around 92% (to be fair, I'm only 3k iterations in so far, but it still seems a fair way off...)
So this should be quite reproducible now I think. Any ideas why my results are so far off? Do I need to be using python3 maybe?
Cheers, Liam
Hi @nlml
Alright, so I re-ran the training myself and everything seems fine. I uploaded the logs for you, including hyper params and TFEvents, so you can visualize the graph with TensorBoard: https://vision.in.tum.de/~haeusser/da_svhn_mnist.zip
The TensorFlow version was https://github.com/haeusser/tensorflow
I hope this is helpful! Philip
Thanks again - sorry to be annoying! I'll take a look and possibly try again with your tensorflow.
Liam
Hey again @haeusser
Thanks a lot again for all your help. I finally got it working :D My problem was I was evaluating on SVHN, not MNIST (see my eval code above.. doh).
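For anyone else landing on this plateau: my eval command above kept --dataset="svhn", so the eval job was scoring SVHN digits rather than MNIST. A corrected eval call would look something like the sketch below; note that passing "mnist3" as the --dataset value is my assumption, and the exact flag handling should be checked against semisup/eval.py:

```shell
# Sketch of a fixed eval command: point --dataset at the MNIST variant
# ("mnist3" here is an assumption) instead of "svhn"; keep the remaining
# flags as in the eval command earlier in this thread.
python semisup/eval.py \
--dataset="mnist3" \
--target_dataset="mnist3" \
--new_size=32 \
--architecture="svhn_model" \
--logdir=./log/svhn_to_mnist/reproduce
```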
Another question that's come up looking at the two papers: What is the difference in approach between Table 5 of the Learning by Association paper, and Table 2 of the Domain Adaptation paper, with regards to SVHN -> MNIST? In the former, you report an error of 0.5%, while in the latter it is 2.4%. My results (and yours) are in line with the latter. I can't seem to find what the difference in approach is in the 0.5% version, but presumably there is some major difference there.
Also, in the Domain Adaptation paper, you state that "The authors of [12] observed that higher order round trips do not improve performance." Where exactly is this stated in the Learning by Association paper? I can't seem to find this idea mentioned.
Thanks, Liam
I ran this code:
python semisup/train.py \
--target_dataset="mnist3" \
--walker_weight_envelope_delay=500 \
--new_size=32 \
--dataset="svhn" \
--sup_per_batch=100 \
--decay_steps=9000 \
--unsup_batch_size=1000 \
--sup_per_class=-1 \
--walker_weight_envelope_steps=1 \
--walker_weight_envelope="linear" \
--visit_weight_envelope="linear" \
--architecture="svhn_model" \
--visit_weight=0.2 \
--max_steps=12000 \
--logdir=./log/svhn_to_mnist/reproduce
And eval script:
python semisup/eval.py \
--target_dataset="mnist3" \
--walker_weight_envelope_delay=500 \
--new_size=32 \
--dataset="svhn" \
--sup_per_batch=100 \
--decay_steps=9000 \
--unsup_batch_size=1000 \
--sup_per_class=-1 \
--walker_weight_envelope_steps=1 \
--walker_weight_envelope="linear" \
--visit_weight_envelope="linear" \
--architecture="svhn_model" \
--visit_weight=0.2 \
--max_steps=12000 \
--logdir=./log/svhn_to_mnist/reproduce
The training part works, but the evaluation part does not. Am I doing anything wrong here?
I am getting this error:
INFO:tensorflow:Waiting for new checkpoint at ./log/svhn_to_mnist/reproduce/train
INFO:tensorflow:Timed-out waiting for a checkpoint.
Yes: the evaluation loop does not write any checkpoints itself; it waits for the train job to produce them. There might be many reasons. Maybe your disk is full, or you are running the script from a different directory. If the train job fills the entire GPU, the eval job might run on the CPU and hence be very slow. Can you paste the console output from the eval job?
Cheers, Philip
@haeusser Thank you so much for your kind reply. First I ran the semisup/train.py script, and after it completed, I ran semisup/eval.py.
Should I run these two scripts together?
My disk is not full, and I ran the script from the same directory. Should I change any directory path in the semisup/eval.py script?
Since you apparently solved the issue, as I infer from the other thread, could you quickly post what the problem was?
@haeusser Thanks. When I ran the two scripts train.py and eval.py together, it worked.
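For later readers, the point here is that eval.py runs an evaluation loop that waits for checkpoints written by train.py, so the two scripts are meant to run at the same time, not one after the other. A minimal sketch of the pattern (flags abbreviated; substitute the full flag set from the commands earlier in this thread):

```shell
# Launch training in the background; it periodically writes checkpoints
# into --logdir.
python semisup/train.py --dataset="svhn" --target_dataset="mnist3" \
--architecture="svhn_model" --logdir=./log/svhn_to_mnist/reproduce &
train_pid=$!

# Launch evaluation in the foreground; it polls the same logdir and
# evaluates each new checkpoint (and times out if none ever appear).
python semisup/eval.py --target_dataset="mnist3" \
--architecture="svhn_model" --logdir=./log/svhn_to_mnist/reproduce

# Wait for the background training job to finish.
wait "$train_pid"
```

Running the two jobs in separate terminals (or in a tmux/screen session) achieves the same thing.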
"Run the two scripts train.py and eval.py together" -- what does this mean? How do I run the two scripts together? Thank you! This is the result:
INFO:tensorflow:Waiting for new checkpoint at ./log/svhn_to_mnist/reproduce/train
INFO:tensorflow:Timed-out waiting for a checkpoint.
I can't find the accuracy.
Very sorry to be annoying! But I really don't know how to run your code normally.
Did you solve the problem?
Hello,
I am reproducing the results of the paper, specifically SVHN to MNIST. At first I tried to find the "{stl10,svhn,synth}_tools.py" files or the "package" flag in {train, eval}.py as written in the README, but I couldn't find them. Therefore, I made and executed the .sh files below for training and evaluating with the hyper parameters written in the paper.
During training, the total loss and the walker loss converged, but the evaluation accuracy after training was about 0.78, not the 0.976 reported in the paper.
Do you know what I am missing?
Thank you!