kwotsin / transfer_learning_tutorial

A guide to transfer learning with inception-resnet-v2.

Why is the result zero on other datasets? Final Accuracy: 0.0 #9

Closed fanyuzeng closed 7 years ago

fanyuzeng commented 7 years ago

I converted the flowers dataset to TFRecords as your GitHub repository shows, and training ran correctly. I then switched to another dataset (17flowers), which has the following structure:

    flowers\
        flower_photos\
            0\
                ....jpg
                ....jpg
                ....jpg
            1\
                ....jpg
            2\
                ....jpg
            3\
                ....jpg
            .
            .
            .
            16\
                ....jpg

The TFRecord files were generated correctly. I then updated the relevant directories in the code and also set 'num_classes = 17'. However, the result is as follows:

    /usr/bin/python2.7 /home/cr/PycharmProjects/transferLearning/train_flowers.py
    2017-07-19 22:25:03.216673: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
    2017-07-19 22:25:03.216740: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
    2017-07-19 22:25:03.216760: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
    2017-07-19 22:25:03.581012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
    name: TITAN X (Pascal)
    major: 6 minor: 1 memoryClockRate (GHz) 1.531
    pciBusID 0000:03:00.0
    Total memory: 11.90GiB
    Free memory: 11.41GiB
    2017-07-19 22:25:03.581046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
    2017-07-19 22:25:03.581054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
    2017-07-19 22:25:03.581067: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:03:00.0)
    INFO:tensorflow:Restoring parameters from ./preTrainModels/inception_resnet_v2_2016_08_30.ckpt
    2017-07-19 22:25:27.233384: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 1901 get requests, put_count=1100 evicted_count=1000 eviction_rate=0.909091 and unsatisfied allocation rate=1
    2017-07-19 22:25:27.233436: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_sizelimit from 100 to 110
    INFO:tensorflow:Starting standard services.
    INFO:tensorflow:Saving checkpoint to path ./log/model.ckpt
    INFO:tensorflow:Starting queue runners.
    INFO:tensorflow:Final Loss: Tensor("softmax_cross_entropy_loss/value:0", shape=(), dtype=float32)
    INFO:tensorflow:global_step/sec: 0
    INFO:tensorflow:Final Accuracy: 0.0
    INFO:tensorflow:Finished training! Saving model to disk now.
    INFO:tensorflow:global_step/sec: 0
    Process finished with exit code 0

How can I resolve this problem? Thank you very much!

kwotsin commented 7 years ago

It seems that your model didn't train at all. Normally, some training steps should appear in the log. Can you check the TFRecord file sizes to make sure they are not zero? It could also be an error in the training loop that prevents training from running at all.
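One way to check the file sizes is a short script like the sketch below. It is not part of the tutorial code: the function names and the `*.tfrecord` glob pattern are assumptions for illustration, so adjust the pattern to whatever your conversion script actually writes.

```python
import glob
import os


def tfrecord_sizes(dataset_dir, pattern="*.tfrecord"):
    """Return {file name: size in bytes} for files in dataset_dir matching pattern.

    The default pattern is an assumption about how the files are named.
    """
    paths = sorted(glob.glob(os.path.join(dataset_dir, pattern)))
    return {os.path.basename(p): os.path.getsize(p) for p in paths}


def report(dataset_dir, pattern="*.tfrecord"):
    """Print one line per matching file and flag any file that is empty."""
    sizes = tfrecord_sizes(dataset_dir, pattern)
    if not sizes:
        print("No files matching %r in %r" % (pattern, dataset_dir))
    for name, size in sizes.items():
        flag = "EMPTY" if size == 0 else "ok"
        print("%-45s %12d bytes  %s" % (name, size, flag))
    return sizes
```

Calling `report` on the directory that holds your TFRecord files lists each file with its size. An empty listing or a file flagged `EMPTY` would point to the conversion step rather than the training script.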

fanyuzeng commented 7 years ago

Many thanks. The TFRecord files are not empty.
I also trained on another dataset (emotions), and the result is zero there as well. I don't know how to solve this problem. Could you give me some advice?

kwotsin commented 7 years ago

It is hard to tell without seeing your code, since the program runs to completion. Could you post your code in a Gist? I would suggest using print statements to debug and to check whether your training loop is running at all.

fanyuzeng commented 7 years ago

OK. The code is the same as yours; I only tuned some parameters. I will post my code soon.

fanyuzeng commented 7 years ago

Here is the file. Most of the code is the same as yours. Many thanks, and sorry for uploading it only now. https://github.com/fanyuzeng/transferlearning_InceptionResnetV2/blob/master/tflearn_Inception_resnetV2.py

fanyuzeng commented 7 years ago

Could you give me some advice?

kwotsin commented 7 years ago

I believe the error comes from num_samples being 0. You can print it to verify that num_samples = 0. The problem comes from this line: https://github.com/fanyuzeng/transferlearning_InceptionResnetV2/blob/master/tflearn_Inception_resnetV2.py#L87

The code searches for the TFRecord files in order to count the number of samples in them. Since the default name it looks for is 'flowers', your TFRecord files are not counted at all, resulting in num_samples = 0. Because the training loop depends on the value of num_samples, it doesn't run at all. I have made some changes to the code, so you can change file_pattern_for_counting to your TFRecord name, for example 'emotions' instead of 'flowers'. It should work now.
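To illustrate the failure mode, here is a minimal sketch that avoids TensorFlow entirely. It assumes the files are selected by a file name prefix and that the number of steps is derived from the sample count. Only `file_pattern_for_counting` and `num_samples` come from the discussion above; the function names, `batch_size`, `num_epochs`, and the step formula are assumptions for illustration and may not match the actual script.

```python
import os


def select_tfrecords(dataset_dir, file_pattern_for_counting):
    """Return paths of files in dataset_dir whose names start with the prefix."""
    return sorted(
        os.path.join(dataset_dir, name)
        for name in os.listdir(dataset_dir)
        if name.startswith(file_pattern_for_counting)
    )


def total_training_steps(num_samples, batch_size, num_epochs):
    """Loop iterations when steps are derived from the sample count (assumed formula)."""
    steps_per_epoch = num_samples // batch_size
    return steps_per_epoch * num_epochs


def check_before_training(dataset_dir, file_pattern_for_counting):
    """Fail loudly instead of silently training for zero steps."""
    matched = select_tfrecords(dataset_dir, file_pattern_for_counting)
    if not matched:
        raise ValueError(
            "No files in %r start with %r; num_samples would be 0"
            % (dataset_dir, file_pattern_for_counting)
        )
    return matched
```

With a prefix that matches no files, nothing is counted, `total_training_steps(0, batch_size, num_epochs)` is 0, and a loop over that range never executes, which is consistent with a log that jumps straight to the final accuracy. A guard such as `check_before_training` turns that silent case into an explicit error.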

fanyuzeng commented 7 years ago

Great, it works now! It's very kind of you, thank you very much!