tensorflow / skflow

Simplified interface for TensorFlow (mimicking Scikit Learn) for Deep Learning
Apache License 2.0

Loss scores are different for contiguous run of fit() for 200 steps and 4 runs of fit() for 50 steps #88

Closed: olegarch closed this issue 8 years ago

olegarch commented 8 years ago

I am doing regression with a DNN.

Final MSE for contiguous run of 200 steps: 1.45781016655
Final MSE for 4 runs of 50 steps each: 1.44524233948

Score for contiguous run:

Step #1, epoch #1, avg. loss: 27.95941
Step #21, epoch #21, avg. loss: 5.64051
Step #41, epoch #41, avg. loss: 1.78990
Step #61, epoch #61, avg. loss: 1.53639
Step #81, epoch #81, avg. loss: 1.49865
Step #101, epoch #101, avg. loss: 1.48255
Step #121, epoch #121, avg. loss: 1.47312
Step #141, epoch #141, avg. loss: 1.46747
Step #161, epoch #161, avg. loss: 1.46394
Step #181, epoch #181, avg. loss: 1.46122

Score for 4 runs of 50 steps each:

Step #1, epoch #1, avg. loss: 27.95941
Step #6, epoch #6, avg. loss: 13.49244
Step #11, epoch #11, avg. loss: 4.11436
Step #16, epoch #16, avg. loss: 2.69326
Step #21, epoch #21, avg. loss: 2.26197
Step #26, epoch #26, avg. loss: 2.02976
Step #31, epoch #31, avg. loss: 1.79997
Step #36, epoch #36, avg. loss: 1.71287
Step #41, epoch #41, avg. loss: 1.61699
Step #46, epoch #46, avg. loss: 1.56702

Step #51, epoch #1, avg. loss: 1.52925
Step #56, epoch #6, avg. loss: 1.52344
Step #61, epoch #11, avg. loss: 1.51318
Step #66, epoch #16, avg. loss: 1.50661
Step #71, epoch #21, avg. loss: 1.50114
Step #76, epoch #26, avg. loss: 1.49584
Step #81, epoch #31, avg. loss: 1.49099
Step #86, epoch #36, avg. loss: 1.48698
Step #91, epoch #41, avg. loss: 1.48371
Step #96, epoch #46, avg. loss: 1.48097

Step #101, epoch #1, avg. loss: 1.47760
Step #106, epoch #6, avg. loss: 1.47609
Step #111, epoch #11, avg. loss: 1.47386
Step #116, epoch #16, avg. loss: 1.47201
Step #121, epoch #21, avg. loss: 1.47048
Step #126, epoch #26, avg. loss: 1.46914
Step #131, epoch #31, avg. loss: 1.46795
Step #136, epoch #36, avg. loss: 1.46686
Step #141, epoch #41, avg. loss: 1.46591
Step #146, epoch #46, avg. loss: 1.46506

Step #151, epoch #1, avg. loss: 1.46384
Step #156, epoch #6, avg. loss: 1.46348
Step #161, epoch #11, avg. loss: 1.46276
Step #166, epoch #16, avg. loss: 1.46212
Step #171, epoch #21, avg. loss: 1.46144
Step #176, epoch #26, avg. loss: 1.46086
Step #181, epoch #31, avg. loss: 1.46028
Step #186, epoch #36, avg. loss: 1.45976
Step #191, epoch #41, avg. loss: 1.45914
Step #196, epoch #46, avg. loss: 1.45857
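For reference, a minimal sketch of the comparison above (the dataset, layer sizes, and the use of skflow's continue_training flag are assumptions on my part; the actual script isn't posted here):

```python
import skflow
from sklearn import datasets, metrics

# Toy regression dataset; any regression data would do.
boston = datasets.load_boston()
X, y = boston.data, boston.target

# Contiguous run: one fit() call for 200 steps.
reg_a = skflow.TensorFlowDNNRegressor(hidden_units=[10, 10], steps=200)
reg_a.fit(X, y)
print("contiguous MSE:", metrics.mean_squared_error(y, reg_a.predict(X)))

# Split run: 4 fit() calls of 50 steps each; continue_training=True makes
# each call resume from the previous weights instead of reinitializing.
reg_b = skflow.TensorFlowDNNRegressor(
    hidden_units=[10, 10], steps=50, continue_training=True)
for _ in range(4):
    reg_b.fit(X, y)
print("split MSE:", metrics.mean_squared_error(y, reg_b.predict(X)))
```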

ilblackdragon commented 8 years ago

There is one reason why this could be happening: every time fit() restarts, the data_feeder re-samples the data, so the model sees it in a different order than it would in a contiguous run.
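A toy illustration of that mechanism (this is just the general pattern, not skflow's actual data_feeder code):

```python
import numpy as np

def make_feeder(X, y, batch_size=32):
    """Each feeder draws its own shuffle, like a fresh fit() call would."""
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

X = np.arange(100).reshape(-1, 1)
y = np.arange(100, dtype=float)

# Two "restarts" build two feeders, so the batch order almost surely differs.
print(next(make_feeder(X, y))[0][:5].ravel())
print(next(make_feeder(X, y))[0][:5].ravel())
```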

I'll look into it more tomorrow to check that this is indeed the only reason.

Otherwise, a better comparison would be to let both cases train until convergence (i.e. until the loss stops going down); they should then end up with very similar scores.

olegarch commented 8 years ago

After each run of 50 steps I was executing the session to get test-set results, and that changed the random-generator state in the dropout operation. Dropout is still exercised on a non-training step, just with keep probability 1, so the random op fires anyway. As a result, subsequent training diverged slightly from the contiguous run. If I don't execute the session for test results, or if I remove the dropout layer, the training results for contiguous and non-contiguous runs match.
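A small demonstration of that effect with the TF API of that era (the shape and seed here are made up): because keep_prob arrives through a placeholder, tf.nn.dropout builds its random op unconditionally, and every session.run() of the graph draws from it and advances its state, even when the fed value is 1.0.

```python
import tensorflow as tf

tf.set_random_seed(42)
x = tf.ones([4])
keep_prob = tf.placeholder(tf.float32)
dropped = tf.nn.dropout(x, keep_prob)  # random op runs on every execution

with tf.Session() as sess:
    # "Training" step: dropout active.
    print(sess.run(dropped, feed_dict={keep_prob: 0.5}))
    # "Evaluation" step: output equals x, but the random op still fired,
    # so the RNG state differs from what a contiguous run would have seen.
    print(sess.run(dropped, feed_dict={keep_prob: 1.0}))
    # The next "training" step therefore gets a different dropout mask.
    print(sess.run(dropped, feed_dict={keep_prob: 0.5}))
```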

ilblackdragon commented 8 years ago

Yeah, one option is to remove dropout entirely in the non-training case, e.g. tf.cond(is_training, lambda: dropout(x, prob), lambda: x). Do you feel this difference is a big deal, or does it converge to the same results after enough iterations?
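A sketch of that guard (names like maybe_dropout and is_training are placeholders of my choosing): tf.cond only executes the taken branch at run time, so the dropout random op never fires during evaluation and the RNG state stays untouched.

```python
import tensorflow as tf

def maybe_dropout(x, keep_prob, is_training):
    # Only the chosen branch runs, so evaluation skips the random op entirely.
    return tf.cond(is_training,
                   lambda: tf.nn.dropout(x, keep_prob),
                   lambda: x)

x = tf.ones([4])
is_training = tf.placeholder(tf.bool)
out = maybe_dropout(x, 0.5, is_training)

with tf.Session() as sess:
    print(sess.run(out, feed_dict={is_training: False}))  # == x, no RNG draw
    print(sess.run(out, feed_dict={is_training: True}))   # dropout applied
```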

olegarch commented 8 years ago

It's not a big deal. Convergence is similar in both cases.

ilblackdragon commented 8 years ago

Closing as WAI (working as intended).