keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

training: result is different if model passed as argument to training loop #16240

Closed dwight-nwaigwe closed 1 year ago

dwight-nwaigwe commented 2 years ago

Please go to TF Forum for help and support:

https://discuss.tensorflow.org/tag/keras

If you open a GitHub issue, here is our policy:

It must be a bug, a feature request, or a significant problem with the documentation (for small docs fixes please send a PR instead). The form below must be filled out.

Here's why we have that policy:

Keras developers respond to issues. We want to focus on work that benefits the whole community, e.g., fixing bugs and adding features. Support only helps individuals. GitHub also notifies thousands of people when issues are filed. We want them to see you communicating an interesting problem, rather than being redirected to Stack Overflow.

System information.

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the problem.

Training a model gives different results depending on whether I pass the model as an argument to the custom training (sub)functions.

Describe the current behavior.

I have experimented with training a simple TensorFlow model in two scenarios: passing my model to my training loop (and to the subfunctions called from the training loop), versus not passing it. The two cases give different results. When I pass the model to the training functions, it is trained properly. In the second scenario, something is wrong because the model is apparently not trained correctly. I am baffled, and I wonder if it is a scoping issue.

To be more specific, my setup dynamically creates a new, larger model at each iteration of a for loop (adding some layers each time) and then trains the resulting model. As stated above, I train each model in two scenarios, passing the model to the training subfunctions or not, and the results differ. I verify how well a model is trained by feeding it a test sample (class-0 MNIST images) and checking whether the correct classification is output. The models trained by passing the model as an argument are trained correctly. If I do not pass the model as an argument to the training function and subfunctions, only the first model created by the for loop is trained correctly. Can this be explained?
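For illustration, here is a minimal sketch of the two call styles. It is not the attached training_issue.txt: the names (make_model, train_step_arg, train_step_global), the plain gradient-descent update, and the random data are hypothetical stand-ins, and the real reproduction builds MNIST models as described above.

```python
# Hypothetical sketch of the two training-step styles (not the attached repro).
import numpy as np
import tensorflow as tf

def make_model(extra_layers):
    # Build a slightly larger model on each outer iteration.
    layers = [tf.keras.layers.Flatten()]
    layers += [tf.keras.layers.Dense(64, activation="relu") for _ in range(extra_layers + 1)]
    layers += [tf.keras.layers.Dense(10)]
    return tf.keras.Sequential(layers)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Scenario 1: the model is passed explicitly to the training step.
@tf.function
def train_step_arg(model, x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    for var, grad in zip(model.trainable_variables, grads):
        var.assign_sub(0.01 * grad)  # plain gradient descent to keep the sketch small
    return loss

# Scenario 2: the training step refers to the module-level `model` by closure.
@tf.function
def train_step_global(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    for var, grad in zip(model.trainable_variables, grads):
        var.assign_sub(0.01 * grad)
    return loss

x = np.random.rand(32, 28, 28).astype("float32")
y = np.random.randint(0, 10, size=(32,)).astype("int32")

for extra_layers in range(3):
    model = make_model(extra_layers)   # rebinds the global `model` each iteration
    for _ in range(10):
        train_step_arg(model, x, y)    # always updates the current model
        train_step_global(x, y)        # the variant that misbehaves for me
```

The sketch only shows the structural difference between passing the model and referring to it from the enclosing scope; the attached file contains the full MNIST setup and the class-0 evaluation described above.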

Describe the expected behavior.

Training a model should not be different whether or not I pass the model as an argument to the custom training (sub)functions.

Contributing.

Standalone code to reproduce the issue: the code is attached as training_issue.txt.

Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.

Source code / logs.

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.

tilakrayal commented 2 years ago

@animalcroc, could you please try reducing the number of epochs and also executing on v2.8, and let us know if you are still facing the same issue. Thanks!

dwight-nwaigwe commented 2 years ago

@tilakrayal I reduced the epochs to 50 and upgraded to v2.8, but the issue persists. Perhaps this is a feature (albeit a dangerous one).

google-ml-butler[bot] commented 2 years ago

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

dwight-nwaigwe commented 2 years ago

> @tilakrayal I reduced the epochs to 50 and upgraded to v2.8, but the issue persists. Perhaps this is a feature (albeit a dangerous one).

@tilakrayal Are you able to follow up on this issue?

dwight-nwaigwe commented 2 years ago

> @animalcroc, could you please try reducing the number of epochs and also executing on v2.8, and let us know if you are still facing the same issue. Thanks!

@tilakrayal Please let me know if you are still able to assist.

tilakrayal commented 2 years ago

@sachinprasadhs, I was able to reproduce the issue in TF v2.8 and the nightly build. Please find the gist here.

sachinprasadhs commented 1 year ago

Hello, thank you for reporting an issue.

We're currently in the process of migrating the new Keras 3 code base from keras-team/keras-core to keras-team/keras. Consequently, this issue may not be relevant to the Keras 3 code base. After the migration is successfully completed, feel free to reopen this issue at keras-team/keras if you believe it remains relevant to Keras 3. If instead this issue is a bug or security issue in legacy tf.keras, you can report a new issue at keras-team/tf-keras, which hosts the TensorFlow-only, legacy version of Keras.

To know more about Keras 3, please read https://keras.io/keras_core/announcement/

google-ml-butler[bot] commented 1 year ago

Are you satisfied with the resolution of your issue?