keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0
61.41k stars 19.4k forks source link

Potential bug in legacy h5 weights loading. #19694

Open torzdf opened 2 months ago

torzdf commented 2 months ago

This may be working as intended, in which case feel free to close this issue.

There is a difference in top_level_model_weights loading between the functions load_weights_from_hdf5_group_by_name and load_weights_from_hdf5_group_by

Specifically, the load_weights_from_hdf5_group_by_name collects symbolic weights using the following code:

symbolic_weights = model.trainable_weights + model.non_trainable_weights

https://github.com/keras-team/keras/blob/da83683f5e92fa24a0ad7bf5dc034ea596346d21/keras/src/legacy/saving/legacy_h5_format.py#L485

In my usecase this returns 200 weights

Meanwhile, load_weights_from_hdf5_group collects symbolic weights using the following code:

        symbolic_weights = list(
            # model.weights
            v
            for v in model._trainable_variables + model._non_trainable_variables
            if v in model.weights
        )

https://github.com/keras-team/keras/blob/da83683f5e92fa24a0ad7bf5dc034ea596346d21/keras/src/legacy/saving/legacy_h5_format.py#L379

In my usecase this returns 0 weights (in fact model._trainable_variables + model._non_trainable_variables returns 0 weights even without filtering for variables within model.weights

I would expect both of these loading methodologies to return the same number of symbolic weights, but this is clearly not the case

I suspect that by_name is incorrect whilst the alternative is correct, but cannot be sure.

SuryanarayanaY commented 2 months ago

Hi @torzdf ,

Thanks for reporting. It would be appreciated if you can submit a code snippet for reproducing the error.

torzdf commented 2 months ago

Hi @torzdf ,

Thanks for reporting. It would be appreciated if you can submit a code snippet for reproducing the error.

Not straightforward, as I discovered this by injecting print statements at the above linked locations.

Either way, I have demonstrated where the code diverges, The difference is clear to see from the different variables that are used to collect the symbolic weights.

This was discovered by loading a model from Keras 2 into Keras 3 which adds further challenges providing reproducible code. I do not know if this issue exists for legacy models generated in Keras 3 as I resolved the issue my end by not using the by_name variant.

However, I raised this issue as I thought it may be a bug. so thought you may wish to be aware. It also may not be a bug.

At this point, what you do with this information is up to you.

Pagey commented 3 weeks ago

i can confirm this bug occurs in tensorflow 2.16 for models created saved and loaded on that version with the h5 format