tensorflow / addons

Useful extra functionality for TensorFlow 2.x maintained by SIG-addons
Apache License 2.0

Build against TF2.13 #2835

Closed: seanpmorgan closed this 1 year ago

seanpmorgan commented 1 year ago

Description

Waiting on:

seanpmorgan commented 1 year ago

@qlzh727 we're getting a lot of `Please ensure wrapped layer is a valid Keras layer` errors -- is there a new spec we have to comply with?

qlzh727 commented 1 year ago

I think there is a newly added check in https://github.com/keras-team/keras/commit/7eb8ef28155f20cbd78e1e24b6da5cfb656d4164#diff-5b101fb3499cadc6810aaa4ec68b961018d6d0c86ed9e9295e5ffb097e516be2 that type-checks the layer instance passed to the wrapper.
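
Roughly, the new check looks something like this (an illustrative sketch, not the exact Keras code):

import tensorflow as tf

# Illustration only: the wrapper rejects anything that is not a Keras Layer
# instance, which is what produces the error message quoted above.
def check_wrapped_layer(layer):
    if not isinstance(layer, tf.keras.layers.Layer):
        raise ValueError(
            "Please ensure wrapped layer is a valid Keras layer, "
            f"got: {layer!r}"
        )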

seanpmorgan commented 1 year ago

Thanks... this is causing breakage throughout the repo. Is there documentation you can share on what changed here? Can we use a legacy serialization instead?

I'm also seeing seq2seq break for many tests... mind taking a look when time allows?

qlzh727 commented 1 year ago

From the test error log, it seems that the wrapper was being fed a tensor rather than a layer instance. Is that expected behavior?

Also cc @nkovela1, who is the original author of the change on the Keras side.

seanpmorgan commented 1 year ago

> From the test error log, it seems that the wrapper was being fed a tensor rather than a layer instance. Is that expected behavior?
>
> Also cc @nkovela1, who is the original author of the change on the Keras side.

Ack, the test was to verify that an error is thrown on a Tensor input, and it now raises ValueError instead of AssertionError. We can test for either one using pytest without much issue.
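
For example, something along these lines (a sketch; TimeDistributed stands in here for whichever wrapper the test actually exercises):

import pytest
import tensorflow as tf

def test_wrapper_rejects_tensor_input():
    # Accept either exception type so the test passes on both old and new Keras.
    with pytest.raises((ValueError, AssertionError)):
        tf.keras.layers.TimeDistributed(tf.zeros([2, 2]))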

There is another error though that is only happening on TF2.13:

ValueError: Unknown layer: 'SpectralNormalization'. Please ensure you are using a `keras.utils.custom_object_scope` and that this object is included in the scope. See https://www.tensorflow.org/guide/keras/save_and_serialize#registering_the_custom_object for details.

We register our layers using `@tf.keras.utils.register_keras_serializable(package="Addons")`. Is there a new check that is forcing a `keras.utils.custom_object_scope`? Seems like a regression?
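
The registration pattern is roughly the following (a generic sketch, not the actual Addons source):

import tensorflow as tf

# Registered objects serialize under "Addons>ClassName", so loading them should
# not require wrapping everything in keras.utils.custom_object_scope.
@tf.keras.utils.register_keras_serializable(package="Addons")
class MyAddonLayer(tf.keras.layers.Layer):
    def call(self, inputs):
        return inputs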

qlzh727 commented 1 year ago

There is a fork of the SpectralNormalization layer that was sent to Keras in https://github.com/keras-team/keras/pull/17648, and I think you are now getting a naming conflict.

seanpmorgan commented 1 year ago

> There is a fork of the SpectralNormalization layer that was sent to Keras in keras-team/keras#17648, and I think you are now getting a naming conflict.

@qlzh727 It appears that this isn't related to Addons, as I can make minimal reproducible examples: https://github.com/tensorflow/tensorflow/issues/61085

Can you please help get this looked at before the final release is cut?

qlzh727 commented 1 year ago

>> There is a fork of the SpectralNormalization layer that was sent to Keras in keras-team/keras#17648, and I think you are now getting a naming conflict.
>
> @qlzh727 It appears that this isn't related to Addons, as I can make minimal reproducible examples: tensorflow/tensorflow#61085
>
> Can you please help get this looked at before the final release is cut?

Ack. CC'ed the relevant people on the issue.

bhack commented 1 year ago

@seanpmorgan You can cherry-pick https://github.com/tensorflow/addons/pull/2835/commits/a6ff9677ce8c641f8c7205a1e1402840793fe12e when 2.13 final is out.

seanpmorgan commented 1 year ago

Hi @qlzh727, when time allows could you look at the serialization tests for the seq2seq attention layers with TF 2.13? I was able to get the serialization tests working for all other layers/losses:

E           ValueError: Input 0 of layer "memory_layer" is incompatible with the layer: expected axis -1 of input shape to have value 8, but received input with shape (None, 5, 6)
E       
E       
E       Call arguments received by layer "BahdanauAttention" (type BahdanauAttention):
E         • inputs=tf.Tensor(shape=(None, 5, 6), dtype=float32)
E         • mask=tf.Tensor(shape=(None, 5), dtype=bool)
E         • setup_memory=True
E         • kwargs={'training': 'None'}
1. Modifying the attention layers to use use_legacy_format in keras serialization/deserialization didn't help
2. The model itself fails to serialize/deserialize using the .keras format in general

To reproduce:

pip install -r requirements.txt
pip install -r tools/install_deps/pytest.txt
pip install tensorflow==2.13.0
python -m pytest -k "test_save_load_layer" tensorflow_addons/seq2seq/tests/attention_wrapper_test.py -vv

qlzh727 commented 1 year ago

@nkovela1, when we changed the saving format internally, did we make any changes to the Addons tests?

nkovela1 commented 1 year ago

@qlzh727 Yes, I changed the Addons tests to use the legacy format where they are not compatible with the new serialization format and the new saving path. One of those affected is tensorflow_addons/seq2seq/attention_wrapper.py and its associated test.

It is not compatible with the new saving and has thus been set with the use_legacy_format=True and use_legacy_config=True flags. It would be best if the test could be disabled.

qlzh727 commented 1 year ago

FYI, can you try merging the change from https://github.com/tensorflow/addons/pull/2839 and running the tests again?

(I was trying to dig out the changes from our internal repo from when we switched the saving format.)

seanpmorgan commented 1 year ago

> @qlzh727 Yes, I changed the Addons tests to use the legacy format where they are not compatible with the new serialization format and the new saving path. One of those affected is tensorflow_addons/seq2seq/attention_wrapper.py and its associated test.
>
> It is not compatible with the new saving and has thus been set with the use_legacy_format=True and use_legacy_config=True flags. It would be best if the test could be disabled.

So I tried to simply flip use_legacy_format=True and didn't see that resolve the issue, but I would be happy to be wrong. (I didn't set model.use_legacy_format, so it could be that.)

"It would be best if the test could be disabled" -- please explain. It seems to me like a backwards-incompatible change is causing serialization of a commonly used layer to break?

seanpmorgan commented 1 year ago

Looks like including model.use_legacy_format passes with the old serialization flags. Thanks @qlzh727 @nkovela1 for the help. IMO it is a pretty poor experience that users will have to know that models using layers from seq2seq need to flip that switch going forward, but these are upstream decisions that impact downstream libraries.
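
For anyone following along, a rough sketch of the workaround based purely on the flags named in this thread (the exact attribute and keyword names are assumptions on my part, so check the actual test change):

import tensorflow as tf
from tensorflow import keras

# Assumed shape of the workaround discussed above: mark the model to use the
# legacy config and go through the legacy (de)serialization path.
model = keras.Sequential([keras.layers.Dense(4)])
model.use_legacy_config = True  # attribute name taken from the comments above

config = keras.layers.serialize(model, use_legacy_format=True)
restored = keras.layers.deserialize(config, use_legacy_format=True)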

nkovela1 commented 1 year ago

@seanpmorgan Glad to hear it works with the old serialization flags. The majority of TF OSS packages have been debugged for compatibility with the new format, along with backwards compatibility, so these flags are not necessary. However, given the TFAddons EOL and limited bandwidth, I have given TFAddons lower priority.

seanpmorgan commented 1 year ago

Hi @angerson, do you happen to know if there were build changes for macOS on TF2.13? I'm getting:

ERROR:delocate.libsana: /System/Library/Frameworks/Security.framework/Versions/A/Security not found:

Needed by: /Users/runner/hostedtoolcache/Python/3.8.17/x64/lib/python3.8/site-packages/tensorflow/libtensorflow_framework.2.dylib

Not sure why that system library wouldn't be found on the GitHub macOS runners, so I'm curious if there was anything special done on your runners. It's only happening on TF2.13.

angerson commented 1 year ago

That's strange. I'm not sure what could be going on there, no. @nitins17, do you have any ideas?

bhack commented 1 year ago

Can you check https://github.com/pytorch/pytorch/pull/85086 ?

bhack commented 1 year ago

@qlzh727 Regarding the GPU failures on the MirroredStrategy tests, has something changed in the API?

qlzh727 commented 1 year ago

> @qlzh727 Regarding the GPU failures on the MirroredStrategy tests, has something changed in the API?

I am not aware of any change to the mirrored strategy.

bhack commented 1 year ago

@qlzh727 The test is the same as always, but it is failing now. Here is the log: https://source.cloud.google.com/results/invocations/037ec635-4ae1-4658-84bc-d6ec0af40bba/log

qlzh727 commented 1 year ago

> @qlzh727 The test is the same as always, but it is failing now. Here is the log: https://source.cloud.google.com/results/invocations/037ec635-4ae1-4658-84bc-d6ec0af40bba/log

This is quite weird; the same test is passing for 2.12 and failing for 2.13? Do we see the same error if we build with tf-nightly?

seanpmorgan commented 1 year ago

>> @qlzh727 The test is the same as always, but it is failing now. Here is the log: https://source.cloud.google.com/results/invocations/037ec635-4ae1-4658-84bc-d6ec0af40bba/log
>
> This is quite weird; the same test is passing for 2.12 and failing for 2.13? Do we see the same error if we build with tf-nightly?

Yes, the tests pass on TF2.12: https://source.cloud.google.com/results/invocations/836b1250-2695-44d0-ad6d-b7cda5f65722/targets/tensorflow_addons%2Fubuntu%2Fgpu%2Fpy3%2Fpresubmit/log

Based on this commit: https://github.com/tensorflow/addons/pull/2835/commits/606071cc7e20246cb0ea34d3d4566e27f607faf0

seanpmorgan commented 1 year ago

Confirmed that it is TF2.13 and not the container:

https://source.cloud.google.com/results/invocations/67d3a1b3-2448-411d-9b83-2f90ab204982/targets/tensorflow_addons%2Fubuntu%2Fgpu%2Fpy3%2Fpresubmit/log

seanpmorgan commented 1 year ago

>> @qlzh727 The test is the same as always, but it is failing now. Here is the log: https://source.cloud.google.com/results/invocations/037ec635-4ae1-4658-84bc-d6ec0af40bba/log
>
> This is quite weird; the same test is passing for 2.12 and failing for 2.13? Do we see the same error if we build with tf-nightly?

See the tests above; confirmed that this is failing for TF2.13 and not TF2.12. It's also failing for tf-nightly: https://source.cloud.google.com/results/invocations/48c8ca19-45bd-433d-a259-b8cd64ed89bc/targets

qlzh727 commented 1 year ago

>>> @qlzh727 The test is the same as always, but it is failing now. Here is the log: https://source.cloud.google.com/results/invocations/037ec635-4ae1-4658-84bc-d6ec0af40bba/log
>>
>> This is quite weird; the same test is passing for 2.12 and failing for 2.13? Do we see the same error if we build with tf-nightly?
>
> See the tests above; confirmed that this is failing for TF2.13 and not TF2.12. It's also failing for tf-nightly: https://source.cloud.google.com/results/invocations/48c8ca19-45bd-433d-a259-b8cd64ed89bc/targets

Thanks for the confirmation. Since we should have the same test internally, let me see what the status is there.

seanpmorgan commented 1 year ago

> @seanpmorgan Glad to hear it works with the old serialization flags. The majority of TF OSS packages have been debugged for compatibility with the new format, along with backwards compatibility, so these flags are not necessary. However, given the TFAddons EOL and limited bandwidth, I have given TFAddons lower priority.

Just want to quickly address this comment. The upcoming EOL of TFA should not lower the priority of TFA testing. Through the years we probably caught 20-30 bugs in releases because of the number of tests we run. Checking TFA tests should help the Google team prevent breaking changes and net new bugs, not simply ensure that our builds pass so we can publish.

bhack commented 1 year ago

>>> @qlzh727 The test is the same as always, but it is failing now. Here is the log: https://source.cloud.google.com/results/invocations/037ec635-4ae1-4658-84bc-d6ec0af40bba/log
>>
>> This is quite weird; the same test is passing for 2.12 and failing for 2.13? Do we see the same error if we build with tf-nightly?
>
> See the tests above; confirmed that this is failing for TF2.13 and not TF2.12. It's also failing for tf-nightly: https://source.cloud.google.com/results/invocations/48c8ca19-45bd-433d-a259-b8cd64ed89bc/targets

@seanpmorgan Is it different on nightly?

RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf . Check the section C-API incompatibility at the Troubleshooting ImportError section at https://numpy.org/devdocs/user/troubleshooting-importerror.html#c-api-incompatibility for indications on how to solve this problem .

qlzh727 commented 1 year ago

I just saw a similar error in the Keras nightly build, which is caused by a numpy version change in tf-nightly.

bhack commented 1 year ago

It seems that with 2.13 we get an error when tracing:

@tf.function
def apply_gradients():
    opt.apply_gradients([(grads, var)])

# `device` here is the tf.distribute strategy the test runs under.
device.run(apply_gradients)
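
For a self-contained approximation of the same pattern (assuming `device` is a MirroredStrategy; this is a sketch of the failing test shape, not the exact TFA test code):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    var = tf.Variable(1.0)
    opt = tf.keras.optimizers.SGD(learning_rate=0.1)
grads = tf.constant(0.5)

@tf.function
def apply_gradients():
    opt.apply_gradients([(grads, var)])

# Passes on TF 2.12; on TF 2.13 the traceback below shows the failure in the
# optimizer's cross-replica all-reduce.
strategy.run(apply_gradients)
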
bhack commented 1 year ago

    def autograph_handler(*args, **kwargs):
      """Calls a converted version of original_func."""
      try:
        return api.converted_call(
            original_func,
            args,
            kwargs,
            options=converter.ConversionOptions(
                recursive=True,
                optional_features=autograph_options,
                user_requested=True,
            ))
      except Exception as e:  # pylint:disable=broad-except
        if hasattr(e, "ag_error_metadata"):
>         raise e.ag_error_metadata.to_exception(e)
E         tensorflow.python.autograph.impl.api.StagingError: in user code:
E         
E             File "/usr/local/lib/python3.9/dist-packages/keras/src/optimizers/utils.py", line 175, in _all_reduce_sum_fn  **
E                 return distribution.extended.batch_reduce_to(
E         
E             IndexError: tuple index out of range

/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/polymorphic_function/autograph_util.py:52: StagingError