huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Inconsistencies between `nn.functional.interpolate` and `tf.image.resize` #18811

Closed: sayakpaul closed this issue 1 year ago

sayakpaul commented 2 years ago

System Info

Upsampling of intermediate feature maps is common in computer vision models. In PyTorch, it's usually implemented with nn.functional.interpolate; in TensorFlow, it's usually done with tf.image.resize.

But there is an inconsistency between what these two methods yield (Colab Notebook). Sometimes the differences in their outputs are small enough to ignore (ViT, for example). But when interpolation is repeated several times in a model (MobileViT, for example), these small differences can add up and noticeably shift the model's final outputs. More details here.
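To make the inconsistency concrete, here is a minimal sketch along the lines of the notebook (made-up shapes; the exact difference will vary with shapes, scale factors, and hardware):

import numpy as np
import torch
import tensorflow as tf

x = np.random.rand(1, 8, 16, 16).astype("float32")  # (batch, channels, height, width)

# PyTorch expects channels-first; the default align_corners=None behaves like align_corners=False
pt_out = torch.nn.functional.interpolate(
    torch.from_numpy(x), size=(32, 32), mode="bilinear"
).numpy()

# tf.image.resize expects channels-last, so transpose before and after
tf_out = tf.image.resize(
    x.transpose(0, 2, 3, 1), size=(32, 32), method="bilinear"
).numpy().transpose(0, 3, 1, 2)

print("max abs diff:", np.abs(pt_out - tf_out).max())  # small but nonzero in float32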

@hollance wrote an amazing blog post discussing this issue.

Who can help?

@amyeroberts @gante @Rocketknight1

Reproduction

The Colab Notebook mentioned in the description.

Expected behavior

We should work on a TF utility that yields the same output as nn.functional.interpolate.

hollance commented 2 years ago

Note that align_corners=None does give the same result as tf.image.resize, to an absolute tolerance of 1e-6 or so:

import torch

# dummy_logits_pt is a (batch, channels, height, width) tensor; interp_shape is the target (height, width)
upsampled_logits_pt = torch.nn.functional.interpolate(
    dummy_logits_pt, size=interp_shape, mode="bilinear", align_corners=None
)
ariG23498 commented 2 years ago

I faced a similar problem with torch.nn.functional.interpolate.

Here is a colab notebook that details the solution that I proposed.

@amyeroberts and @gante were against using tf.compat.v1.image.resize, for obvious reasons. @amyeroberts did come up with a solution, which is documented in this comment. I hope this provides some value to this thread.

sayakpaul commented 2 years ago

Thanks @hollance and @ariG23498 for chiming in.

When there are only one or two interpolation ops, the small differences are fine, but when interpolation is done several times (as mentioned earlier), these differences compound and create mismatches in the final outputs.
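As a toy illustration (not from the notebook; real models interleave other layers between the resizes), letting each framework repeatedly resize its own previous output shows how the per-op differences carry forward instead of cancelling:

import numpy as np
import torch
import tensorflow as tf

pt_map = torch.rand(1, 4, 32, 32)                            # channels-first for PyTorch
tf_map = tf.constant(pt_map.numpy().transpose(0, 2, 3, 1))   # channels-last for TF

for step in range(5):
    # one round of 2x upsampling followed by 2x downsampling in each framework
    pt_map = torch.nn.functional.interpolate(pt_map, size=(64, 64), mode="bilinear")
    pt_map = torch.nn.functional.interpolate(pt_map, size=(32, 32), mode="bilinear")
    tf_map = tf.image.resize(tf_map, (64, 64), method="bilinear")
    tf_map = tf.image.resize(tf_map, (32, 32), method="bilinear")
    diff = np.abs(pt_map.numpy().transpose(0, 2, 3, 1) - tf_map.numpy()).max()
    print(f"round {step + 1}: max abs diff = {diff:.2e}")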

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Rocketknight1 commented 2 years ago

Pinging this to keep it open - we've implemented TF versions of Torch ops like AdaptivePool, so this might be something to explore as well. I hope a performant solution can be implemented with native ops (with or without XLA compilation) without needing to write our own CUDA, though!
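For a sense of what I mean by native ops, here is a minimal sketch (not the transformers implementation) of a 1D adaptive average pool in TF, assuming the input length is divisible by the output length:

import tensorflow as tf

def adaptive_avg_pool_1d(x, output_size):
    """Rough TF analogue of torch.nn.AdaptiveAvgPool1d for (batch, length, channels)
    inputs, assuming length is divisible by output_size."""
    batch, length, channels = x.shape
    x = tf.reshape(x, (batch, output_size, length // output_size, channels))
    return tf.reduce_mean(x, axis=2)  # average each equal-sized bin

pooled = adaptive_avg_pool_1d(tf.random.normal((2, 128, 16)), output_size=32)
print(pooled.shape)  # (2, 32, 16)

The non-divisible case needs unequal bins, which is where the implementation gets more involved.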

sayakpaul commented 2 years ago

Pinging this to keep it open - we've implemented TF versions of Torch ops like AdaptivePool, so this might be something to explore as well.

Could you point me to something relevant? Would love to see the new implementations.

Rocketknight1 commented 2 years ago

Hi @sayakpaul, I'm extremely sorry! I thought we'd implemented it in data2vec2 already, but checking the code it seems like we still have the old version there. Here's a much, much more performant version.

Rocketknight1 commented 2 years ago

The new version will also allow XLA compilation, unlike the original sparse implementation.

sayakpaul commented 2 years ago

Oh yeah, I remember this one. This reminded me that I need to update the data2vec code with your latest implementation.

Thanks, Matt!

Rocketknight1 commented 2 years ago

Although, checking out that notebook, it seems like PyTorch's interpolate and TF's resize are using the same algorithm, and the differences are mostly numerical; when I switch the dtype to float64, the max absolute difference is about 1e-7.
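As an illustrative version of that check (not necessarily what the notebook does), comparing TF's float32 output against a float64 PyTorch reference puts the remaining gap at roughly float32 rounding error:

import numpy as np
import torch
import tensorflow as tf

x = np.random.rand(1, 16, 16, 8)  # float64, channels-last

# TF computes in float32; the PyTorch reference stays in float64
tf_out = tf.image.resize(x.astype("float32"), (32, 32), method="bilinear").numpy()
pt_out = torch.nn.functional.interpolate(
    torch.from_numpy(x.transpose(0, 3, 1, 2)), size=(32, 32), mode="bilinear"
).numpy().transpose(0, 2, 3, 1)

print(np.abs(pt_out - tf_out).max())  # on the order of 1e-7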

I think this will make it extremely hard to bring the outputs any closer - we would have to rewrite the internals of the algorithm so that the two implementations are not just mathematically equivalent but also lose precision in the same places. I think we're stuck with the issue, unfortunately!

sayakpaul commented 2 years ago

Fair enough. The tiny differences just add up when there are multiple such interpolations. Otherwise, it's not a big deal.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ariG23498 commented 2 years ago

Ping.

I don't think it is resolved yet.

sayakpaul commented 2 years ago

I doubt it will ever be resolved in an apples-to-apples way, considering https://github.com/huggingface/transformers/issues/18811#issuecomment-1262563941.

We just need to be aware of this when comparing predictions of models (PT vs. TF) that stack interpolations, especially with regard to the tolerances we use.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.