I can confirm that this behavior is not present in other sentence encoders. Here's a Colab that verifies that: https://colab.research.google.com/gist/sayakpaul/c59d855e14a98a93362d3735ea67e6d2/scratchpad.ipynb.
Cc'ing @WGierke
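For anyone who wants to run the same kind of check outside the Colab, here's a minimal sketch; the model name below is just an illustrative pick and not necessarily one of the encoders used in the notebook:

```python
# Minimal determinism check for a PyTorch-based sentence encoder.
# "all-MiniLM-L6-v2" is an illustrative choice, not necessarily a model
# from the linked Colab.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["Puppies are nice.", "I enjoy long walks along the beach."]

# Encode the same sentences twice and compare the results.
emb_1 = model.encode(sentences)
emb_2 = model.encode(sentences)

# For these encoders the two runs should come out identical.
print("max abs difference:", np.abs(emb_1 - emb_2).max())
```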
Thank you @Nilabhra for filing this issue and @sayakpaul for the Colab! I'm not sure what is causing this so I reached out to the model authors. I'll update the thread once they respond.
The model authors replied that the difference in embeddings is very small and likely due to numerical instability in some underlying op. The difference should not affect any downstream usages, so there is no plan to fix this.
I'm going to mark this issue as closed, but feel free to reopen or comment if you have any follow-up questions.
This is very unlikely. As mentioned here, this is not a problem with other sentence encoders.
The sentence encoders from the Colab are using PyTorch (at least I think so, judging by the output of `pip install -U sentence-transformers`). https://tfhub.dev/google/universal-sentence-encoder-cmlm/en-base/1 uses TensorFlow. This is a significant difference, which might explain why the other models are numerically stable.
Yeah, you are absolutely right. But numerical instability is indeed a point of concern. We'll run a few more experiments with https://tfhub.dev/google/universal-sentence-encoder-cmlm/en-base/1 to verify whether this numerical instability leads to performance degradation.
Will you be able to communicate if something concerning comes up?
Absolutely, in case you spot any performance degradation, please report it - we'll forward the concerns to the model authors.
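In case it's useful, here's a rough sketch of the kind of degradation check discussed above: it compares the pairwise cosine similarities (a typical downstream use) produced by two repeated runs. The `run1.npy` / `run2.npy` files are hypothetical placeholders for pooled outputs saved from two passes over the same sentences:

```python
# Rough sketch of a downstream-impact check: do the run-to-run embedding
# differences change pairwise cosine similarities in any meaningful way?
# run1.npy / run2.npy are hypothetical files holding the pooled outputs of
# two repeated passes over the same sentences.
import numpy as np

emb_1 = np.load("run1.npy")
emb_2 = np.load("run2.npy")

def cosine_matrix(x):
    # Row-normalize, then take all pairwise dot products.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

sim_1 = cosine_matrix(emb_1)
sim_2 = cosine_matrix(emb_2)

print("max abs embedding difference:", np.abs(emb_1 - emb_2).max())
print("max abs cosine-similarity difference:", np.abs(sim_1 - sim_2).max())
# If the second number stays tiny (e.g. around 1e-6), a semantic-similarity
# pipeline is unlikely to be affected by the instability.
```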
After loading the CMLM model (available here: https://tfhub.dev/google/universal-sentence-encoder-cmlm/en-base/1), if the `pooled_output` is obtained for the same sequences repeatedly, I can see variations in the output embeddings. This Colab notebook replicates the issue: https://colab.research.google.com/drive/1iUUwNBQWaWJRgZ1ExMuzNhJyEfWCGyRJ?usp=sharing
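For reference, a minimal sketch of the repro, assuming the CMLM encoder is paired with the standard BERT English preprocessor from TF Hub (the linked Colab may use a different preprocessing setup):

```python
# Minimal repro sketch for the run-to-run variation in pooled_output.
# Assumes the standard TF Hub pairing with the BERT English preprocessor;
# the linked Colab may differ in details.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers ops the preprocessor needs

preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder-cmlm/en-base/1")

sentences = tf.constant(["Puppies are nice.", "The weather is lovely today."])
inputs = preprocessor(sentences)

# Run the encoder twice on identical inputs and compare the pooled outputs.
out_1 = encoder(inputs)["pooled_output"].numpy()
out_2 = encoder(inputs)["pooled_output"].numpy()

# With the behavior described above, this prints a small but non-zero value.
print("max abs difference between runs:", np.abs(out_1 - out_2).max())
```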