jxmorris12 / vec2text

utilities for decoding deep representations (like sentence embeddings) back to text

Support new text-embedding-3 small and large models from OpenAI #35

Open tipani86 opened 6 months ago

tipani86 commented 6 months ago

Hey there, interesting project!

Since OpenAI recently introduced new and even more powerful embedding models, text-embedding-3-small and text-embedding-3-large, I was wondering whether support for them could also be added?

I'm looking into testing this out with both the original ada-002 model and the newer embedding-3 models.

Edit: the intended application domain would not be adversarial privacy attacks but rather sentence translation, where getting the exact words right is not crucial as long as the general idea carries over.

Edit2: I already ran a few quick tests using your provided Colab demo. I put in some Chinese sentences, converted them to embedding vectors with ada-002, and then recovered them in English using the default pretrained models. It performed "okay-ish" with Chinese, but surprisingly, using Swedish or Finnish as the origin language produced weird, non-English results, as if your inversion model had also included non-English tokens in its training data. Or maybe the sentences I tried in those languages were out of distribution for the pretrained inversion or correction models; in any case, it was really just a quick test.
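For reference, the cross-lingual test above can be sketched roughly like this. This is just my understanding from the READMEs of the `openai` and `vec2text` packages (the corrector name and function signatures come from there and may have changed); imports are deferred inside the functions so the sketch loads even without those packages installed:

```python
def embed_texts(texts, model="text-embedding-ada-002"):
    """Embed texts with the OpenAI API (imports deferred so the sketch
    can be read/loaded without the packages installed)."""
    import torch
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(input=texts, model=model)
    return torch.tensor([item.embedding for item in response.data])


def invert_to_text(embeddings, num_steps=20):
    """Invert embedding vectors back to text with vec2text's
    pretrained ada-002 corrector (English training data)."""
    import vec2text

    corrector = vec2text.load_pretrained_corrector("text-embedding-ada-002")
    return vec2text.invert_embeddings(
        embeddings=embeddings, corrector=corrector, num_steps=num_steps
    )


if __name__ == "__main__":
    # Chinese input: the pretrained models were trained on English,
    # so the recovered text comes out in (approximate) English.
    emb = embed_texts(["今天天气很好"])
    print(invert_to_text(emb))
```

This is what made the Swedish/Finnish behavior surprising to me: the corrector only ever decodes toward its (presumably English) training distribution, so truly non-English output suggests something else in the training data.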

I think ideally each destination language would need its own pretrained inversion model, trained on a dataset containing only samples in that language, and we'd just swap the decoder model depending on what the output language is supposed to be, right?

You could also go super granular and pretrain a domain-specific inversion model, or even a customer-specific one, if, say, you are serving a large client with very specific internal jargon... no?