jxmorris12 / vec2text

utilities for decoding deep representations (like sentence embeddings) back to text

Any limitation on the targeted embedding model? #58

Closed · qwcai closed this issue 1 month ago

qwcai commented 1 month ago

Thanks for the excellent work. I wonder if there is any limitation on the targeted embedding model.

For example, the paper "Transferable Embedding Inversion Attack: Uncovering Privacy Risks in Text Embeddings without Model Queries" says: "Along similar lines, the following research (Morris et al., 2023) further reveals that an adversary can recover 92% of a 32-token text input given embeddings from a T5-based pre-trained transformer."

  1. Does this work only support inversion attacks on T5-based embeddings?
  2. To train the inversion model for another embedding model, is there any guideline, for example, on how to set `--model_name_or_path`? Could you provide an example for a BERT-like embedding model?
jxmorris12 commented 1 month ago

Hm. I don't know of this paper but it sounds interesting. Our embedding attack should work on any embedding; in our paper we try it on embeddings from the OpenAI API, which are very likely not produced by T5.
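
For reference, end-to-end usage on arbitrary embeddings looks roughly like the sketch below (assuming the pretrained ada-002 corrector and the `load_pretrained_corrector` / `invert_embeddings` helpers from the README):

```python
import torch
import vec2text

# Load the pretrained corrector released for OpenAI text-embedding-ada-002.
corrector = vec2text.load_pretrained_corrector("text-embedding-ada-002")

# Any (batch, 1536) tensor of ada-002 embeddings works here; the random
# tensor below is only a placeholder for embeddings fetched from the API.
embeddings = torch.randn(2, 1536)

recovered = vec2text.invert_embeddings(
    embeddings=embeddings,  # move to GPU with .cuda() if the model is on GPU
    corrector=corrector,
    num_steps=20,  # more correction steps generally recover more of the text
)
print(recovered)
```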

qwcai commented 1 month ago

Thanks for the reply.

I want to check whether a BERT-like embedding model is immune to this attack. The model is described in this paper: https://arxiv.org/pdf/2309.07597#page=1.28, and is "based on the BERT-like architecture".
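
Concretely, embeddings from that family are produced with CLS pooling plus L2 normalization, roughly like the sketch below (the checkpoint name is only an illustrative member of that model family):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative BGE-style checkpoint; the "large" variant produces 1024-dim embeddings.
name = "BAAI/bge-large-en-v1.5"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

batch = tokenizer(["An example sentence to embed."],
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state

# CLS pooling followed by L2 normalization, as recommended for these models.
embeddings = torch.nn.functional.normalize(hidden[:, 0], dim=-1)
print(embeddings.shape)  # torch.Size([1, 1024])
```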

I used the example parameters for the base (inversion) model, i.e. `--model_name_or_path t5-base --dataset_name msmarco --embedder_model_name gtr_base --num_repeat_tokens 16 --embedder_no_grad True --num_train_epochs 100 --max_eval_samples 500 --eval_steps 20000 --warmup_steps 10000 --bf16=1 --use_wandb=1 --use_frozen_embeddings_as_input True --experiment inversion --lr_scheduler_type constant_with_warmup --exp_group_name oct-gtr --learning_rate 0.001 --output_dir ./saves/gtr-1 --save_steps 2000`, and likewise the example parameters for the corrector model.

It seems that for this model the inversion works poorly. Could you provide some suggestions for the training parameters? Thanks.

jxmorris12 commented 1 month ago

What do you mean when you say that this configuration works poorly? It looks like your command is set up to use GTR-base embeddings, which is not correct.

qwcai commented 1 month ago

By "the inversion works poorly" I mean that the recovered sentence is incorrect. So I wonder whether different embedding models require different commands, for example, whether I need to change `--model_name_or_path` and `--dataset_name`?

I replaced "gtr_base" in `--embedder_model_name` with the target embedding model. Do other configurations in the command need to change? Also, the embedding size is 1024; does that matter?

Most importantly, I find that the tokenizer.json of the output only contains English words. Does the solution work for other languages?

jxmorris12 commented 1 month ago

We use the T5 inverter, which is trained on English, French, Romanian, and German. If you want to use another inversion backbone you can, but you might have to change some code. The embedding size should be handled dynamically and shouldn't make a difference.
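
As a quick sanity check (an illustration, not code from this repo), you can see the language limitation by comparing how the t5-base tokenizer handles non-English text against a multilingual checkpoint such as google/mt5-base: anything the inverter's tokenizer cannot represent collapses to unknown tokens and can never be recovered.

```python
from transformers import AutoTokenizer

text = "这是一个中文句子。"  # a Chinese sentence, outside the T5 training languages

for name in ("t5-base", "google/mt5-base"):
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok(text, add_special_tokens=False).input_ids
    unk = sum(i == tok.unk_token_id for i in ids)
    # Expect far more unknown ids from t5-base than from the multilingual mt5-base.
    print(f"{name}: {len(ids)} tokens, {unk} unknown")
```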