jxmorris12 / vec2text

utilities for decoding deep representations (like sentence embeddings) back to text

Question about embedding_transform #38

Open icerooqiu opened 5 months ago

icerooqiu commented 5 months ago

Hi,

Thank you for presenting your research. I have a question about `embedding_transform` in `inversion.py`. As I understand it, this function corresponds to the MLP described in your paper that transforms logits into pseudo-embeddings, i.e. the "zero-step" model. Could you elaborate on how this MLP is trained so that it generates meaningful predictions? The paper and README seem to lack detail on this point, and any additional insight would be greatly appreciated. Thank you.
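For readers following along, here is a minimal sketch of what an embedding-to-pseudo-embedding transform of this kind could look like. The class name, hidden size, and `num_repeat` are illustrative assumptions, not the actual vec2text implementation:

```python
import torch
import torch.nn as nn

# Illustrative sketch (not the exact vec2text code): an MLP that maps a single
# frozen embedding of size d_embed into a sequence of `num_repeat` pseudo-token
# embeddings of size d_model that the encoder-decoder can attend to.
class EmbeddingTransform(nn.Module):
    def __init__(self, d_embed: int, d_model: int, num_repeat: int, d_hidden: int = 1024):
        super().__init__()
        self.num_repeat = num_repeat
        self.d_model = d_model
        self.mlp = nn.Sequential(
            nn.Linear(d_embed, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, num_repeat * d_model),
        )

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        # embedding: (batch, d_embed) -> pseudo-embeddings: (batch, num_repeat, d_model)
        out = self.mlp(embedding)
        return out.reshape(-1, self.num_repeat, self.d_model)
```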

jxmorris12 commented 5 months ago

Hi @icerooqiu -- we train the whole model end-to-end to generate text conditioned on embeddings. So the MLP layer is updated via gradient descent to try to make the correct text more likely given the input embedding. Does that answer your question?
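To make the end-to-end setup concrete, here is a rough single-training-step sketch that reuses the illustrative `EmbeddingTransform` above. The model choice, dimensions, and variable names are assumptions rather than the exact vec2text training code; the point is simply that the cross-entropy loss on the target text backpropagates through the MLP as well:

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# EmbeddingTransform is the illustrative class sketched above; 768 matches t5-base.
transform = EmbeddingTransform(d_embed=768, d_model=model.config.d_model, num_repeat=16)
optimizer = torch.optim.AdamW(
    list(model.parameters()) + list(transform.parameters()), lr=2e-4
)

frozen_embedding = torch.randn(1, 768)  # stand-in for the frozen embedder output
labels = tokenizer("the original text", return_tensors="pt").input_ids

# Generate text conditioned on the pseudo-embeddings; the loss on the target
# tokens sends gradients through both the seq2seq model and the MLP.
pseudo_embeddings = transform(frozen_embedding)          # (1, 16, d_model)
outputs = model(inputs_embeds=pseudo_embeddings, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```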

icerooqiu commented 5 months ago

> Hi @icerooqiu -- we train the whole model end-to-end to generate text conditioned on embeddings. So the MLP layer is updated via gradient descent to try to make the correct text more likely given the input embedding. Does that answer your question?

Thank you for replying. I have two more questions about the model training:

  1. I don't fully understand this line from `_process_embedder_output`: `logits = outputs.logits[torch.arange(B), attention_mask.sum(1) - 1]`. I tested it and it returns only one logit vector per sequence. Assuming a single input text with input ids of length N and a vocabulary of size M, the full logits tensor has shape (1, N, M), but this line gives (1, M). Do we use only that one logit vector for the subsequent prediction? What happens to the rest? (See the toy demonstration after this list.)
  2. My other question is about the mock embedder: when should we use it? I tried the command from the README, `python vec2text/run.py --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --max_seq_length 128 --num_train_epochs 100 --max_eval_samples 1000 --eval_steps 25000 --warmup_steps 100000 --learning_rate 0.0002 --dataset_name one_million_instructions --model_name_or_path t5-base --use_wandb=0 --embedder_model_name gpt2 --experiment inversion_from_logits_emb --bf16=1 --embedder_torch_dtype float16 --lr_scheduler_type constant_with_warmup --use_frozen_embeddings_as_input 1 --mock_embedder 0`, but I ran into an out-of-memory error during training. When I set `mock_embedder` to true, training runs, but the results are terrible. I am not sure whether `mock_embedder` is the cause, since I thought all the embeddings had already been pre-computed and saved in the cache directory.
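For reference on question 1, here is a toy demonstration (with made-up shapes) of what that indexing selects: the logit vector at the last non-padding position of each sequence, i.e. one `(M,)` vector per example.

```python
import torch

# Toy demonstration of the indexing in _process_embedder_output (shapes are made up).
B, N, M = 2, 5, 7                       # batch size, sequence length, vocab size
logits = torch.randn(B, N, M)           # stand-in for outputs.logits
attention_mask = torch.tensor([
    [1, 1, 1, 0, 0],                    # sequence 1 has 3 real tokens
    [1, 1, 1, 1, 1],                    # sequence 2 has 5 real tokens
])

# attention_mask.sum(1) - 1 is the index of the last non-padding position,
# so this picks one (M,)-sized logit vector per sequence: shape (B, M).
last_logits = logits[torch.arange(B), attention_mask.sum(1) - 1]
print(last_logits.shape)                # torch.Size([2, 7])
```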