jxmorris12 / vec2text

utilities for decoding deep representations (like sentence embeddings) back to text

Missing Pooling Layer in sentence transformers #28

Closed sciencecw closed 7 months ago

sciencecw commented 7 months ago

This is partly an issue for sentence-transformers or Hugging Face. Sentence Transformers models loaded from the Hugging Face Hub only give you the encoder and tokenizer. For example, you have to rely on this function to apply a final mean-pooling step and produce the embeddings:

import torch
import vec2text
from transformers import PreTrainedModel, PreTrainedTokenizer


def get_gtr_embeddings(text_list,
                       encoder: PreTrainedModel,
                       tokenizer: PreTrainedTokenizer) -> torch.Tensor:
    # Tokenize and move the batch to the GPU
    inputs = tokenizer(text_list,
                       return_tensors="pt",
                       max_length=128,
                       truncation=True,
                       padding="max_length").to("cuda")

    with torch.no_grad():
        # Run the encoder and mean-pool the token states into one vector per sentence
        model_output = encoder(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])
        hidden_state = model_output.last_hidden_state
        embeddings = vec2text.models.model_utils.mean_pool(hidden_state, inputs['attention_mask'])

    return embeddings
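
For reference, a minimal usage sketch for this function, assuming a CUDA device is available and that the encoder and tokenizer are loaded from the Hub checkpoint (roughly as in the repo README):

import transformers

# Load the GTR encoder stack and tokenizer from the Hub checkpoint
encoder = transformers.AutoModel.from_pretrained("sentence-transformers/gtr-t5-base").encoder.to("cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("sentence-transformers/gtr-t5-base")

embeddings = get_gtr_embeddings(["This is an example sentence"], encoder, tokenizer)
print(embeddings.shape)  # torch.Size([1, 768])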

But for GTR this is actually not the embedding that comes out of the sentence-transformers library, because a further Dense layer (and a normalization step) is applied. While the Dense layer's weights are stored in the Hugging Face repository, it's not clear you can get them through the transformers library:

In [30]: from sentence_transformers import SentenceTransformer
    ...: sentences = ["This is an example sentence", "Each sentence is converted"]
    ...:
    ...: stmodel = SentenceTransformer('sentence-transformers/gtr-t5-base')

In [31]: stmodel
Out[31]:
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: T5EncoderModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Dense({'in_features': 768, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Normalize()
)

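To make the discrepancy concrete, here is a hypothetical comparison (not from the original thread). It assumes encoder and tokenizer are the T5 encoder and tokenizer passed to get_gtr_embeddings above, and that a CUDA device is available:

import torch

sentences = ["This is an example sentence", "Each sentence is converted"]

# Full sentence-transformers pipeline: Transformer -> Pooling -> Dense -> Normalize
st_embeddings = stmodel.encode(sentences, convert_to_tensor=True)

# Manual pipeline from above: Transformer -> mean pooling only (normalized here for a fairer comparison)
manual_embeddings = get_gtr_embeddings(sentences, encoder, tokenizer)
manual_embeddings = torch.nn.functional.normalize(manual_embeddings, dim=-1)

# The Dense projection (module (2) above) is missing from the manual pipeline,
# so the two sets of embeddings generally do not match
print(torch.allclose(st_embeddings.cpu(), manual_embeddings.cpu(), atol=1e-4))
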
Just a suggestion: would it be better to use the sentence-transformers class as the default, and use transformers only as a fallback?
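
A possible shape for that default (a hypothetical sketch, not the repo's current API): prefer the full sentence-transformers pipeline when the package is installed, and fall back to the manual transformers path otherwise.

def get_embeddings(text_list, model_name: str = "sentence-transformers/gtr-t5-base"):
    # Hypothetical helper: the sentence-transformers path applies Pooling -> Dense -> Normalize;
    # the fallback reproduces only the mean pooling, as in get_gtr_embeddings above.
    try:
        from sentence_transformers import SentenceTransformer
        return SentenceTransformer(model_name).encode(text_list, convert_to_tensor=True)
    except ImportError:
        import transformers
        encoder = transformers.AutoModel.from_pretrained(model_name).encoder.to("cuda")
        tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
        return get_gtr_embeddings(text_list, encoder, tokenizer)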

jxmorris12 commented 7 months ago

Thanks for documenting. We discussed this in #25 here: https://github.com/jxmorris12/vec2text/pull/25#issuecomment-1870788435

I agree this is an issue! I'm happy to change the default, since it's difficult to use GTR with transformers; how would you suggest we go about doing that?

sciencecw commented 7 months ago

I see that you have already incorporated some SentenceTransformer support in the inversion model. I wonder if you just need to retrain your default inversion model, or if something else is missing?

(But see the other issue, where I tested your all-MiniLM-L6-v2 checkpoint and it didn't quite work.)

ArvinZhuang commented 7 months ago

Hi @jxmorris12 @sciencecw

I have trained a vec2text model with the correct gtr-t5-base embeddings. You can find it here: ielabgroup/vec2text_gtr-base-st_corrector

I trained it on the NQ dataset with max length 32 and batch size 512 for 50 epochs. I have tested it with the evaluation code provided in the README; it seems better than jxm/gtr__nq__32.

Example code:


from sentence_transformers import SentenceTransformer
import vec2text
import transformers

# Load the inversion and corrector models trained on sentence-transformers GTR embeddings
inversion_model = vec2text.models.InversionModel.from_pretrained(
    "ielabgroup/vec2text_gtr-base-st_inversion"
)
model = vec2text.models.CorrectorEncoderModel.from_pretrained(
    "ielabgroup/vec2text_gtr-base-st_corrector"
)

# Wrap the inversion model in a trainer so the corrector can call it
inversion_trainer = vec2text.trainers.InversionTrainer(
    model=inversion_model,
    train_dataset=None,
    eval_dataset=None,
    data_collator=transformers.DataCollatorForSeq2Seq(
        inversion_model.tokenizer,
        label_pad_token_id=-100,
    ),
)

model.config.dispatch_batches = None
corrector = vec2text.trainers.Corrector(
    model=model,
    inversion_trainer=inversion_trainer,
    args=None,
    data_collator=vec2text.collator.DataCollatorForCorrection(
        tokenizer=inversion_trainer.model.tokenizer
    ),
)

# Embed with sentence-transformers so the Dense and Normalize layers are applied
model = SentenceTransformer('sentence-transformers/gtr-t5-base')
embeddings = model.encode([
    "Jack Morris is a PhD student at Cornell Tech in New York City",
    "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity"
], convert_to_tensor=True).to('mps')

vec2text.invert_embeddings(
    embeddings=embeddings,
    corrector=corrector,
    num_steps=20,
)

['         Jack Morris is a PhD student at Cornell Tech in New York', 'It was the best of times, it was the worst of times, it was the epoch of incredulity, it was age of']

jxmorris12 commented 7 months ago

Thank you so much! This is a super useful contribution!!

ArvinZhuang commented 7 months ago

@jxmorris12 Glad to contribute! :)

I have a question though. I trained the model on one H100 GPU, and the full pipeline (including inversion and corrector training) takes about 5 days, which is very slow. Does this sound right to you? In your paper you report that training takes 2 days with 4 A6000 GPUs, but I tested training with 2 H100s and distributed training doesn't seem to speed things up much.

jxmorris12 commented 7 months ago

Actually, this does seem right. I think the durations in the paper aren't right for the corrector model. (I should probably update that, since I've been getting a lot of questions about it; sorry.) Doubling the number of GPUs should give you a 2x speedup, though, if you're using distributed data parallel with torchrun like I described.

jxmorris12 commented 7 months ago

Seems like this is fixed, then, thanks to @ArvinZhuang!! :)