themurtazanazir / vec2text

utilities for decoding deep representations (like sentence embeddings) back to text

Reverse-direction decoding #8

Open mattf1n opened 3 days ago

mattf1n commented 3 days ago

1: it will do better and be more general; 2: pre-training hurts the model.

themurtazanazir commented 2 days ago

@mattf1n do we also reverse the order of <bos> and <eos> tokens? if prompt was <bos>t1 t2 t3<eos>, do we train on <bos>t3 t2 t1<eos> or <eos>t3 t2 t1<bos>?

mattf1n commented 2 days ago

Oh interesting question. I think we can just try it both ways and see if it makes a difference. Let's start by treating all tokens equally, i.e., no special treatment for special tokens. Also, I'm surprised that a prompt would end in an EOS token?
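For concreteness, treating all tokens equally would just mean reversing each tokenized sequence wholesale, special tokens included. A rough sketch (reverse_ids is only an illustrative name, not something in the repo):

def reverse_ids(batch_input_ids):
    # Reverse every sequence as-is, special tokens included,
    # e.g. <bos> t1 t2 t3 <eos>  ->  <eos> t3 t2 t1 <bos>
    return [ids[::-1] for ids in batch_input_ids]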

themurtazanazir commented 1 day ago

We can't treat all special tokens equally (well, we can, but should we?), specifically the <pad> tokens. We don't want to output a bunch of <pad> tokens at the beginning of each sentence.

> Also I'm surprised that a prompt would end in an EOS token?

Yes, the T5 tokenizer in particular appends an </s> token.

To reproduce:

import transformers

t = transformers.AutoTokenizer.from_pretrained("t5-base")
tokenized = t(["hi how are you"])
print(tokenized)
print(t.batch_decode(tokenized["input_ids"]))
{'input_ids': [[7102, 149, 33, 25, 1]], 'attention_mask': [[1, 1, 1, 1, 1]]}
['hi how are you</s>']

Token id 1 is the </s> token added by the T5 tokenizer.

mattf1n commented 1 day ago

Yeah, you are right. Now I think we should tokenize without special tokens, reverse the order, then add the special tokens back.
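A rough sketch of that order of operations, reusing the t5-base tokenizer from the snippet above (reverse_tokenize is just an illustrative name, not something in the repo):

import transformers

t = transformers.AutoTokenizer.from_pretrained("t5-base")

def reverse_tokenize(texts):
    # 1. tokenize without adding special tokens
    ids = t(texts, add_special_tokens=False)["input_ids"]
    # 2. reverse only the content tokens
    reversed_ids = [seq[::-1] for seq in ids]
    # 3. let the tokenizer add its special tokens back (for t5, a trailing </s>)
    return [t.build_inputs_with_special_tokens(seq) for seq in reversed_ids]

print(reverse_tokenize(["hi how are you"]))
# expected, given the ids above: [[25, 33, 149, 7102, 1]]

That way the reversed sequence still ends with </s>, the same stopping convention the forward tokenization uses.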