Open Avafish opened 12 months ago
We used a subset of the CoNLL dataset (see https://github.com/ganeshjawahar/interpret_bert for how to obtain the features and labels) and simply passed the features through RoBERTa's original tokenizer and embedder to get text_xs.npy. text_ys.npy is just the labels.
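For reference, a minimal sketch of how features of this kind could be produced and saved. This is an illustration, not the actual preprocessing code: the sentence list, label array, and max length of 32 below are hypothetical stand-ins.

```python
import numpy as np
import torch
from transformers import AutoTokenizer, RobertaModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")
model.eval()

# Hypothetical stand-ins for the CoNLL subset and its labels
sentences = ["we were wrong ."]
labels = [[0]]

with torch.no_grad():
    # Pad/truncate every sentence to a fixed length of 32 tokens (assumed)
    enc = tokenizer(sentences, padding="max_length", max_length=32,
                    truncation=True, return_tensors="pt")
    # RoBERTa's embedding layer: word + positional embeddings
    embs = model.embeddings(enc["input_ids"])

# Flatten to (n_sentences * 32, 768) and save
np.save("text_xs.npy", embs.reshape(-1, embs.size(-1)).numpy())
np.save("text_ys.npy", np.array(labels))
```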
Hi Junhong, thank you very much for your work and for sharing! May I ask: when you pass the CoNLL tokens to RoBERTa's embedder, are positional embeddings added to the final embeddings?
Yes, we perform distribution matching using the embedded features with the positional embeddings added (you can see this in our implementation of the task-specific embedder, which has a positional embedding layer in it). Hope this helps!
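For illustration, a minimal sketch of an embedder of this kind, i.e. token embeddings plus a learned positional embedding layer. The class name, vocabulary size, and max length below are hypothetical, not taken from the ORCA code:

```python
import torch
import torch.nn as nn

class TaskEmbedder(nn.Module):
    """Sketch: token embeddings plus learned positional embeddings."""
    def __init__(self, vocab_size=50265, max_len=32, dim=768):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)

    def forward(self, ids):
        # positions 0..seq_len-1, broadcast over the batch dimension
        pos = torch.arange(ids.size(1), device=ids.device)
        return self.tok_emb(ids) + self.pos_emb(pos)

emb = TaskEmbedder()
out = emb(torch.zeros(2, 32, dtype=torch.long))
print(out.shape)  # torch.Size([2, 32, 768])
```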
Thank you deeply for your reply!
I am now trying to switch the pretrained models to decoder-only language models (GPT-2 & OPT). So I'm first trying to reproduce the process of extracting embeddings from the CoNLL dataset using RoBERTa, so that a similar process can be applied to the new pretrained models.
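For what it's worth, the analogous step for a decoder-only model looks slightly different. A sketch for GPT-2, assuming the goal is word plus positional embeddings mirroring what RobertaModel.embeddings computes: GPT-2 has no single embeddings module and instead exposes the word and positional embedding tables as wte and wpe.

```python
import torch
from transformers import AutoTokenizer, GPT2Model

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("we were wrong.", return_tensors="pt")
ids = inputs["input_ids"]
positions = torch.arange(ids.size(1)).unsqueeze(0)

# Word embeddings (wte) plus positional embeddings (wpe)
embs = model.wte(ids) + model.wpe(positions)
print(embs.shape)  # (batch, seq_len, 768)
```

Note that unlike RoBERTa, GPT-2 has no [CLS]-style prefix token, so the first position is simply the first subword of the sentence.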
If possible, could you please help me resolve the following confusion?
[CLS], token1, token2, token3, ..., token29, [PAD], [SEP]
Also, based on an observation of text_xs.npy: it contains 2000*32 embeddings, and within every group of 32 embeddings (32 being, I presume, the predefined length of each sentence), the first embedding (of length 768) is always exactly the same:
[0.0040, -0.0641, -0.2424, -0.0008, ..., 0.0532, -0.1747, 0.0889, 0.0088]
I guess this should be the embedding of RoBERTa's [CLS] token.
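One way to check this observation programmatically. This is a sketch: the layout of 32-token sentences with 768-dim embeddings is assumed from the description above, and the demo uses synthetic data in place of the real text_xs.npy:

```python
import numpy as np

def first_embedding_is_shared(xs, seq_len=32):
    # Reshape (n_sent * seq_len, dim) -> (n_sent, seq_len, dim) and
    # check that every sentence starts with the same first embedding
    sent = xs.reshape(-1, seq_len, xs.shape[-1])
    return bool(np.allclose(sent[:, 0, :], sent[0, 0, :]))

# Synthetic stand-in for text_xs.npy: 4 sentences sharing their first embedding
rng = np.random.default_rng(0)
sents = rng.standard_normal((4, 32, 768))
sents[:, 0, :] = rng.standard_normal(768)
print(first_embedding_is_shared(sents.reshape(-1, 768)))  # True
```

Running the same function on the real file (np.load("text_xs.npy")) would confirm whether the first embedding truly repeats every 32 rows.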
However, when I use a roberta-base model to extract the embedding of the [CLS] token:
from transformers import AutoTokenizer, RobertaModel

# Load the pretrained roberta-base tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")
model.eval()

# Tokenize a sample sentence and inspect the token IDs
inputs = tokenizer("we were wrong.", return_tensors="pt")
print(inputs['input_ids'])

# Pass the IDs through RoBERTa's embedding layer (word + positional embeddings)
embs = model.embeddings(inputs['input_ids'])
print(embs[0, 0])  # embedding at position 0, i.e. the <s> ([CLS]) token
I got the following:
tensor([[ 0, 1694, 58, 1593, 4, 2]])
tensor([ 1.6637e-01, -5.4084e-02, -1.3613e-03, -3.3884e-03, ..., -1.8384e-02, -8.1125e-02, 7.9372e-02, 1.5456e-02], grad_fn=<SelectBackward0>)
which does not match the first embedding in text_xs.npy. I guess the roberta-base model I use is not the one that was used to extract the embeddings? Or is the way I extract embeddings wrong?
Thank you for your great work! I'm just wondering how to generate text_xs.npy and text_ys.npy, since I'm trying to replace the RoBERTa model with other transformer models. You didn't mention it in the README file; are the files generic? Looking forward to your reply.