sjunhongshen / ORCA

Official implementation of ORCA proposed in the paper "Cross-Modal Fine-Tuning: Align then Refine"
MIT License
65 stars · 2 forks

How can I get the text_xs and text_ys #4

Open Avafish opened 12 months ago

Avafish commented 12 months ago

Thank you for your great work! I'm wondering how to obtain text_xs.npy and text_ys.npy, since I'm trying to replace the RoBERTa model with other transformer models. You didn't mention them in the README file; are they generic? Looking forward to your reply.

sjunhongshen commented 11 months ago

We used a subset of the CoNLL dataset (see https://github.com/ganeshjawahar/interpret_bert for how to obtain the features and labels) and simply passed the features through RoBERTa's original tokenizer and embedder to get text_xs.npy. text_ys.npy is just the labels.
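The data flow described here can be sketched as follows. This is a minimal stand-in, not the actual ORCA code: the max sentence length of 32, the pad id, and the vocabulary size are assumptions, and a random lookup table replaces RoBERTa's real embedder; only the array shapes (one 768-dim row per token slot in text_xs) reflect the thread.

```python
import numpy as np

# Hypothetical stand-in for RoBERTa's tokenizer/embedder: a random
# lookup table mapping token ids to 768-dim vectors. In the real
# pipeline these rows would come from roberta-base's embedding layer.
VOCAB, DIM, MAX_LEN = 100, 768, 32
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((VOCAB, DIM)).astype(np.float32)

def embed_sentence(token_ids, max_len=MAX_LEN):
    """Pad/truncate a sentence to max_len, then look up embeddings."""
    ids = list(token_ids)[:max_len]
    ids += [1] * (max_len - len(ids))      # 1 = assumed pad id
    return embedding_table[np.array(ids)]  # shape (max_len, DIM)

# Two toy "sentences" of token ids with per-token labels (as in CoNLL).
sentences = [[5, 7, 9], [3, 4, 8, 2]]
labels = [[0, 1, 2], [1, 1, 0, 2]]

# text_xs: one embedding row per token slot; text_ys: just the labels.
text_xs = np.concatenate([embed_sentence(s) for s in sentences])
text_ys = labels
print(text_xs.shape)  # → (64, 768), i.e. (num_sentences * MAX_LEN, DIM)
```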

Zenvi commented 11 months ago

Hi Junhong, thank you very much for your work and for sharing! May I ask: when you pass CoNLL tokens to RoBERTa's embedder, are positional embeddings added to the final embeddings?

sjunhongshen commented 11 months ago

Yes, we perform distribution matching using the embedded features with the positional embeddings added (you can see this in our implementation of the task-specific embedder, which has a positional embedding layer in it). Hope this helps!
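A minimal sketch of what "positional embeddings added" means here, with stand-in random tables rather than the actual ORCA or RoBERTa weights (all sizes are assumptions):

```python
import numpy as np

# Stand-in embedding tables; in the real embedder these are learned.
VOCAB, DIM, MAX_LEN = 100, 768, 32
rng = np.random.default_rng(1)
token_table = rng.standard_normal((VOCAB, DIM)).astype(np.float32)
pos_table = rng.standard_normal((MAX_LEN, DIM)).astype(np.float32)

def embed(token_ids):
    """Token embedding plus positional embedding, position-wise."""
    tok = token_table[np.asarray(token_ids)]  # (L, DIM)
    pos = pos_table[: len(token_ids)]         # (L, DIM)
    return tok + pos

out = embed([5, 7, 9])
# The same token id at different positions gets different embeddings:
print(np.allclose(embed([5, 5])[0], embed([5, 5])[1]))  # → False
```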


Zenvi commented 11 months ago

Thank you deeply for your reply!

I am now trying to switch the pretrained model to decoder-only language models (GPT-2 and OPT). So I am first trying to reproduce the process of extracting embeddings from the CoNLL dataset using RoBERTa, so that a similar process can be applied to the new pretrained models.

If you don't mind, could you please help me clear up the following points of confusion?

  1. For pretrained language models, is the source dataset CoNLL-2000 or CoNLL-2003? I used https://github.com/ganeshjawahar/interpret_bert to process both and found that CoNLL-2000 has 7 classes, which matches the number of classes in text_ys.npy, whereas CoNLL-2003 does not.
  2. Which version of the RoBERTa model was used to extract the source dataset embeddings? After inspecting the JSON file produced by https://github.com/ganeshjawahar/interpret_bert, I presume the token sequence fed to RoBERTa's embedder has the following structure (supposing the sentence has 29 tokens): [CLS], token1, token2, token3, ..., token29, [PAD], [SEP]. I also observed that text_xs.npy contains 2000*32 embeddings, and within every group of 32 (which I presume is the predefined sentence length), the first embedding (of length 768) is always exactly the same: [0.0040, -0.0641, -0.2424, -0.0008, ..., 0.0532, -0.1747, 0.0889, 0.0088]. I guess this is the embedding of RoBERTa's [CLS] token. However, when I use a roberta-base model to extract the embedding of the [CLS] token:
    from transformers import AutoTokenizer, RobertaModel

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = RobertaModel.from_pretrained("roberta-base")
    model.eval()

    inputs = tokenizer("we were wrong.", return_tensors="pt")
    print(inputs['input_ids'])
    # Pass the ids through only the embedding layer (token + positional
    # + token-type embeddings, followed by LayerNorm).
    embs = model.embeddings(inputs['input_ids'])
    print(embs[0, 0])  # embedding at the first (<s>/CLS) position

I got the following:

tensor([[   0, 1694,   58, 1593,    4,    2]])
tensor([ 1.6637e-01, -5.4084e-02, -1.3613e-03, -3.3884e-03, ..., -1.8384e-02, -8.1125e-02,  7.9372e-02,  1.5456e-02], grad_fn=<SelectBackward0>)

which does not match the first embedding in text_xs.npy. Is the roberta-base model I used different from the one used to extract the embeddings, or is the way I extracted the embeddings wrong?
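One way to make this comparison systematic is to check the candidate model's first-token embedding against the first row of text_xs.npy with a tolerance, rather than by eye. The helper below is hypothetical (not from the repo), and the short vectors are just the truncated values quoted above, used as stand-ins for the full 768-dim rows:

```python
import numpy as np

def first_token_matches(extracted, stored, atol=1e-4):
    """Check whether a candidate model's first-token embedding matches
    the first embedding stored in text_xs.npy, within tolerance."""
    return bool(np.allclose(extracted, stored, atol=atol))

# Stand-in vectors (truncated values from the thread, for illustration).
stored = np.array([0.0040, -0.0641, -0.2424, -0.0008])
candidate = np.array([0.16637, -0.054084, -0.0013613, -0.0033884])

print(first_token_matches(candidate, stored))      # → False
print(first_token_matches(stored, stored.copy()))  # → True
```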