stanford-crfm / BioMedLM

589 stars · 61 forks

sentence embedding #15

Open orhansonmeztr opened 1 year ago

orhansonmeztr commented 1 year ago

Hi. First of all, thank you for making such a model available to us. I am trying to get vector embeddings for the abstracts of some PubMed articles, but somehow I couldn't get the sentence embeddings. More precisely, I wrote the code below, and the vectors I obtain have dimension 2560. But the Hugging Face page says the sequence length is 1024, so I understood that the dimension of an embedding vector should be 1024. Am I wrong? Can you help with getting sentence embeddings? Best wishes. Orhan

import json

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BioMedLM")
model = AutoModel.from_pretrained("BioMedLM")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style tokenizers have no pad token by default

with open('articles.json', 'r') as f:
    data = json.load(f)
data_abst = [data[i]['abstract'] for i in range(len(data))]
data_title = [data[i]['title'] for i in range(len(data))]

def normalizer(x):
    # L2-normalize the vector
    return x / np.linalg.norm(x)

class BioMedLM:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def sentence_vectors(self, sentence):
        inputs = self.tokenizer(sentence, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)

        # First element of the model output holds the last hidden states,
        # shape [batch, seq_len, hidden_size]
        token_embeddings = outputs[0]
        # Mean pooling: average the token vectors, ignoring padding positions
        input_mask_expanded = inputs.attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        vec = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return vec[0]

gpt_class = BioMedLM(model, tokenizer)

def sentence_encoder(data):
    vectors = []
    normalized_vectors = []
    for sentence in data:
        vec = gpt_class.sentence_vectors(sentence).detach().numpy()
        vectors.append(vec)
        normalized_vectors.append(normalizer(vec))

    vectors = np.squeeze(np.array(vectors))
    normalized_vectors = np.squeeze(np.array(normalized_vectors))

    return vectors, normalized_vectors

abst_vectors, abst_vectors_norm = sentence_encoder(data_abst) 
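For what it's worth, a toy illustration (random tensors, not the actual model) of why the pooled vector comes out 2560-dimensional: mean pooling averages over the sequence axis, so the hidden size survives, while 1024 is the maximum number of tokens in an input, not the width of each token vector.

```python
import torch

hidden_size = 2560  # BioMedLM's hidden dimension
seq_len = 7         # a short input; anything up to the 1024-token limit

token_embeddings = torch.randn(1, seq_len, hidden_size)  # [batch, seq, hidden]
attention_mask = torch.ones(1, seq_len)

# Same mean-pooling arithmetic as the code above
mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
pooled = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

print(pooled.shape)  # torch.Size([1, 2560]) -- the hidden size, not the sequence length
```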
J38 commented 1 year ago

I'm not super familiar with generating document-level representations from GPT-2 models, but your code looks like it is summing the hidden states across positions and normalizing? That would give a 2560-dimensional vector. Another option is to just take the final hidden state, which would also be 2560-dimensional. Either way, I would expect the document-level vector to be 2560-dimensional, since whatever algorithm you use combines the size-2560 token vectors into one final vector.

Could you point to a paper, algorithm, or code showing how you want to generate the final abstract-level representations? As I said, it looks like your method is to add up all of the final hidden states and normalize. I think typically one would just take the final hidden state of the sequence.
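That "final hidden state" approach could be sketched roughly like this (a toy example with made-up names and shapes, not code from this repo): with right padding, the index of the last real token in each sequence is `attention_mask.sum(1) - 1`.

```python
import torch

def last_token_vectors(token_embeddings, attention_mask):
    """token_embeddings: [batch, seq, hidden]; attention_mask: [batch, seq] of 0/1."""
    last_idx = attention_mask.sum(dim=1).long() - 1      # index of last real token per sequence
    batch_idx = torch.arange(token_embeddings.size(0))
    return token_embeddings[batch_idx, last_idx]         # [batch, hidden]

# Toy check: hidden size 4, two sequences with 3 and 2 real tokens
emb = torch.arange(2 * 5 * 4, dtype=torch.float).reshape(2, 5, 4)
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 0, 0, 0]])
vecs = last_token_vectors(emb, mask)
print(vecs.shape)  # torch.Size([2, 4])
```

With the real model, `token_embeddings` would be the `outputs[0]` tensor from the forward pass, and the result keeps the model's hidden size (2560) as its dimension.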

What task do you want to use these abstract-level vectors for?