seyonechithrananda / bert-loves-chemistry

bert-loves-chemistry: a repository of HuggingFace models applied to chemical SMILES data for drug design, chemical modelling, etc.
MIT License

Getting SMILES Vector in Pretrained ChemBERTa Model #58

Closed · ilkersigirci closed this issue 11 months ago

ilkersigirci commented 1 year ago

Firstly, thanks for open-sourcing the model. I want to cluster SMILES compounds using their pretrained model embeddings. I implemented the following code, but it doesn't work for more than about 1000 rows of data. I'm not sure whether I'm doing this correctly, since I haven't seen any official documentation for it. The only related code in the repo is the viz_utils.gen_embeddings function, and unfortunately it also struggles with larger SMILES datasets.

I am open to suggestions. Thanks in advance.

from transformers import RobertaTokenizerFast, RobertaModel
from sklearn.cluster import KMeans
import torch

model_name = "DeepChem/ChemBERTa-77M-MLM"
tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
model = RobertaModel.from_pretrained(model_name, output_hidden_states = True)
model.eval()

smiles_compounds = [
    "O=C(Cc1cccc2ccccc12)Nc1n[nH]c2ccc(N3CCCS3(=O)=O)cc12",
    "COC(=O)NC[C@@H](NC(=O)c1ccc(-c2nc(C3CCOCC3)cnc2N)cc1F)c1cccc(Br)c1",
    "COc1ccccc1Nc1cc(Oc2cc(C)c(C)nc2-c2ccccn2)ccn1",
    "O=C(/C=C/CN1CCCC1)N1CCOc2cc3ncnc(Nc4ccc(F)c(Cl)c4)c3cc21",
]

inputs = tokenizer(smiles_compounds, return_tensors='pt', padding=True, truncation=True)

with torch.no_grad():
    out = model(**inputs)

# Last hidden layer: shape [len(smiles_compounds), seq_len, 384],
# where seq_len (65 here) is the padded length of this batch.
states = out.hidden_states[-1]

# Average only the real (non-padding) token vectors for each sample,
# which gives a single 384-dimensional embedding per compound.
mask = inputs["attention_mask"].unsqueeze(-1)
states_2d = ((states * mask).sum(dim=1) / mask.sum(dim=1)).numpy()
print(states_2d.shape)

kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(states_2d)

clusters = kmeans.predict(states_2d)
print(clusters)
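
If the failure on larger datasets is an out-of-memory error from encoding every compound in one forward pass, embedding the SMILES in small batches should sidestep it. Here is a minimal sketch of that approach; the embed_smiles helper and the batch size of 64 are illustrative choices, not something from this repo:

import numpy as np
import torch
from transformers import RobertaTokenizerFast, RobertaModel

model_name = "DeepChem/ChemBERTa-77M-MLM"
tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
model = RobertaModel.from_pretrained(model_name)
model.eval()

def embed_smiles(smiles_list, batch_size=64):
    """Masked-mean ChemBERTa embeddings, computed batch by batch."""
    chunks = []
    for i in range(0, len(smiles_list), batch_size):
        batch = smiles_list[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt",
                           padding=True, truncation=True)
        with torch.no_grad():
            out = model(**inputs)
        # Masked mean over real (non-padding) tokens only, as above.
        mask = inputs["attention_mask"].unsqueeze(-1)
        emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
        chunks.append(emb.numpy())
    return np.concatenate(chunks, axis=0)

The resulting array can be passed straight to KMeans.fit exactly as above.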
Chris-Tang6 commented 10 months ago

Hi ilker, I ran into a problem when tokenizing SMILES sequences, and I noticed that your example SMILES contain Cl atoms, so I want to know whether you hit the same issue.

The input sequence is COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl, but the tokenizer output incorrectly labels each Cl as C, and I don't know how to fix it. I also found that the ChemBERTa token table does include a 'Cl' token.

(screenshot: ChemBERTa vocabulary table showing the 'Cl' token)

The output of tokenizer.tokenize() is as follows:

['C', 'O', 'C', '1', '=', 'C', '(', 'C', '=', 'C', '2', 'C', '(', '=', 'C', '1', ')', 'C', 'C', 'N', '=', 'C', '2', 'C', '3', '=', 'C', 'C', '(', '=', 'C', '(', 'C', '=', 'C', '3', ')', 'C', ')', 'C', ')', 'C']
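
One way to narrow this down is to check the vocabulary directly and to compare tokenize() with the ids produced by the full encoding call: if 'Cl' is in the vocabulary but never appears among the decoded ids, the split is happening during pre-tokenization rather than because the token is missing. A small diagnostic sketch, using only standard transformers tokenizer methods (nothing here is from this repo):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("DeepChem/ChemBERTa-77M-MLM")
smiles = "COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl"

# Is 'Cl' really a single token in the vocabulary?
print("Cl" in tok.get_vocab())

# Compare the surface tokenization with the ids the model actually sees.
print(tok.tokenize(smiles))
print(tok.convert_ids_to_tokens(tok(smiles)["input_ids"]))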