seyonechithrananda / bert-loves-chemistry

bert-loves-chemistry: a repository of HuggingFace models applied to chemical SMILES data for drug design, chemical modelling, etc.
MIT License

Getting SMILES Vector in Pretrained ChemBERTa Model #58

Closed · ilkersigirci closed this issue 11 months ago

ilkersigirci commented 1 year ago

Firstly, thanks for open-sourcing the model. I want to cluster SMILES compounds using their pretrained model embeddings. I implemented the following code, but it doesn't work for more than about 1000 rows of data. I'm not sure whether I'm doing this correctly, since I haven't seen any official documentation for it. The only related code in the repo is the viz_utils.gen_embeddings function, and unfortunately it also struggles with larger SMILES datasets.

I am open to suggestions. Thanks in advance.

from transformers import RobertaTokenizerFast, RobertaModel
from sklearn.cluster import KMeans
import torch

model_name = "DeepChem/ChemBERTa-77M-MLM"
tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
model = RobertaModel.from_pretrained(model_name, output_hidden_states = True)
model.eval()

smiles_compounds = [
    "O=C(Cc1cccc2ccccc12)Nc1n[nH]c2ccc(N3CCCS3(=O)=O)cc12",
    "COC(=O)NC[C@@H](NC(=O)c1ccc(-c2nc(C3CCOCC3)cnc2N)cc1F)c1cccc(Br)c1",
    "COc1ccccc1Nc1cc(Oc2cc(C)c(C)nc2-c2ccccn2)ccn1",
    "O=C(/C=C/CN1CCCC1)N1CCOc2cc3ncnc(Nc4ccc(F)c(Cl)c4)c3cc21",
]

inputs = tokenizer(smiles_compounds, return_tensors='pt', padding=True, truncation=True)

with torch.no_grad():
    out = model(**inputs)

# Last hidden layer: shape [len(smiles_compounds), seq_len, 384],
# where seq_len (65 here) is the padded length of this batch.
states = out.hidden_states[-1]

# Average only the real (non-padding) token vectors for each sample,
# which gives a single 384-dimensional embedding per compound.
mask = inputs["attention_mask"].unsqueeze(-1)
states_2d = ((states * mask).sum(dim=1) / mask.sum(dim=1)).numpy()
print(states_2d.shape)

kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(states_2d)

clusters = kmeans.predict(states_2d)
print(clusters)
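
If the failure on larger datasets is an out-of-memory error from encoding every compound in one forward pass, embedding the SMILES in small batches should sidestep it. Here is a minimal sketch of that approach; the embed_smiles helper and the batch size of 64 are illustrative choices, not something from this repo:

import numpy as np
import torch
from transformers import RobertaTokenizerFast, RobertaModel

model_name = "DeepChem/ChemBERTa-77M-MLM"
tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
model = RobertaModel.from_pretrained(model_name)
model.eval()

def embed_smiles(smiles_list, batch_size=64):
    """Masked-mean ChemBERTa embeddings, computed batch by batch."""
    chunks = []
    for i in range(0, len(smiles_list), batch_size):
        batch = smiles_list[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt",
                           padding=True, truncation=True)
        with torch.no_grad():
            out = model(**inputs)
        # Masked mean over real (non-padding) tokens only, as above.
        mask = inputs["attention_mask"].unsqueeze(-1)
        emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
        chunks.append(emb.numpy())
    return np.concatenate(chunks, axis=0)

The resulting array can be passed straight to KMeans.fit exactly as above.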
Chris-Tang6 commented 10 months ago

Hi ilker, I ran into a problem when tokenizing SMILES sequences, and I noticed that your example SMILES contain Cl atoms, so I want to know whether you hit the same issue.

The input sequence is COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl, but the tokenizer output incorrectly labels each Cl as C, and I don't know how to fix it. I also found that the ChemBERTa token table does include a 'Cl' token.

(screenshot: ChemBERTa vocabulary table showing the 'Cl' token)

The output of tokenizer.tokenize() is as follows:

['C', 'O', 'C', '1', '=', 'C', '(', 'C', '=', 'C', '2', 'C', '(', '=', 'C', '1', ')', 'C', 'C', 'N', '=', 'C', '2', 'C', '3', '=', 'C', 'C', '(', '=', 'C', '(', 'C', '=', 'C', '3', ')', 'C', ')', 'C', ')', 'C']
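
One way to narrow this down is to check the vocabulary directly and to compare tokenize() with the ids produced by the full encoding call: if 'Cl' is in the vocabulary but never appears among the decoded ids, the split is happening during pre-tokenization rather than because the token is missing. A small diagnostic sketch, using only standard transformers tokenizer methods (nothing here is from this repo):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("DeepChem/ChemBERTa-77M-MLM")
smiles = "COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl"

# Is 'Cl' really a single token in the vocabulary?
print("Cl" in tok.get_vocab())

# Compare the surface tokenization with the ids the model actually sees.
print(tok.tokenize(smiles))
print(tok.convert_ids_to_tokens(tok(smiles)["input_ids"]))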