theislab / chemCPA

Code for "Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution", NeurIPS 2022.
https://arxiv.org/abs/2204.13545
MIT License

Add ChemicalRepresentation #15

Closed · siboehm closed this 2 years ago

siboehm commented 2 years ago

Add a class that takes in a list of canonical SMILES strings and returns a fixed-size embedding for each molecule (a `torch.Tensor`).

Interface

```python
import torch

# Each embedding model will inherit from this interface.
class ChemicalRepresentation:
    @classmethod
    def dim(cls) -> int:
        # the number of latent dimensions for this model
        raise NotImplementedError

    def __init__(self, dataset: str):
        # loads the model into memory
        raise NotImplementedError

    def encode(self, molecules: list[str]) -> torch.Tensor:
        # encodes the given list of SMILES strings into an embedding tensor
        # with shape (len(molecules), cls.dim())
        raise NotImplementedError

    def decode(self, emb: torch.Tensor) -> list[str]:
        # decodes an embedding tensor back into a list of SMILES strings
        raise NotImplementedError


# maps the model name to the class
EMBEDDING_MODELS = {
    "GROVER": GroverRepresentation,
}
```
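For concreteness, here is a minimal sketch of what a concrete subclass could look like, using RDKit Morgan fingerprints as a stand-in encoder. The class name and fingerprint settings are illustrative assumptions, not part of this issue; `GroverRepresentation` would instead wrap a pretrained GROVER model.

```python
import numpy as np
import torch
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Hypothetical example subclass. decode() is omitted because fingerprints
# are not invertible, so the base class's NotImplementedError applies.
class MorganFingerprintRepresentation(ChemicalRepresentation):
    @classmethod
    def dim(cls) -> int:
        return 2048  # fingerprint length (assumed)

    def __init__(self, dataset: str):
        # fingerprints need no pretrained weights, so nothing to load
        self.dataset = dataset

    def encode(self, molecules: list[str]) -> torch.Tensor:
        rows = []
        for smiles in molecules:
            mol = Chem.MolFromSmiles(smiles)
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=self.dim())
            arr = np.zeros(self.dim(), dtype=np.float32)
            DataStructs.ConvertToNumpyArray(fp, arr)
            rows.append(arr)
        return torch.from_numpy(np.stack(rows))
```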

```python
from pathlib import Path

import pandas

# For the given list of SMILES strings, returns a dataframe with two columns:
#   "SMILES":    the SMILES string
#   "embedding": the embedding as a numpy array
# Casting back to a torch tensor has to be done at the dataloading level,
# e.g. in the __getitem__ method of a Dataset.
def get_chemical_representation_df(molecules: list[str], embedding_model: str, dataset: str, cache_dir="datasets/embedding"):
    cache_file = Path(cache_dir) / f"{embedding_model}_{dataset}_df.parquet" if cache_dir is not None else None
    if cache_file is not None and cache_file.exists():
        # load the cached dataframe and return it
        return pandas.read_parquet(cache_file)
    model = EMBEDDING_MODELS[embedding_model](dataset)
    embedding = model.encode(molecules)
    df = pandas.DataFrame.from_dict({"SMILES": molecules, "embedding": list(embedding.numpy())})
    if cache_file is not None:
        df.to_parquet(cache_file)
    return df
```
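To illustrate the "casting back at the dataloading level" comment, a dataset's `__getitem__` could look roughly like this. This is only a sketch; `SmilesEmbeddingDataset` is a hypothetical name, not a class in the repo.

```python
import pandas
import torch
from torch.utils.data import Dataset

# Hypothetical dataset wrapping the cached embedding dataframe; embeddings
# are stored as numpy arrays and cast back to torch tensors on access.
class SmilesEmbeddingDataset(Dataset):
    def __init__(self, df: pandas.DataFrame):
        self.df = df

    def __len__(self) -> int:
        return len(self.df)

    def __getitem__(self, idx: int):
        row = self.df.iloc[idx]
        emb = torch.as_tensor(row["embedding"], dtype=torch.float32)
        return row["SMILES"], emb
```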
MxMstrmn commented 2 years ago

Current Status

Similar to your GROVER embedding, pretrained GNNs can now be used to compute embeddings. The workflow is similar to yours, using parquet for dumping. See here: https://github.com/theislab/chemical_CPA/blob/0226a49485b5977e07fdd923e679d241a25569f4/notebooks/embedding_pretrained_gnn.ipynb
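As a sketch of how such a dumped parquet file could be read back into a single embedding matrix (the file path below is a made-up example following the `{embedding_model}_{dataset}_df.parquet` scheme, not the notebook's actual output):

```python
import numpy as np
import pandas
import torch

# hypothetical path; one parquet file per (embedding model, dataset) pair
df = pandas.read_parquet("datasets/embedding/GROVER_sciplex_df.parquet")
emb_matrix = torch.from_numpy(np.stack(df["embedding"].to_list()))
# emb_matrix.shape == (n_molecules, latent_dim)
```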

Thoughts on how to proceed

Since we now store those embeddings, what is left is a thin class with a single linear layer that maps the respective model's latent space to the dimension used downstream (see the sketch below).
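A minimal sketch of such a class, assuming the precomputed embeddings stay frozen and only the linear projection is trained; the name and dimensions are illustrative:

```python
import torch
import torch.nn as nn

# Hypothetical projection module: maps a pretrained model's latent space
# (embedding_dim) into the latent dimension used downstream (latent_dim).
class EmbeddingProjection(nn.Module):
    def __init__(self, embedding_dim: int, latent_dim: int):
        super().__init__()
        self.linear = nn.Linear(embedding_dim, latent_dim)

    def forward(self, drug_emb: torch.Tensor) -> torch.Tensor:
        # drug_emb: (batch_size, embedding_dim) precomputed embeddings
        return self.linear(drug_emb)
```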

For this, we would have to adjust:

siboehm commented 2 years ago

I agree with the three thoughts. Re 3.): Yes, we have to change this. I've already adapted the SigmoidDoser; the MLP still needs to be changed.