theislab / chemCPA

Code for "Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution", NeurIPS 2022.
https://arxiv.org/abs/2204.13545
MIT License

Add ChemicalRepresentation #15

Closed · siboehm closed this 2 years ago

siboehm commented 2 years ago

Add a class that takes in a list of canonical SMILES strings and returns a fixed-size embedding for each molecule (a `torch.Tensor`).

Interface

```python
import torch

# Each embedding model will inherit from this interface.
class ChemicalRepresentation:
    @classmethod
    def dim(cls) -> int:
        # the number of latent dimensions for this model
        raise NotImplementedError

    def __init__(self, dataset: str):
        # loads the model into memory
        raise NotImplementedError

    def encode(self, molecules: list[str]) -> torch.Tensor:
        # encodes the given list of SMILES strings into an embedding tensor
        # with shape (len(molecules), cls.dim())
        raise NotImplementedError

    def decode(self, emb: torch.Tensor) -> list[str]:
        # decodes an embedding tensor back into a list of SMILES strings
        raise NotImplementedError


# maps the model name to the class
EMBEDDING_MODELS = {
    "GROVER": GroverRepresentation,
}
```
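For concreteness, here is a minimal sketch of what a concrete subclass could look like, using RDKit Morgan fingerprints as a stand-in encoder. The class name and fingerprint settings are illustrative assumptions, not part of this issue; `GroverRepresentation` would instead wrap a pretrained GROVER model.

```python
import numpy as np
import torch
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Hypothetical example subclass. decode() is omitted because fingerprints
# are not invertible, so the base class's NotImplementedError applies.
class MorganFingerprintRepresentation(ChemicalRepresentation):
    @classmethod
    def dim(cls) -> int:
        return 2048  # fingerprint length (assumed)

    def __init__(self, dataset: str):
        # fingerprints need no pretrained weights, so nothing to load
        self.dataset = dataset

    def encode(self, molecules: list[str]) -> torch.Tensor:
        rows = []
        for smiles in molecules:
            mol = Chem.MolFromSmiles(smiles)
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=self.dim())
            arr = np.zeros(self.dim(), dtype=np.float32)
            DataStructs.ConvertToNumpyArray(fp, arr)
            rows.append(arr)
        return torch.from_numpy(np.stack(rows))
```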

```python
from pathlib import Path

import pandas

# For the given list of SMILES strings, returns a dataframe with two columns:
#   "SMILES":    the SMILES string
#   "embedding": the embedding as a numpy array
# Casting back to a torch tensor has to be done at the dataloading level,
# e.g. in the __getitem__ method of a Dataset.
def get_chemical_representation_df(molecules: list[str], embedding_model: str, dataset: str, cache_dir="datasets/embedding"):
    cache_file = Path(cache_dir) / f"{embedding_model}_{dataset}_df.parquet" if cache_dir is not None else None
    if cache_file is not None and cache_file.exists():
        # load the cached dataframe and return it
        return pandas.read_parquet(cache_file)
    model = EMBEDDING_MODELS[embedding_model](dataset)
    embedding = model.encode(molecules)
    df = pandas.DataFrame.from_dict({"SMILES": molecules, "embedding": list(embedding.numpy())})
    if cache_file is not None:
        df.to_parquet(cache_file)
    return df
```
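To illustrate the "casting back at the dataloading level" comment, a dataset's `__getitem__` could look roughly like this. This is only a sketch; `SmilesEmbeddingDataset` is a hypothetical name, not a class in the repo.

```python
import pandas
import torch
from torch.utils.data import Dataset

# Hypothetical dataset wrapping the cached embedding dataframe; embeddings
# are stored as numpy arrays and cast back to torch tensors on access.
class SmilesEmbeddingDataset(Dataset):
    def __init__(self, df: pandas.DataFrame):
        self.df = df

    def __len__(self) -> int:
        return len(self.df)

    def __getitem__(self, idx: int):
        row = self.df.iloc[idx]
        emb = torch.as_tensor(row["embedding"], dtype=torch.float32)
        return row["SMILES"], emb
```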
MxMstrmn commented 2 years ago

Current Status

Similar to your GROVER embedding, pretrained GNNs can now be used to compute embeddings. The workflow is similar to yours, using parquet for dumping. See here: https://github.com/theislab/chemical_CPA/blob/0226a49485b5977e07fdd923e679d241a25569f4/notebooks/embedding_pretrained_gnn.ipynb
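As a sketch of how such a dumped parquet file could be read back into a single embedding matrix (the file path below is a made-up example following the `{embedding_model}_{dataset}_df.parquet` scheme, not the notebook's actual output):

```python
import numpy as np
import pandas
import torch

# hypothetical path; one parquet file per (embedding model, dataset) pair
df = pandas.read_parquet("datasets/embedding/GROVER_sciplex_df.parquet")
emb_matrix = torch.from_numpy(np.stack(df["embedding"].to_list()))
# emb_matrix.shape == (n_molecules, latent_dim)
```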

Thoughts on how to proceed

Since we now store those embeddings, what is left is a thin class with a single linear layer that maps the respective model's latent space to the dimension used downstream (see the sketch below).
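A minimal sketch of such a class, assuming the precomputed embeddings stay frozen and only the linear projection is trained; the name and dimensions are illustrative:

```python
import torch
import torch.nn as nn

# Hypothetical projection module: maps a pretrained model's latent space
# (embedding_dim) into the latent dimension used downstream (latent_dim).
class EmbeddingProjection(nn.Module):
    def __init__(self, embedding_dim: int, latent_dim: int):
        super().__init__()
        self.linear = nn.Linear(embedding_dim, latent_dim)

    def forward(self, drug_emb: torch.Tensor) -> torch.Tensor:
        # drug_emb: (batch_size, embedding_dim) precomputed embeddings
        return self.linear(drug_emb)
```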

For this, we would have to adjust:

siboehm commented 2 years ago

I agree with the three thoughts. Re 3.): Yes, we have to change this. I've already adapted the SigmoidDoser; the MLP still needs to be changed.