leojklarner / gauche

A Library for Gaussian Processes in Chemistry
https://leojklarner.github.io/gauche/
MIT License

[Question] String kernels: Which is which #71

Closed benjamc closed 1 month ago

benjamc commented 1 month ago

Hi, in the paper it says you have implemented the SMILES string kernel [Cao et al., 2012] and the subset string kernel [Moss et al., 2020]. In the code the only string kernel is called SubsequenceStringKernel and refers to the Moss paper and to Beck et al. [2017]. Can you confirm that the kernel in the code is the subset string kernel and not the SMILES string kernel? And what happened to the latter? Best wishes and cool library!

Ryan-Rhys commented 1 month ago

Hi @benjamc,

The kernel in the code is indeed the subset string kernel from Moss et al. 2020!

The SMILES string kernel is (somewhat confusingly) referenced as "bag of SMILES" or "bag of amino acids" in the notebooks:

  1. https://github.com/leojklarner/gauche/blob/main/notebooks/GP%20Regression%20on%20Molecules.ipynb
  2. https://github.com/leojklarner/gauche/blob/main/notebooks/Protein%20Fitness%20Prediction%20-%20Bag%20of%20Amino%20Acids.ipynb

We don't have a bespoke implementation for the SMILES string kernel since the n-gram featurisation can be computed upfront and the Tanimoto kernel applied. Opening a PR now to reference Cao et al. 2012 in the code because it's indeed very unclear that the SMILES string kernel is equivalent to "bag of SMILES"!
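
To illustrate the "featurise upfront, then apply Tanimoto" idea, here is a minimal standard-library sketch. It is not GAUCHE's code: the helper names `ngram_counts`, `tanimoto`, and `gram_matrix` are hypothetical, and the use of contiguous character n-grams with `n = 3` is an assumption made for the example. The Tanimoto similarity is taken in its usual dot-product form, k(x, y) = <x, y> / (|x|^2 + |y|^2 - <x, y>).

```python
from collections import Counter


def ngram_counts(smiles: str, n: int = 3) -> Counter:
    """Count every contiguous character n-gram in a SMILES string.

    Hypothetical helper: character-level n-grams are an assumption here.
    Strings shorter than n yield an empty Counter.
    """
    return Counter(smiles[i:i + n] for i in range(len(smiles) - n + 1))


def tanimoto(a: Counter, b: Counter) -> float:
    """Tanimoto similarity <a, b> / (|a|^2 + |b|^2 - <a, b>) on sparse counts."""
    # Counter returns 0 for missing keys, so only shared n-grams contribute.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sum(c * c for c in a.values())
    norm_b = sum(c * c for c in b.values())
    denom = norm_a + norm_b - dot
    # Convention chosen for this sketch: two empty vectors count as identical.
    return dot / denom if denom > 0 else 1.0


def gram_matrix(smiles_list, n: int = 3):
    """Featurise once, then evaluate the kernel on every pair of molecules."""
    feats = [ngram_counts(s, n) for s in smiles_list]
    return [[tanimoto(x, y) for y in feats] for x in feats]


if __name__ == "__main__":
    molecules = ["CCO", "CCN", "c1ccccc1"]  # ethanol, ethylamine, benzene
    for row in gram_matrix(molecules, n=2):
        print(row)
```

Because the n-gram counts are computed once per molecule, the kernel itself only ever sees fixed-length (here sparse) count vectors, which is why a separate string-kernel class isn't needed for this case.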

Linking in @henrymoss and @aryandeshwal to see if they have any further thoughts!

Ryan-Rhys commented 1 month ago

I've now updated the notebooks and added a README to the string kernel directory that hopefully makes this clearer!

benjamc commented 1 month ago

Hi @Ryan-Rhys, thanks for the quick answer and clarification! 🌞