Closed benjamc closed 1 month ago
Hi @benjamc,
The kernel in the code is indeed the subset string kernel from Moss et al. 2020!
The SMILES string kernel is (somewhat confusingly) referred to as "bag of SMILES" or "bag of amino acids" in the notebooks.
We don't have a bespoke implementation for the SMILES string kernel since the n-gram featurisation can be computed upfront and the Tanimoto kernel applied. Opening a PR now to reference Cao et al. 2012 in the code because it's indeed very unclear that the SMILES string kernel is equivalent to "bag of SMILES"!
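For anyone reading along, here is a minimal sketch of what "compute the n-gram featurisation upfront and apply the Tanimoto kernel" could look like in plain Python. It is not the library's implementation: the helper names `bag_of_ngrams` and `tanimoto` are hypothetical, the n-grams are taken over raw characters (so multi-character SMILES tokens such as `Cl` are split, which is a simplification), and the Tanimoto similarity is the standard count-vector form `<x, y> / (||x||^2 + ||y||^2 - <x, y>)`.

```python
from collections import Counter


def bag_of_ngrams(smiles: str, n: int = 2) -> Counter:
    """Count every contiguous character n-gram in a string.

    Assumption: character-level n-grams; a real SMILES tokeniser
    would keep tokens such as 'Cl' or 'Br' together.
    """
    return Counter(smiles[i:i + n] for i in range(len(smiles) - n + 1))


def tanimoto(x: Counter, y: Counter) -> float:
    """Tanimoto similarity between two n-gram count vectors."""
    # Inner product over the n-grams of x (missing n-grams in y count as 0).
    dot = sum(count * y[gram] for gram, count in x.items())
    norm_x = sum(count * count for count in x.values())
    norm_y = sum(count * count for count in y.values())
    denom = norm_x + norm_y - dot
    # Convention chosen for this sketch: two empty bags are treated as identical.
    return dot / denom if denom else 1.0


if __name__ == "__main__":
    ethanol = bag_of_ngrams("CCO")
    ethylamine = bag_of_ngrams("CCN")
    print(tanimoto(ethanol, ethanol))
    print(tanimoto(ethanol, ethylamine))
```

Because the featurisation is just a fixed count vector per molecule, any kernel that accepts such vectors can be applied afterwards, which is why no dedicated string-kernel class is needed for this case.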
Linking in @henrymoss and @aryandeshwal to see if they have any further thoughts!
I've now updated the notebooks and added a README to the string kernel directory that hopefully makes this clearer!
Hi @Ryan-Rhys, thanks for the quick answer and clarification! :sun_with_face:
Hi, in the paper it says you have implemented the SMILES string kernel [Cao et al., 2012] and the subset string kernel [Moss et al., 2020]. In the code the only string kernel is called `SubsequenceStringKernel` and refers to the Moss paper and to Beck et al. [2017]. Can you confirm that the kernel in the code is the subset string kernel and not the SMILES string kernel? And what happened to the latter? Best wishes and cool library!