kalininalab / DataSAIL

DataSAIL is a tool to split datasets while reducing information leakage.
https://datasail.readthedocs.io
MIT License

Input data structure for molecule #10

Closed EasternCaveMan closed 6 months ago

EasternCaveMan commented 8 months ago

Hi Roman,

I've noticed that DataSAIL doesn't currently accept InChI as input for molecules. However, I'm wondering if I could convert the InChI to ECFP and then provide it as input to DataSAIL? It seems that converting InChI to SMILES is quite risky, as indicated in these discussions: https://github.com/rdkit/rdkit/issues/542 https://sourceforge.net/p/rdkit/mailman/message/36696861/

It would significantly enhance usability if DataSAIL could also accept ECFP as input for molecules, making it more universal.

Old-Shatterhand commented 8 months ago

Yes, fingerprint input may be an idea. You can also calculate the similarity matrix yourself and pass it with the --e-sim argument, as in this example.
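For readers landing here, a minimal sketch of that workaround, going from InChI to ECFP-style fingerprints to a Tanimoto similarity matrix without passing through SMILES. The helper names (`ecfp_bits_from_inchi`, `similarity_matrix`, `to_tsv`) are hypothetical and not part of DataSAIL. The RDKit import is deferred so that the similarity part runs without RDKit installed. The TSV layout (names in the header row and first column) is an assumption, so check the DataSAIL documentation for the format `--e-sim` actually expects.

```python
from typing import Dict, List, Set, Tuple


def ecfp_bits_from_inchi(inchi: str, radius: int = 2, n_bits: int = 2048) -> Set[int]:
    """Return the on-bit indices of a Morgan (ECFP-like) fingerprint.

    Requires RDKit; imported lazily so the rest of the sketch runs without it.
    radius=2 corresponds to what is commonly called ECFP4.
    """
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromInchi(inchi)
    if mol is None:
        raise ValueError(f"RDKit could not parse InChI: {inchi!r}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return set(fp.GetOnBits())


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity of two on-bit sets."""
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / union


def similarity_matrix(fps: Dict[str, Set[int]]) -> Tuple[List[str], List[List[float]]]:
    """Build a symmetric all-vs-all similarity matrix, ordered by sorted name."""
    names = sorted(fps)
    matrix = [[tanimoto(fps[r], fps[c]) for c in names] for r in names]
    return names, matrix


def to_tsv(names: List[str], matrix: List[List[float]]) -> str:
    """Serialize the matrix as TSV (assumed layout: names as header and first column)."""
    lines = ["\t".join(["ID", *names])]
    for name, row in zip(names, matrix):
        lines.append("\t".join([name, *(f"{v:.4f}" for v in row)]))
    return "\n".join(lines) + "\n"
```

With RDKit available, the fingerprints could be collected as `fps = {name: ecfp_bits_from_inchi(inchi) for name, inchi in my_inchis.items()}`, where `my_inchis` is your own mapping of molecule names to InChI strings. The output of `to_tsv` can then be written to a file and passed to `--e-sim`.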

Old-Shatterhand commented 6 months ago

This has been solved with commit 35954ce and in PR #22.