johnmay / efficient-bits

Self-contained projects for code examples from efficientbits.blogspot.com posts.
BSD 2-Clause "Simplified" License

Request: respect original SMILES index in fp-idx #2

Open tantrev opened 5 years ago

tantrev commented 5 years ago

It would be really nice if the fp-idx tools retained the same index number as the original SMILES input. I've run into a problem where skipped SMILES strings are not counted, which consequently messes up the indexing between fp-idx's downstream .fps and .idx files. Keeping the original index would be especially useful for getting the original SMILES string back from similarity searches.
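
To make the request concrete, here is a minimal Python sketch of the behaviour being asked for. It is not the fp-idx code. `looks_parsable` is a hypothetical stand-in for a real SMILES parser, and `index_records` is an illustrative name. The point is only that the counter advances for every input line, so a skipped record leaves a gap instead of shifting every later index.

```python
from typing import Callable, Iterable, List, Tuple


def looks_parsable(smiles: str) -> bool:
    """Hypothetical stand-in for a real SMILES parser.

    It only checks that brackets and parentheses are balanced, which is
    enough to simulate some records being skipped.
    """
    paren_depth = 0
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            if in_bracket:
                return False
            in_bracket = True
        elif ch == "]":
            if not in_bracket:
                return False
            in_bracket = False
        elif ch == "(":
            paren_depth += 1
        elif ch == ")":
            paren_depth -= 1
            if paren_depth < 0:
                return False
    return bool(smiles) and paren_depth == 0 and not in_bracket


def index_records(
    lines: Iterable[str],
    is_valid: Callable[[str], bool] = looks_parsable,
) -> List[Tuple[int, str]]:
    """Return (original_index, smiles) pairs for the records that were kept.

    original_index is the zero-based position of the line in the input,
    counted whether or not the line was skipped.
    """
    kept = []
    for original_index, line in enumerate(lines):
        fields = line.split()
        if not fields or not is_valid(fields[0]):
            continue  # skipped, but enumerate() has already counted it
        kept.append((original_index, fields[0]))
    return kept
```

With this layout, a hit that reports index 2 still refers to the third line of the input file, even if the second line was rejected.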

johnmay commented 5 years ago

Hi, I wrote this code as part of a blog post mainly to show what a good baseline looks like. At the time there were some talks/papers coming out with some complex algorithms that claimed to be useful only because the existing implementations were bad.

Feel free to submit patches but if you're mainly just using it I would recommend trying out: http://chemfp.com/.

tantrev commented 5 years ago

Thank you for the quick reply! I'm impressed - your code is really nice, especially for a blog post. If multi-threading were added to this beauty, it'd be really close to a full-fledged solution. :P

Yeah, the problem with Chemfp is that it basically requires OpenEye's software for fast fingerprint generation (the RDKit and OpenBabel implementations look slow as molasses) and the free version of Chemfp doesn't handle that many molecules (the paid version also seems to be a little iffy as to whether it can really support more than 300 million molecules).

Your code actually supports most of my needs and seems to be the best free offering around. I'll see if I can figure out this indexing issue too - I was just wondering if it was something easy for you to fix.

johnmay commented 5 years ago

You can use CDK to generate an FPS and use it with ChemFP.
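
As a rough illustration of what "generate an FPS" involves, the sketch below writes FPS-style text from lists of set bits. It is written in Python rather than with CDK, and it rests on my understanding of the chemfp FPS text layout: a `#FPS1` line, optional `#key=value` header lines, then one record per line as a hex-encoded fingerprint, a tab, and an identifier, with bit 0 stored as the least significant bit of the first byte. The function names are illustrative, and the details should be checked against the chemfp format documentation before relying on them.

```python
from typing import Iterable, List, Tuple


def bits_to_hex(on_bits: Iterable[int], num_bits: int) -> str:
    """Hex-encode a fingerprint given the positions of its set bits.

    Assumes bit i lives in byte i // 8 at position i % 8 (least
    significant bit first).
    """
    data = bytearray((num_bits + 7) // 8)
    for bit in on_bits:
        if not 0 <= bit < num_bits:
            raise ValueError(f"bit {bit} is outside 0..{num_bits - 1}")
        data[bit // 8] |= 1 << (bit % 8)
    return data.hex()


def fps_lines(
    records: Iterable[Tuple[str, Iterable[int]]],
    num_bits: int,
) -> List[str]:
    """Build FPS-style lines from (identifier, on_bits) records."""
    lines = ["#FPS1", f"#num_bits={num_bits}"]
    for identifier, on_bits in records:
        lines.append(f"{bits_to_hex(on_bits, num_bits)}\t{identifier}")
    return lines
```

Using the original input line number as the identifier is one way to get back to the source SMILES after a search, since the identifier travels with each fingerprint record.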

johnmay commented 5 years ago

BTW this is my paid version: https://www.nextmovesoftware.com/arthor.html

tantrev commented 5 years ago

Thank you! Yeah, maybe the ChemFP road with CDK might be viable too - it's just annoying to have to write custom software after paying a licensing fee. You definitely have the best solution with Arthor - it's just going to take a bit of work for my lowly "research assistant" self to raise the necessary funds. :P