Closed sethaxen closed 7 years ago
Perhaps a brief script showing how to load/use e3fp fingerprints with RDKit FingerprintSimilarity? (docs)
To clarify, would you like E3FP fingerprints to be able to be used as input to FingerprintSimilarity in RDKit? I think this would mean adding a method that converts the E3FP Fingerprint class to an RDKit ExplicitBitVect or SparseIntVect, which are their formats for fingerprints.
What I'm suggesting here is a script that explains how fingerprint algebra can be used with the Fingerprint class already written (e.g. bitwise or). It's less extensive than what RDKit provides but requires no conversion.
Your call but it'd likely be the most enabling to go the RDKit route, especially since other parts of the code like conformer generation already use RDKit. @mmysinger, do you know an easy way to load a 'sea-native' or ascii-bitstring format fingerprint into RDKit?
I agree. Alternatively, since conversion to 'sea-native' is going to be split out into a separate repo, is there an easy way to take a simple array of 'on' bit indices and turn it into an RDKit fingerprint? I imagine just initializing an RDKit fingerprint of 0s and then turning specific bits on but figure it's probably more complex than that.
I've probably done this and forgotten how, but in my experience basically any fingerprinting tool can handle the 0 or 1 type bitstrings. They are long and inefficient, but universal. So RDKit should have an "easy" way to convert them to a BitVector.
FPCore does the opposite a couple of times, if you need a starting point for google or doc searching.
I have a pure python and publicly releasable (with SeaChange attribution) version of sea-native to other fingerprint formats if you decide you need it.
Cheers, Michael
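As a rough illustration of the "array of 'on' bit indices" and "0 or 1 type bitstring" representations discussed above, here is a minimal pure-Python sketch. The helper names indices_to_bitstring and bitstring_to_indices are hypothetical and are not part of e3fp, FPCore, or RDKit.

```python
def indices_to_bitstring(indices, nbits):
    """Return a '0'/'1' string of length nbits with the given positions set."""
    chars = ["0"] * nbits
    for i in indices:
        if not 0 <= i < nbits:
            raise ValueError(f"bit index {i} out of range for {nbits} bits")
        chars[i] = "1"
    return "".join(chars)


def bitstring_to_indices(bitstring):
    """Return the sorted positions of the '1' characters in a bitstring."""
    return [i for i, c in enumerate(bitstring) if c == "1"]


bits = indices_to_bitstring([1, 4, 6], 8)
print(bits)                        # 01001010
print(bitstring_to_indices(bits))  # [1, 4, 6]
```

This only makes sense for folded fingerprints; a 2^32-character string per fingerprint is not practical. RDKit's DataStructs module appears to offer CreateFromBitString for loading such a string, but that call is not exercised here.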
Thanks, @mmysinger! I'll check how you create these in FPCore. I'd like to go directly from my fingerprints to RDKit for simplicity, but I'll check with you on the sea-native code if I can't make it work.
I'll add to Fingerprint objects two new methods: Fingerprint.to_rdkit, which just outputs the RDKit style fingerprint, and a new class method Fingerprint.from_rdkit, which instantiates a new Fingerprint from an RDKit fingerprint.
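Sounds good. If I remember right, here's a snippet (from the wiki https://sites.google.com/a/keiserlab.org/wiki/internal/code/snippets, which points to this slack snippet https://keiserlab.slack.com/files/keiser/F029ZTCB4/code_to_generate_rdkit_ecfp_fingerprints.py) to go the other way, from an RDKit-generated fingerprint to a "1010..." ASCII bitstring:
from rdkit import Chem
from rdkit.Chem import AllChem  # AllChem has to be imported explicitly
m1 = Chem.MolFromSmiles('Cc1ccccc1')
fp1 = AllChem.GetMorganFingerprintAsBitVect(m1, 2)  # Morgan fingerprint, radius 2
print(fp1.ToBitString())  # the fingerprint as a string of '0' and '1' characters
More on RDKit ExplicitBitVect here: http://www.rdkit.org/Python_Docs/rdkit.DataStructs.cDataStructs.ExplicitBitVect-class.html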
Alternatively, RDKit also has a base64 encoding similar to sea-native, which could be converted between the two alphabets using string.translate() and maketrans():
RDKIT_CHAR = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
NATIVE_CHR = ".+0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
GLHF
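A minimal sketch of that alphabet swap, using Python 3's str.maketrans and str.translate. The function names are hypothetical, and this only remaps characters one-for-one: whether the underlying bit packing of the two formats actually lines up is an assumption that isn't confirmed in this thread.

```python
RDKIT_CHAR = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
NATIVE_CHR = ".+0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

# Translation tables mapping each character to the one at the same position
# in the other alphabet. Both alphabets have 64 characters.
RDKIT_TO_NATIVE = str.maketrans(RDKIT_CHAR, NATIVE_CHR)
NATIVE_TO_RDKIT = str.maketrans(NATIVE_CHR, RDKIT_CHAR)


def rdkit_to_native(encoded):
    """Remap a standard-base64 string to the sea-native alphabet."""
    return encoded.translate(RDKIT_TO_NATIVE)


def native_to_rdkit(encoded):
    """Remap a sea-native string to the standard base64 alphabet."""
    return encoded.translate(NATIVE_TO_RDKIT)


print(rdkit_to_native("AB/"))  # .+z
```

Characters outside the alphabets (such as "=" padding) pass through unchanged, so they would need separate handling.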
This ended up being more or less trivial to do, so the latest commit e72db0e0d850ad17bc19b9385afd93b40944fc60 has this functionality. I've documented a quirk there, which I'll describe in more detail here. The fingerprints RDKit generates can be no more than 2^32 - 1 bits in length, so only unfolded fingerprints result in bitvectors whose length is not a multiple of 2. If a user of E3FP generates a 32-bit fingerprint (length 2^32) and converts it to RDKit, it'll be one element too long, hence an overflow error. To handle that, when converting to an RDKit fingerprint, indices are taken modulo 2^32 - 1. When generating a fingerprint from an RDKit fingerprint, if the RDKit fingerprint is 2^32 - 1 in length, the corresponding E3FP fingerprint will be length 2^32.
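To make the quirk concrete, here is a pure-Python sketch of the index handling as described above. It is not the code from the commit; to_rdkit_layout and from_rdkit_bits are hypothetical helpers that only mimic the stated rule.

```python
RDKIT_MAX_BITS = 2**32 - 1  # maximum RDKit length, per the description above


def to_rdkit_layout(indices, bits):
    """Return (indices, length) that fit within RDKIT_MAX_BITS.

    Indices are taken modulo RDKIT_MAX_BITS only when the fingerprint
    is longer than RDKit allows.
    """
    if bits <= RDKIT_MAX_BITS:
        return sorted(set(indices)), bits
    return sorted({i % RDKIT_MAX_BITS for i in indices}), RDKIT_MAX_BITS


def from_rdkit_bits(length):
    """Return the E3FP length for an RDKit fingerprint of the given length."""
    return 2**32 if length == RDKIT_MAX_BITS else length


idx, n = to_rdkit_layout([0, 5, 2**32 - 1], 2**32)
print(idx, n)              # [0, 5] 4294967295
print(from_rdkit_bits(n))  # 4294967296
```

Under this rule only the very last bit of a length-2^32 fingerprint moves (it wraps to index 0), so the round trip is lossless unless that bit is set.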
Great. I think that closes this, right? BTW, you might like this: If you put something like "fix #x" or "close #x" into the commit message, it'll automatically reference issue #x and close it (see Mastering Issues).
Not quite. This issue was originally about an example with fingerprint algebra, and I still think that's a good idea.
But closing an issue with a commit is so cool!
Also, fingerprint/database comparison examples from e3fp.fingerprint.metrics
Dear developers,
I am Kinya Toda from MOLSIS Inc. in Japan. We are the distributor of MOE in Japan. One of our customers, a Japanese pharmaceutical company, is interested in E3FP. To support our customer, we tried the E3FP program. Although we got the .fps file, we could not understand the .fps format. Would you let us know the details of the .fps format?
Many thanks, Kinya
@goldnight, we appreciate the interest in E3FP and your question. The .fps file contains a FingerprintDatabase object. The FingerprintDatabase wraps the fingerprints stored in a SciPy sparse csr_matrix for efficient memory use, provides an interface to E3FP's Fingerprint objects, and enables batch fingerprint manipulation.
Here is an example code snippet that loads a database, accesses its underlying array, extracts individual fingerprints, generates a copy of the database folded to a smaller number of bits, and computes pairwise Tanimoto coefficients:
>>> import e3fp.fingerprint.db as db
>>> import e3fp.fingerprint.metrics as metrics
>>> fpdb = db.FingerprintDatabase.load("db_file.fps")
>>> print(fpdb)
FingerprintDatabase[name: None fp_type: Fingerprint level: 5 bits: 4294967296 fp_num: 3]
>>> fpdb.array
<3x4294967296 sparse matrix of type '<type 'numpy.bool_'>'
with 135 stored elements in Compressed Sparse Row format>
>>> fpdb[0]
Fingerprint(indices=array([236364480, 285017645, 486993694, 526356276, 563733172, 714291194, 1355662222, 1377583001, 2146727905, 2219502143, 2453405156, 2523886517, 2738552880, 3079013085, 3471456286, 3560653707, 3588767207, 3654310210, 3982335018, 4222155853]), level=5, bits=4294967296, name=CHEMBL2110918-0_0)
>>> fpdb["CHEMBL270807-0_0"]
[Fingerprint(indices=array([89404292, 236364480, 486993694, 526356276, 532215124, 563733172, 797308679, 848946638, 1291267363, 1439541318, 1654859292, 1748414490, 1754436793, 1806192060, 1809343852, 2033263744, 2151417394, 2257240192, 2279614144, 2294967576, 2330626290, 2420804073, 2453405156, 2738552880, 2782322589, 3493093279, 3533693651, 3560653707, 3654310210, 3871898055, 3947101849, 4072641186, 4218183267]), level=5, bits=4294967296, name=CHEMBL270807-0_0)]
>>> fpdb_folded = fpdb.fold(1024)
>>> print(fpdb_folded)
FingerprintDatabase[name: None fp_type: Fingerprint level: 5 bits: 1024 fp_num: 3]
>>> metrics.tanimoto(fpdb_folded)
array([[ 1. , 0.16216216, 0.0617284 ],
[ 0.16216216, 1. , 0.08536585],
[ 0.0617284 , 0.08536585, 1. ]])
The development version of e3fp uses a file format called .fpz that can be saved and loaded much more efficiently, which is important when the database is very large. More in-depth documentation is coming soon. In the meantime, please don't hesitate to contact us with questions.
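For readers who want to see what folding and a Tanimoto coefficient do to the on-bit indices, here is a small pure-Python sketch. It assumes folding is a simple modulo of each index by the new bit length, which may not match e3fp's implementation exactly; fold and tanimoto here are stand-in functions, not the e3fp API.

```python
def fold(indices, bits):
    """Fold on-bit indices into a shorter fingerprint (assumed: index % bits)."""
    return sorted({i % bits for i in indices})


def tanimoto(a, b):
    """Tanimoto coefficient: shared on bits divided by total distinct on bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


fp_a = [3, 1027, 2050]
fp_b = [2, 9, 1033]

print(tanimoto(fp_a, fp_b))      # 0.0 (no bits shared before folding)
folded_a = fold(fp_a, 1024)      # [2, 3]
folded_b = fold(fp_b, 1024)      # [2, 9]
print(tanimoto(folded_a, folded_b))
```

Folding can create collisions, so similarity values computed on folded fingerprints can differ from those on the unfolded ones, as the toy example shows.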
Dear Dr. Axen,
Thank you for the quick response. I will try making the scripts according to your sample code.
Thanks, Kinya
None of the existing examples explain the useful fingerprint algebra that can be done. This should be added as an example or to the README.
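As a starting point for such an example, the idea behind fingerprint algebra can be sketched with plain Python sets of on-bit indices standing in for Fingerprint objects. The thread only confirms that the Fingerprint class supports operations such as bitwise or; the exact operators it exposes are not shown here, so treat this as a conceptual illustration rather than e3fp usage.

```python
# Two toy fingerprints, represented by their on-bit indices.
fp_a = frozenset([1, 4, 6])
fp_b = frozenset([4, 6, 9])

bits_or = sorted(fp_a | fp_b)    # bits on in either fingerprint
bits_and = sorted(fp_a & fp_b)   # bits on in both fingerprints
bits_xor = sorted(fp_a ^ fp_b)   # bits on in exactly one fingerprint
only_a = sorted(fp_a - fp_b)     # bits on in fp_a but not in fp_b

print(bits_or)   # [1, 4, 6, 9]
print(bits_and)  # [4, 6]
print(bits_xor)  # [1, 9]
print(only_a)    # [1]
```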