keiserlab / e3fp

3D molecular fingerprints
GNU Lesser General Public License v3.0

Add examples for simple "fingerprint algebra" #3

Closed sethaxen closed 7 years ago

sethaxen commented 9 years ago

None of the existing examples explain the useful fingerprint algebra that can be done. This should be added as an example or to the README.

mjke commented 9 years ago

Perhaps a brief script showing how to load/use e3fp fingerprints with RDKit FingerprintSimilarity? (docs)

sethaxen commented 9 years ago

To clarify, would you like E3FP fingerprints to be able to be used as input to FingerprintSimilarity in RDKit? I think this would mean adding a method that converts the E3FP Fingerprint class to an RDKit ExplicitBitVect or SparseIntVect, which are their formats for fingerprints.

What I'm suggesting here is a script that explains how fingerprint algebra (e.g. bitwise or) can be used with the Fingerprint class as already written. It's less extensive than what RDKit provides but requires no conversion.
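To make "fingerprint algebra" concrete, here is a minimal sketch that models each fingerprint as a set of "on" bit indices and combines them with set operators. It deliberately does not use the E3FP Fingerprint class; the index values and the `tanimoto` helper are hypothetical and only illustrate the idea.

```python
# Fingerprints modeled as sets of "on" bit indices (hypothetical values).
fp_a = frozenset({3, 17, 42, 256})
fp_b = frozenset({17, 42, 99})

union = fp_a | fp_b        # bitwise OR: bits set in either fingerprint
common = fp_a & fp_b       # bitwise AND: bits set in both fingerprints
differing = fp_a ^ fp_b    # bitwise XOR: bits set in exactly one fingerprint


def tanimoto(a, b):
    """Tanimoto coefficient |a AND b| / |a OR b|; 0.0 if both are empty."""
    either = len(a | b)
    return len(a & b) / either if either else 0.0


print(sorted(union))
print(sorted(common))
print(sorted(differing))
print(tanimoto(fp_a, fp_b))
```

Sets keep the sketch independent of fingerprint length, which matters for unfolded fingerprints where a dense bit array would be impractical.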

mjke commented 9 years ago

Your call but it'd likely be the most enabling to go the RDKit route, especially since other parts of the code like conformer generation already use RDKit. @mmysinger, do you know an easy way to load a 'sea-native' or ascii-bitstring format fingerprint into RDKit?

sethaxen commented 9 years ago

I agree. Alternatively, since conversion to 'sea-native' is going to be split out into a separate repo, is there an easy way to take a simple array of 'on' bit indices and turn it into an RDKit fingerprint? I imagine just initializing an RDKit fingerprint of 0s and then turning specific bits on but figure it's probably more complex than that.

mmysinger commented 9 years ago

I've probably done this and forgotten how, but in my experience basically any fingerprinting tool can handle the 0 or 1 type bitstrings. They are long and inefficient, but universal. So RDKit should have an "easy" way to convert them to a BitVector.
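As a rough illustration of the "0 or 1" bitstring route, the sketch below converts an array of "on" bit indices to an ASCII bitstring and back using only the standard library. The function names are hypothetical, and the convention that index 0 is the leftmost character is an assumption that should be checked against whichever tool consumes the string.

```python
def indices_to_bitstring(indices, length):
    """Return a '0'/'1' string of the given length with the listed bits on.

    Assumes index 0 maps to the leftmost character.
    """
    bits = ["0"] * length
    for i in indices:
        if not 0 <= i < length:
            raise ValueError(f"bit index {i} is outside [0, {length})")
        bits[i] = "1"
    return "".join(bits)


def bitstring_to_indices(bitstring):
    """Return the sorted positions of the '1' characters in a bitstring."""
    return [i for i, char in enumerate(bitstring) if char == "1"]


print(indices_to_bitstring([0, 2, 5], 8))
print(bitstring_to_indices("10100100"))
```

This is only practical for folded fingerprints; a bitstring for a 2^32-bit fingerprint would be several gigabytes long.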

FPCore does the opposite a couple of times, if you need a starting point for google or doc searching.

I have a pure python and publicly releasable (with SeaChange attribution) version of sea-native to other fingerprint formats if you decide you need it.

Cheers, Michael


sethaxen commented 9 years ago

Thanks, @mmysinger! I'll check how you create these in FPCore. I'd like to go directly from my fingerprints to RDKit for simplicity, but I'll check with you on the sea-native code if I can't make it work.

I'll add to Fingerprint objects two new methods: Fingerprint.to_rdkit which just outputs the RDKit style fingerprint, and a new class method Fingerprint.from_rdkit which instantiates a new Fingerprint from an RDKit fingerprint.

mjke commented 9 years ago

Sounds good. If I remember right, here's a snippet (from the wiki, which points to this slack snippet) to go the other way, from an RDKit-generated fingerprint to a "1010..." ASCII bitstring:

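from rdkit import Chem
from rdkit.Chem import AllChem  # AllChem must be imported explicitly
m1 = Chem.MolFromSmiles('Cc1ccccc1')
# Morgan fingerprint with radius 2, returned as an ExplicitBitVect
fp1 = AllChem.GetMorganFingerprintAsBitVect(m1, 2)
print(fp1.ToBitString())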

More on RDKit ExplicitBitVect here: http://www.rdkit.org/Python_Docs/rdkit.DataStructs.cDataStructs.ExplicitBitVect-class.html

mmysinger commented 9 years ago

Alternatively rdkit also has a base64 encoding similar to sea-native, which could be converted using string.translate() and maketrans()

RDKIT_CHAR = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

NATIVE_CHR = ".+0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

GLHF
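A minimal sketch of the character translation suggested above, written for Python 3, where `str.maketrans` and `str.translate` replace the Python 2 `string.maketrans`. The two alphabets are copied from the comment; the helper names are hypothetical. This only swaps characters between the alphabets. Whether the underlying bit packing of the two encodings actually matches is an assumption that has not been verified here.

```python
RDKIT_CHAR = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
NATIVE_CHR = ".+0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

# Both alphabets have 64 characters, so the mapping is one-to-one.
RDKIT_TO_NATIVE = str.maketrans(RDKIT_CHAR, NATIVE_CHR)
NATIVE_TO_RDKIT = str.maketrans(NATIVE_CHR, RDKIT_CHAR)


def rdkit_to_native(encoded):
    """Map each RDKit base64 character to its sea-native counterpart."""
    return encoded.translate(RDKIT_TO_NATIVE)


def native_to_rdkit(encoded):
    """Map each sea-native character back to the RDKit base64 alphabet."""
    return encoded.translate(NATIVE_TO_RDKIT)


print(rdkit_to_native("AB+/"))
print(native_to_rdkit(rdkit_to_native("AB+/")))
```

Characters outside the alphabets, such as base64 `=` padding, pass through unchanged.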


sethaxen commented 9 years ago

This ended up being more or less trivial to do, so the latest commit e72db0e0d850ad17bc19b9385afd93b40944fc60 has this functionality. I've documented a quirk there, which I'll describe in more detail here. RDKit fingerprints can be no longer than 2^32 - 1 bits, so an unfolded fingerprint is the only case that results in a bit vector whose length is not a multiple of 2. If an E3FP user generates a 32-bit fingerprint (length 2^32) and converts it to RDKit, it will be one element too long, which causes an overflow error. To handle that, when converting to an RDKit fingerprint, indices are taken modulo 2^32 - 1. When generating a fingerprint from an RDKit fingerprint that is 2^32 - 1 in length, the corresponding E3FP fingerprint will have length 2^32.
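The length handling described above can be sketched in plain Python as follows. This is an illustration of the stated rule, not the code from the commit; the function and constant names are hypothetical, and no RDKit calls are made.

```python
RDKIT_MAX_BITS = 2**32 - 1      # longest fingerprint RDKit accepts, per the comment
E3FP_UNFOLDED_BITS = 2**32      # length of an unfolded 32-bit E3FP fingerprint


def to_rdkit_indices(indices, bits):
    """Return (indices, length) that fit within RDKit's length limit.

    If the fingerprint is too long, indices are taken modulo 2^32 - 1,
    so the single out-of-range index 2^32 - 1 wraps around to 0.
    """
    if bits > RDKIT_MAX_BITS:
        wrapped = sorted({i % RDKIT_MAX_BITS for i in indices})
        return wrapped, RDKIT_MAX_BITS
    return sorted(indices), bits


def from_rdkit_length(rdkit_bits):
    """Return the E3FP length corresponding to an RDKit fingerprint length."""
    if rdkit_bits == RDKIT_MAX_BITS:
        return E3FP_UNFOLDED_BITS
    return rdkit_bits


print(to_rdkit_indices([5, 2**32 - 1], 2**32))
print(from_rdkit_length(2**32 - 1))
```

One consequence of the modulo step is that a bit at index 2^32 - 1 becomes indistinguishable from a bit at index 0, so the round trip is lossy for that one position.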

mjke commented 9 years ago

Great. I think that closes this, right? BTW, you might like this: If you put something like "fix #x" or "close #x" into the commit message, it'll automatically reference issue #x and close it (see Mastering Issues).

sethaxen commented 9 years ago

Not quite. This issue was originally about an example with fingerprint algebra, and I still think that's a good idea.

But closing an issue with a commit is so cool!

sethaxen commented 7 years ago

Also, we should add fingerprint/database comparison examples using e3fp.fingerprint.metrics.

goldnight commented 7 years ago

Dear developers,

I am Kinya Toda from MOLSIS Inc. in Japan. We are the distributor of MOE in Japan. One of our customers, a Japanese pharmaceutical company, is interested in E3FP. To support this customer, we tried the E3FP program. Although we obtained the .fps file, we could not understand the .fps format. Would you let us know the details of the .fps format?

Many thanks, Kinya

sethaxen commented 7 years ago

@goldnight, We appreciate the interest in E3FP and your question. The .fps file contains a FingerprintDatabase object. The FingerprintDatabase wraps the fingerprints stored in a SciPy sparse csr_matrix for efficient memory use, provides an interface to E3FP's Fingerprint objects, and enables batch fingerprint manipulation.

Here is an example code snippet that loads a database, accesses its underlying array, extracts individual fingerprints, generates a copy of the database folded to a smaller number of bits, and computes pairwise Tanimoto coefficients:

>>> import e3fp.fingerprint.db as db
>>> import e3fp.fingerprint.metrics as metrics
>>> fpdb = db.FingerprintDatabase.load("db_file.fps")
>>> print(fpdb)
FingerprintDatabase[name: None  fp_type: Fingerprint  level: 5  bits: 4294967296  fp_num: 3]
>>> fpdb.array
<3x4294967296 sparse matrix of type '<type 'numpy.bool_'>'
    with 135 stored elements in Compressed Sparse Row format>
>>> fpdb[0]
Fingerprint(indices=array([236364480, 285017645, 486993694, 526356276, 563733172, 714291194, 1355662222, 1377583001, 2146727905, 2219502143, 2453405156, 2523886517, 2738552880, 3079013085, 3471456286, 3560653707, 3588767207, 3654310210, 3982335018, 4222155853]), level=5, bits=4294967296, name=CHEMBL2110918-0_0)
>>> fpdb["CHEMBL270807-0_0"]
[Fingerprint(indices=array([89404292, 236364480, 486993694, 526356276, 532215124, 563733172, 797308679, 848946638, 1291267363, 1439541318, 1654859292, 1748414490, 1754436793, 1806192060, 1809343852, 2033263744, 2151417394, 2257240192, 2279614144, 2294967576, 2330626290, 2420804073, 2453405156, 2738552880, 2782322589, 3493093279, 3533693651, 3560653707, 3654310210, 3871898055, 3947101849, 4072641186, 4218183267]), level=5, bits=4294967296, name=CHEMBL270807-0_0)]
>>> fpdb_folded = fpdb.fold(1024)
>>> print(fpdb_folded)
FingerprintDatabase[name: None  fp_type: Fingerprint  level: 5  bits: 1024  fp_num: 3]
>>> metrics.tanimoto(fpdb_folded)
array([[ 1.        ,  0.16216216,  0.0617284 ],
       [ 0.16216216,  1.        ,  0.08536585],
       [ 0.0617284 ,  0.08536585,  1.        ]])
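For readers who want to see what folding and a pairwise Tanimoto matrix compute, here is a small standard-library sketch. It does not use the e3fp API. Folding by taking each index modulo the new length is an assumption about how folding works, and the function names and index values are hypothetical.

```python
def fold(indices, bits):
    """Fold sparse bit indices to a shorter length (assumed: index modulo bits)."""
    return frozenset(i % bits for i in indices)


def pairwise_tanimoto(fingerprints):
    """Return a symmetric matrix of Tanimoto coefficients as nested lists."""
    def coefficient(a, b):
        either = len(a | b)
        return len(a & b) / either if either else 0.0

    return [[coefficient(a, b) for b in fingerprints] for a in fingerprints]


# Hypothetical unfolded fingerprints, given as sets of "on" bit indices.
unfolded = [{1, 1025, 7}, {7, 2057}, {4096}]
folded = [fold(fp, 1024) for fp in unfolded]

for row in pairwise_tanimoto(folded):
    print(row)
```

Folding can merge distinct bits (here 1 and 1025 collide), which is why folded similarities may differ from those computed on the unfolded fingerprints.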

The development version of e3fp uses a file format called .fpz that can be saved and loaded much more efficiently, which is important when the database is very large. More in-depth documentation is coming soon. In the meantime, please don't hesitate to contact us with questions.

goldnight commented 7 years ago

Dear Dr. Axen,

Thank you for the quick response. I will try writing scripts based on your sample code.

Thanks, Kinya