openbabel / openbabel

Open Babel is a chemical toolbox designed to speak the many languages of chemical data.
http://openbabel.org/
GNU General Public License v2.0

Do fingerprints change between OpenBabel releases? #2581

Closed drjahu closed 1 year ago

drjahu commented 1 year ago

Dear Open Babel team! I am working on a project that stores molecular structures in a database along with different types of fingerprints (MACCS, FP2, ECFP4 and ECFP6). We aim to select subsets of chemical classes and to project chemical substances, together with activity data, into a 2D projection of the chemical space. All chemical information is processed using Open Babel, and our pipeline shows very nice results. However, I recently noticed that the fingerprints for approximately 1% of 50 million compounds differ between the older values stored in the database and a fresh computation using the current release, Open Babel 3.1.1. I should add that the fingerprints have been computed since 2019, across several years, different Open Babel releases (2.4.0, 3.0.0, 3.1.1), and different computers. My questions are:

A) Do I need to expect the fingerprints of one chemical structure to change between releases of Open Babel?

B) Are fingerprints sensitive to whether or not --gen2D or --gen3D is applied to the OBMol structure?

C) As the chemical structures are derived from many different sources in several file formats (mol, sdf, smiles, inchi, ...), does the InChI differ between a structure loaded purely from the source and a structure where --gen2D or --gen3D was applied? What would you recommend to harmonise structures from different sources?

Thank you very much for your time and effort! I am looking forward to a reply, and any links to literature are highly appreciated. Best regards, Jan

fredrikw commented 1 year ago

Hello,

A) Yes, you should expect that fingerprints MAY change. While the Open Babel code is rather stable, a bug fix or other change can always affect some part of the underlying molecule data structure and hence the fingerprint. In general, it is best to stick to one version of the cheminformatics software.

B) They should not be sensitive to the --gen2D/--gen3D options.

C) I don't quite follow the InChI part of the question, but InChI generation should not be sensitive to how the molecule is loaded (unless it comes from an underspecified format such as PDB, where guessing the bonding might lead to a different molecule being perceived). Normalizing through InChI might be a good option, but there are a lot of things to consider before deciding on a normalization scheme.
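One way to act on the advice in A) is to store the toolkit version next to every fingerprint, so that records computed with an older release can be found and recomputed. The following is only a minimal sketch of that bookkeeping, not part of Open Babel: `FingerprintRecord`, `make_record`, `find_stale`, `find_changed` and the `compute_bits` callable are all hypothetical names introduced for illustration, and the fingerprint function is passed in rather than tied to a particular toolkit call.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List

# Hypothetical signature: (structure, fp_type) -> indices of the set bits.
ComputeBits = Callable[[str, str], Iterable[int]]


@dataclass(frozen=True)
class FingerprintRecord:
    """One stored fingerprint plus the toolkit version that produced it."""
    structure: str          # e.g. a SMILES string
    fp_type: str            # e.g. "FP2" or "MACCS"
    toolkit_version: str    # e.g. "3.1.1"
    bits: FrozenSet[int]    # indices of the bits that are set


def make_record(structure: str, fp_type: str, toolkit_version: str,
                compute_bits: ComputeBits) -> FingerprintRecord:
    """Compute a fingerprint and tag it with the version used."""
    bits = frozenset(compute_bits(structure, fp_type))
    return FingerprintRecord(structure, fp_type, toolkit_version, bits)


def find_stale(records: Iterable[FingerprintRecord],
               current_version: str) -> List[FingerprintRecord]:
    """Return records computed with a version other than the current one."""
    return [r for r in records if r.toolkit_version != current_version]


def find_changed(records: Iterable[FingerprintRecord], current_version: str,
                 compute_bits: ComputeBits) -> List[FingerprintRecord]:
    """Recompute stale records and return those whose bits now differ."""
    changed = []
    for record in find_stale(records, current_version):
        new_bits = frozenset(compute_bits(record.structure, record.fp_type))
        if new_bits != record.bits:
            changed.append(record)
    return changed
```

With the Open Babel Python bindings, `compute_bits` could plausibly wrap something like `pybel.readstring("smi", structure).calcfp(fp_type).bits`; treat that as an assumption to verify against the installed release. Recomputing everything with a single version remains the safer route when consistency matters.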

drjahu commented 1 year ago

Dear @fredrikw! Thank you so much for the fast reply! Best regards, Jan