isayev / ANI1_dataset

A data set of 20 million calculated off-equilibrium conformations for organic molecules
MIT License

Potential SMILES/coordinate discrepancy #8

Open max-hoffman opened 6 years ago

max-hoffman commented 6 years ago

Hello,

I was trying to convert the ANI-1 dataset to Parquet format, and I ran into a potential mismatch between the coordinates and the SMILES string of at least one molecule (around 4k conformers).

I wrote a piece of sample code to try to isolate this first issue I ran into (Python 2.7.6 interpreter):

import h5py
from pybel import readstring
import json
import os
import numpy as np
import pandas as pd

ani_path = '.../ani'
shard3 = os.path.join(ani_path, 'ani_gdb_s03.h5')

with h5py.File(shard3, 'r') as f:
    data_dict = f['gdb11_s03/gdb11_s03-11']

    coords     = data_dict['coordinates']
    elements   = data_dict['species']
    energies   = data_dict['energies']
    smi        = ''.join(data_dict['smiles'])

    mol = readstring('smi', smi)
    # pymol_to_json is a helper that is not defined in this snippet
    jmol = json.loads(pymol_to_json(mol))

    if len(jmol['atoms']) != len(elements[:]):
        print "shard: ", shard3
        print "\nmolecule: gdb11_s03/gdb11_s03-11"
        print "\nsmile: ", smi
        print "\nspecies:", elements[:]
        print "\npybel mol:", jmol
        print "\ncoordinates: ", coords.shape

with sample output:

shard:  .../ani_gdb_s03.h5
molecule: gdb11_s03/gdb11_s03-11
smile:  [H]C([H])=NN([H])[H]
species ['O' 'C' 'O' 'H' 'H']
pybel mol {u'atoms': [[1, 0], [6, 0], [1, 0], [7, 0], [7, 0], [1, 0], [1, 0]], u'bonds': [[1, 2, 1], [2, 3, 1], [2, 4, 2], [4, 5, 1], [5, 6, 1], [5, 7, 1]]}
coordinates:  (4320, 5, 3)

Only the file path should need to be edited back in for this to run. I also wrote my own parser instead of using the example code because I was having trouble getting the iteration to behave consistently, so I may have introduced an error there.

I will filter my Parquet files for similar mismatches and go ahead without them for now. If I have made an obvious mistake, or if this has already been identified, I'd still appreciate feedback.
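
A rough sketch of the kind of filter this could be, comparing element counts rather than only the number of atoms. It assumes that the first entry of each pair in the parsed `atoms` list is the atomic number, as the sample output suggests, and that only H, C, N and O occur. The names `ATOMIC_NUMBERS` and `composition_matches` are made up for illustration and are not part of the dataset tooling.

```python
from collections import Counter

# Assumption: only H, C, N and O appear in the species arrays.
ATOMIC_NUMBERS = {'H': 1, 'C': 6, 'N': 7, 'O': 8}


def composition_matches(species, parsed_atoms):
    """Return True if the stored species and the parsed SMILES atoms
    contain the same elements the same number of times."""
    symbols = [s.decode() if isinstance(s, bytes) else s for s in species]
    stored = Counter(ATOMIC_NUMBERS[s] for s in symbols)
    # Assumption: each parsed atom is a pair whose first entry is the atomic number.
    parsed = Counter(atom[0] for atom in parsed_atoms)
    return stored == parsed


# Values copied from the sample output above.
species = ['O', 'C', 'O', 'H', 'H']
parsed_atoms = [[1, 0], [6, 0], [1, 0], [7, 0], [7, 0], [1, 0], [1, 0]]
print(composition_matches(species, parsed_atoms))  # False
```

Comparing element counts also flags entries where the atom count happens to agree but the elements differ, which a plain length check would miss.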

Thanks!