t7morgen / misato-dataset

GNU Lesser General Public License v2.1
159 stars 17 forks source link

(QUESTION) Sequence length mismatch #7

Closed jyaacoub closed 7 months ago

jyaacoub commented 8 months ago

Hi,

I am not sure if I properly understand how the dataset was created and would like some clarification.

Looking through MD.hdf5 file we have a key for atoms_type which can be used to get the residue sequence length by counting the number of occurrences of '6', as per the atoms_type_map.pickle file which maps 6 to CA atom.

When I do this, the count is never the same as the true sequence length from PDB nor does it match the sequence length of the pocket file provided by PDBbind.

For example for 2G6P I get the following for sequence length from the MD.hdf5 file

CA count: 150

# sequence length from PDBbind files:
304 <Chain 2g6p_protein:A>
48 <Chain 2g6p_pocket:A>

code:

MD_fp = f"{misato_dir}/MD.hdf5"
pro_test = ProtDataset(MD_fp, test, transform=GNNTransformMD())
p_id = '2G6P'.lower()
sample = pro_test.f[p_id.upper()]
seq = list(sample['atoms_residue'])
print("CA count:", list(sample['atoms_type']).count(6)) # count of CA

Did I miss something here? Where exactly is this sequence from? Is there some independent binding pocket identification that I am unaware of?

Thanks, Jean

pujaltes commented 7 months ago

Hey Jean,

I had similar issues when first looking into the dataset. As mentioned in #1 the data is provided in AMBER (protein) and gaff2 (ligand) format and care must be taken to convert it to a familiar format. Thankfully, the authors have provided a script for converting the AMBER data to PDB format here.

In your case, the important map dictionary to look at is the nameMap (atoms_name_map_for_pdb.pickle) where you can see that the CA atoms exclusively correspond to CX and vice-versa. See this example:

image

Following your example and looking back to the atoms_type_map.pickle dictionary we can see that CX corresponds to the number 14. With this in mind, you can recover the correct number of CA atoms in the following way:

from datasets import ProtDataset
from transformMD import GNNTransformMD
from h5_to_pdb import get_maps

mapdir = "misato-dataset/data/processing/Maps/"
residueMap, typeMap, nameMap = get_maps(mapdir)

p_id = '2G6P'.lower()
sample = pro_test.f[p_id.upper()]
seq = [residueMap[i] for i in sample['atoms_residue']]
atom_type = [typeMap[i] for i in sample['atoms_type']]  # convert atom_type from int to string representation

print("CA count:", atom_type.count('CX')) # count of CA
jyaacoub commented 7 months ago

Thank you for the reply! I am not too familiar with AMBER, so this was helpful.