Chain type determination fix

RJPenic commented 5 months ago

First of all, big thanks for your work! I am sure this dataset will greatly benefit the field of RNA structure prediction! :)

The current iteration of RNA3DB contains several DNA structures (e.g. 6O9E_E) because of a chain type determination bug in rna3db/parser.py. More precisely, the current RNA3DB pipeline uses the "_chem_comp.type" mmCIF key to determine the type of a chain, and this seemingly causes some DNA chains to be falsely classified as RNAs. I assume this is not intentional (?), so I fixed the bug by determining the chain type with the "_entity_poly.type" key instead.

It is worth noting that, alongside a few DNAs, there are also a few DNA/RNA hybrids in the dataset (e.g. 8SVF_I). I wasn't sure whether these should be removed as well, so the fix does not affect these chains (they aren't filtered out), but this could be changed with minor code adjustments.
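To illustrate the idea, here is a minimal sketch of entity-level filtering. It is not the code from this pull request: the function name `select_rna_entities`, the `include_hybrids` flag, and the plain-dict input are assumptions made for illustration. Only the `_entity_poly.type` values themselves come from the mmCIF dictionary.

```python
# Hypothetical sketch, not the actual rna3db/parser.py change.
# Input: a mapping of entity ID -> _entity_poly.type value, as one might
# obtain from any mmCIF reader (the dict layout is an assumption).

RNA_TYPE = "polyribonucleotide"
HYBRID_TYPE = "polydeoxyribonucleotide/polyribonucleotide hybrid"


def select_rna_entities(entity_poly_types, include_hybrids=True):
    """Return the entity IDs whose _entity_poly.type marks them as RNA.

    DNA/RNA hybrids are kept by default, mirroring the behaviour
    described above; pass include_hybrids=False to drop them.
    """
    selected = []
    for entity_id, poly_type in entity_poly_types.items():
        normalized = poly_type.strip().lower()
        if normalized == RNA_TYPE:
            selected.append(entity_id)
        elif include_hybrids and normalized == HYBRID_TYPE:
            selected.append(entity_id)
    return selected


# Toy entry with four polymer entities (made-up data, not a real PDB entry).
toy_entities = {
    "1": "polypeptide(L)",
    "2": "polydeoxyribonucleotide",
    "3": "polyribonucleotide",
    "4": "polydeoxyribonucleotide/polyribonucleotide hybrid",
}

kept_with_hybrids = select_rna_entities(toy_entities)
kept_without_hybrids = select_rna_entities(toy_entities, include_hybrids=False)
```

With the hybrid flag switched off, only the pure `polyribonucleotide` entity survives, which is the "minor code adjustment" mentioned above.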

I haven't tested the fixes thoroughly, but the test script below seems to work fine:

from rna3db.parser import mmCIFParser
from rna3db.parser import ModificationHandler
from pathlib import Path

parser = mmCIFParser(Path("./6O9E.cif"), ModificationHandler("cc_dict.json"))
print("6O9E RNA chains:")
print(parser.chains)

parser = mmCIFParser(Path("./8SVF.cif"), ModificationHandler("cc_dict.json"))
print("8SVF RNA chains:")
print(parser.chains)

New output:

6O9E RNA chains:
{}
8SVF RNA chains:
{'I': Chain(author_id=I, len=187), 'J': Chain(author_id=J, len=327)}

Old output:

6O9E RNA chains:
{'E': Chain(author_id=E, len=38), 'F': Chain(author_id=F, len=38)}
8SVF RNA chains:
{'I': Chain(author_id=I, len=187), 'J': Chain(author_id=J, len=327)}

marcellszi commented 4 months ago

Hi @RJPenic,

Thanks for your kind words and pull request. I apologise for getting back to you so late.

We are actually aware that a few DNA structures leaked through. However, the behaviour that you describe was originally intentional.

We specifically look for _chem_comp.type to include "RNA linking" residues, so in theory at least one residue in those DNA chains must be "classified as an RNA residue". For example, in 6o9e_F, positions 7 and 9 are OMC, which is classified as "RNA linking". To be honest, I'm not sure why this happens. In this case this is definitely a DNA.

The thought process behind filtering by _chem_comp.type was to make sure we never miss any RNAs. Keeping a few DNA helices probably doesn't really hurt the dataset much, and I'm happy to trade that for making sure we never miss an RNA.
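For readers following along, the residue-level rule can be sketched roughly as follows. This is a simplified illustration, not the actual parser code: the function name, the dict-based inputs, and the substring match are assumptions, and the toy sequence is invented rather than taken from 6o9e_F.

```python
# Hypothetical sketch of residue-level filtering via _chem_comp.type.
# Inputs are plain Python containers; the real parser works on mmCIF data.


def has_rna_linking_residue(residue_names, chem_comp_types):
    """Return True if any residue's _chem_comp.type includes 'RNA linking'.

    The match is a case-insensitive substring test (an assumption), so a
    single modified residue is enough to flag the whole chain.
    """
    return any(
        "rna linking" in chem_comp_types.get(name, "").lower()
        for name in residue_names
    )


# Component types for the toy example. OMC is listed as "RNA linking",
# as noted above; the DNA residues are "DNA linking".
toy_types = {
    "DA": "DNA linking",
    "DC": "DNA linking",
    "DG": "DNA linking",
    "DT": "DNA linking",
    "OMC": "RNA linking",
}

# Invented sequences for illustration only.
dna_with_omc = ["DG", "DC", "OMC", "DA", "DT"]
plain_dna = ["DG", "DC", "DC", "DA", "DT"]

flagged_modified = has_rna_linking_residue(dna_with_omc, toy_types)
flagged_plain = has_rna_linking_residue(plain_dna, toy_types)
```

The sketch shows the trade-off: the rule is very unlikely to miss an RNA chain, but a DNA chain carrying one "RNA linking" modification is let through as well.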

I would be happy to merge this, but I want to understand what is happening better. Do you know if your method ever excludes any chains that may actually be RNA (whether it's due to a mislabelling of _entity_poly.type, etc.)?

RJPenic commented 4 months ago

I downloaded the PDB mmCIF files and parsed them, first with the _chem_comp.type approach and then with the _entity_poly.type approach. I then extracted all structure IDs from the generated JSONs and compared the two sets of IDs. To keep the comparison somewhat readable, I ignored structures with fewer than 32 nucleotides.

Structures found with _chem_comp.type but not with _entity_poly.type (79 structures):

{'8j1q_E', '6ik9_F', '6kdo_F', '6o9e_E', '6kdk_E', '5i42_E', '5hlf_E', '8e9a_D', '7dbn_E', '5hlf_F', '7z2g_E', '6kdm_F', '7ozw_F', '6o9e_F', '7s4x_D', '7p15_F', '6kdj_F', '5i3u_F', '5xn1_E', '7lsk_F', '7z29_E', '4ni9_D', '7dbm_F', '6ika_F', '8j26_E', '7lri_E', '6vpc_D', '6kdo_E', '5hrt_B', '6ik9_E', '6kdm_E', '5tzs_z', '6ika_E', '6vz3_5', '6vmi_A7', '7z2e_E', '5i3u_E', '8e9a_C', '4ni7_B', '7z2d_E', '7lrm_E', '5tzs_A', '7s4v_D', '7lrm_F', '7z24_E', '7lsk_E', '7dbm_E', '8j1q_D', '5hp1_E', '6vug_E', '6kdj_E', '7ml4_T', '7lrx_F', '7jtq_B', '6kdk_F', '6kdn_F', '8j26_D', '6vlz_A7', '6kdn_E', '7lrx_E', '7lry_E', '5d3g_E', '5d3g_F', '7lry_F', '5hro_F', '5xn1_F', '7lyt_D', '5i42_F', '8er8_B', '5tzs_B', '5hp1_F', '7dbn_F', '6wkr_H', '6vmi_A6', '5hro_E', '7jtq_D', '7lri_F', '4ni9_B', '7z2h_E'}

Structures found with _entity_poly.type but not with _chem_comp.type (2 structures):

{'7z1m_T', '7z1m_S'}

Most of the structures found with _chem_comp.type but not with _entity_poly.type are DNA chains with modified residues such as OMC. I didn't notice any RNAs among them. On the other hand, when I used _entity_poly.type, I found two structures that weren't parsed before. These are labeled as RNA-DNA hybrids.
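The comparison described above boils down to two set differences. A minimal sketch follows, assuming each parse result is a mapping from chain ID to sequence; the real layout of the generated JSONs may differ, and the function names and toy IDs here are made up.

```python
# Hypothetical sketch of the ID comparison; data layout is an assumption.


def chain_ids(chains, min_length=32):
    """Return the IDs of chains with at least min_length nucleotides."""
    return {cid for cid, seq in chains.items() if len(seq) >= min_length}


def compare_approaches(chem_comp_chains, entity_poly_chains, min_length=32):
    """Return (only in chem_comp result, only in entity_poly result)."""
    ids_chem_comp = chain_ids(chem_comp_chains, min_length)
    ids_entity_poly = chain_ids(entity_poly_chains, min_length)
    return ids_chem_comp - ids_entity_poly, ids_entity_poly - ids_chem_comp


# Toy results with invented IDs and sequences (not real PDB chains).
toy_chem_comp = {
    "toy1_A": "A" * 40,  # found by both approaches
    "toy2_E": "C" * 38,  # found only by the _chem_comp.type approach
    "toy3_B": "G" * 10,  # too short, ignored
}
toy_entity_poly = {
    "toy1_A": "A" * 40,
    "toy4_T": "U" * 50,  # found only by the _entity_poly.type approach
}

only_chem_comp, only_entity_poly = compare_approaches(toy_chem_comp, toy_entity_poly)
```

In practice, the two dictionaries would be loaded from the JSON files produced by each parsing run, for example with `json.load`.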

marcellszi commented 2 days ago

Merged in anticipation of new release.

Thanks again for the pull request.

marcellszi / rna3db

Chain type determination fix #8