prody / ProDy

A Python Package for Protein Dynamics Analysis
http://prody.csb.pitt.edu

Does ProDy support identifying missing residues from .cif files? #1685

Open jonathanking opened 1 year ago

jonathanking commented 1 year ago

I need to be able to parse .cif files so that I have access to the complete protein chain sequence along with annotations for which residues are unobserved or 0 occupancy.

For example, for a protein sequence WWWGAPGAPGAPWWW where GAPGAPGAP are experimentally unresolved residues, I want to determine the missing sequence mask from this data, e.g. +++---------+++.

Does ProDy have tools to support accessing/constructing this information?

This is what I have so far, where I can parse the .cif file and its header, but I'm unsure how to access missing residue information. Calling .getOccupancies() on the atom group is not what I want either since zero occupancy residues seemingly have not been included.

import prody as pr

ag, header = pr.parseMMCIF('7E1B.cif', chain="A", header=True)
complete_sequence_with_missing_residues = header['A'].sequence

cif file: https://files.rcsb.org/view/7E1B.cif

Thanks for your assistance.

jamesmkrieger commented 1 year ago

We have a function alignTwoSequencesWithBiopython that you could use on your complete sequence and ag.ca.getSequence(). The MSA and/or indices it returns may be helpful for creating what you need.
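As a minimal sketch of the step after the alignment: once the complete sequence and the observed sequence are available as two gapped strings of equal length, the mask can be read off column by column. The function name missing_mask and the gapped input strings are assumptions made for illustration. The exact return format of alignTwoSequencesWithBiopython is not shown in this thread, so extracting the two aligned strings from it is left to the reader. Note that with repeated motifs such as GAPGAPGAP, an aligner may place the gap boundaries ambiguously.

```python
def missing_mask(aligned_complete, aligned_observed, gap="-"):
    """Return '+' for observed and '-' for missing residues of the complete sequence.

    Both arguments are rows of a pairwise alignment (hypothetical inputs,
    e.g. taken from an aligner's output) and must have the same length.
    """
    if len(aligned_complete) != len(aligned_observed):
        raise ValueError("aligned sequences must have the same length")

    mask = []
    for full_res, obs_res in zip(aligned_complete, aligned_observed):
        if full_res == gap:
            # Column not present in the complete sequence: skip it so the
            # mask stays the same length as the ungapped complete sequence.
            continue
        mask.append("-" if obs_res == gap else "+")
    return "".join(mask)


# Example from the question: GAPGAPGAP is unresolved.
mask = missing_mask("WWWGAPGAPGAPWWW", "WWW---------WWW")
print(mask)  # +++---------+++
```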

jonathanking commented 1 year ago

Thank you! There is a field in the cif file called _pdbx_unobs_or_zero_occ_residues that records some of this information. Is it possible to access this data via prody? I'm not sure if fields like this are parsed when header=True, for example.

jamesmkrieger commented 1 year ago

No, I don’t think it can be parsed with header=True, because the underlying function reproduces PDB header parsing and returns an object with a particular structure.

However, CIF is a type of STAR format, so you should be able to use parseSTAR and navigate the hierarchical dictionary object it returns to get there.

If you get stuck, let me know and I’ll see if I can help figure it out.
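For illustration only, here is a hand-rolled sketch that reads the _pdbx_unobs_or_zero_occ_residues loop directly from mmCIF text and turns it into a mask. It does not use parseSTAR or any other ProDy function. The helper names parse_loop and mask_from_unobserved are hypothetical, and the example rows are toy data built to match the WWWGAPGAPGAPWWW example, not taken from 7E1B. It assumes the category is written in loop_ form with whitespace-separated values, and that label_seq_id is the 1-based position in the full chain sequence.

```python
CATEGORY = "_pdbx_unobs_or_zero_occ_residues"
FIELDS = ["id", "PDB_model_num", "polymer_flag", "occupancy_flag",
          "auth_asym_id", "auth_comp_id", "auth_seq_id", "PDB_ins_code",
          "label_asym_id", "label_comp_id", "label_seq_id"]


def parse_loop(cif_text, category):
    """Return the rows of one loop_ category as a list of dicts (simplified)."""
    prefix = category + "."
    columns, rows = [], []
    for line in cif_text.splitlines():
        token = line.strip()
        if token.startswith(prefix):
            # Column declaration, e.g. "_category.label_seq_id"
            columns.append(token[len(prefix):].split()[0])
        elif columns:
            # Past the declarations: stop at the end of the category.
            if not token or token.startswith(("#", "_", "loop_", "data_")):
                break
            values = token.split()
            if len(values) == len(columns):
                rows.append(dict(zip(columns, values)))
    return rows


def mask_from_unobserved(sequence, rows, chain="A", model="1"):
    """Mark positions listed as unobserved or zero occupancy with '-'."""
    missing = {
        int(row["label_seq_id"])
        for row in rows
        if row["polymer_flag"] == "Y"
        and row["auth_asym_id"] == chain
        and row["PDB_model_num"] == model
    }
    return "".join("-" if pos in missing else "+"
                   for pos in range(1, len(sequence) + 1))


# Toy mmCIF fragment: residues 4 to 12 (GAPGAPGAP) of chain A are unobserved.
header_lines = ["loop_"] + [f"{CATEGORY}.{name}" for name in FIELDS]
row_lines = [
    f"{n} 1 Y 1 A {res} {pos} ? A {res} {pos}"
    for n, (pos, res) in enumerate(
        zip(range(4, 13), ["GLY", "ALA", "PRO"] * 3), start=1)
]
EXAMPLE_CIF = "\n".join(header_lines + row_lines + ["#"])

rows = parse_loop(EXAMPLE_CIF, CATEGORY)
mask = mask_from_unobserved("WWWGAPGAPGAPWWW", rows, chain="A")
print(mask)  # +++---------+++
```

Filtering on auth_asym_id matches the chain="A" argument used in the question. Whether label_asym_id is the better key depends on how the chains in a given entry are labelled.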

jamesmkrieger commented 1 year ago

Perhaps I can add or extend an option that lets you pass particular keys to parse.

jamesmkrieger commented 1 year ago

I've now added both a generic option to parse data with any key and a specific one to get an alignment of unobserved residues. These cannot be used from parsePDB with header=True. They have to be used in parseCIFHeader.

Please check #1705 for more details and let me know if this does what you'd like it to. You can access these changes to test it by checking out the associated branch.

If you don't yet have a GitHub version of ProDy, you can clone it from this branch directly as follows:

git clone -b cif_header https://github.com/jamesmkrieger/ProDy.git ProDy

If you do have it then you can add my fork as a new remote and then check out the remote branch as follows:

git remote add james https://github.com/jamesmkrieger/ProDy.git
git checkout -b cif_header james/cif_header

jamesmkrieger commented 1 year ago

Hi @jonathanking,

Have you had a chance to try this?

jonathanking commented 1 year ago

Thanks for your help. Unfortunately, I have not. I proceeded with another direction for the project I was using this for. I think this would be helpful in the future, though!

Best, Jonathan

jamesmkrieger commented 1 year ago

Ok, thanks