pkienzle / periodictable

Extensible periodic table for python
http://periodictable.readthedocs.org

support post-translational modifications in FASTA #37

Open pkienzle opened 3 years ago

pkienzle commented 3 years ago

I would like to use the NIST website (https://www.ncnr.nist.gov/resources/activation/) to calculate the SLD of a protein containing phosphoserine. I entered the protein's amino acid sequence on the website and used “J” for phosphoserine, but it was not recognized as phosphoserine: no phosphorus appeared in the chemical composition of the sample. Is there another way to include phosphoserine on the website?

Looking at Wikipedia, J is used in FASTA to represent either L or I,[1] so I average them 50:50.[2]

I see that a number of post-translational modifications may occur,[3] but I don't know which formats can represent them. I can imagine extending FASTA with an optional lower-case modification code after each sequence element. For example, phosphoserine could be Sp rather than S. This would be easy enough to parse, but I would rather not invent a new format if one already exists.
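To show how little parsing the "Sp" idea would need, here is a minimal sketch of a tokenizer for it. This is not the existing periodictable parser: the `tokenize` function, the `TOKEN` pattern, and the rule "one upper-case letter followed by optional lower-case letters" are assumptions made for illustration only.

```python
import re

# Hypothetical token pattern: one upper-case residue code, optionally
# followed by lower-case letters naming a modification ("Sp" = S + p).
TOKEN = re.compile(r"([A-Z])([a-z]*)")

def tokenize(sequence):
    """Split an extended sequence into (residue, modification) pairs.

    The modification is the empty string for an unmodified residue.
    Raises ValueError on any character that does not fit the pattern.
    """
    sequence = "".join(sequence.split())  # drop whitespace and line breaks
    tokens = []
    pos = 0
    for match in TOKEN.finditer(sequence):
        if match.start() != pos:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append((match.group(1), match.group(2)))
        pos = match.end()
    if pos != len(sequence):
        raise ValueError(f"unexpected character at position {pos}")
    return tokens

print(tokenize("MKSpTA"))
# [('M', ''), ('K', ''), ('S', 'p'), ('T', ''), ('A', '')]
```

A plain FASTA sequence passes through unchanged, with an empty modification on every residue, so the extension would be backward compatible for upper-case input.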

Once the format is defined, and the parser[4] updated, the residue table[5] will need to be extended with new codes, volumes, chemical formulae (including labile hydrogen and charge), and name.

[1] FASTA: https://en.wikipedia.org/wiki/FASTA_format#Sequence_representation
[2] periodictable fasta 'J': https://github.com/pkienzle/periodictable/blob/master/periodictable/fasta.py#L351
[3] PTMs by residue: https://en.wikipedia.org/wiki/Posttranslational_modification#Common_PTMs_by_residue
[4] FASTA parser: https://github.com/pkienzle/periodictable/blob/4fb8068cc94a96704646e14ef2aebf939697e164/periodictable/fasta.py#L198-L208
[5] residue table: https://github.com/pkienzle/periodictable/blob/4fb8068cc94a96704646e14ef2aebf939697e164/periodictable/fasta.py#L320-L354

pkienzle commented 3 years ago

Meanwhile, you can do this in stages. Enter the FASTA sequence and press calculate, then type in

nHPO3 + sample formula @ density

where n is the number of phosphorylated serines, and the sample formula and density are those printed by the first calculation. The density will be wrong, but probably within uncertainty, since (a) the number of SEP will be small relative to the total sequence and (b) the computed density is already a poor approximation, given that it assumes perfectly packed residue volumes regardless of protein conformation.
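The arithmetic behind this workaround can be sketched in a few lines of plain Python. This does not use the periodictable API; `parse_formula`, `add_phosphates`, and `to_hill` are hypothetical helpers that only handle flat formulas without parentheses. A single serine residue (C3H5NO2) stands in for the sample formula printed by the first calculation.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Parse a flat formula such as 'C3H5NO2' into element counts."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def add_phosphates(sample_formula, n):
    """Return the element counts of n HPO3 + sample_formula."""
    total = parse_formula(sample_formula)
    for element, count in parse_formula("HPO3").items():
        total[element] += n * count
    return total

def to_hill(counts):
    """Format counts in Hill order: C, then H, then the rest alphabetically."""
    order = [e for e in ("C", "H") if e in counts]
    order += sorted(e for e in counts if e not in ("C", "H"))
    return "".join(e + (str(counts[e]) if counts[e] != 1 else "") for e in order)

# One serine residue plus one HPO3 gives the phosphoserine residue.
print(to_hill(add_phosphates("C3H5NO2", 1)))
# C3H6NO5P
```

For a real protein, the first argument would be the full sample formula from the first calculation and n the number of phosphorylated serines; the density still has to be taken from that first calculation, with the caveats given above.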

pkienzle commented 3 years ago

A short-term fix would be to allow FASTA sequences in mixtures, so that HO3P@2.2 + aa:S would be one phosphoserine.[1]

[1] H2O3P density: http://www.chemspider.com/Chemical-Structure.2341689.html?rid=352b4aa5-d266-4a1f-87c4-98f363fe67b8&page_num=0