smith-chem-wisc / MetaMorpheus

Proteomics search software with integrated calibration, PTM discovery, bottom-up, top-down and LFQ capabilities
MIT License

Reading in isoforms and SNVs from UniProt XML databases #1842

Open rmillikin opened 4 years ago

rmillikin commented 4 years ago

Apparently we're not reading in the isoforms and SNVs annotated in UniProt XML databases, though we do read in each protein's annotated modified residues. It may or may not be a good idea to read in all SNVs, etc., but maybe we should provide options to search these annotated species.

Counting entries in a human canonical reviewed XML:

Entries: 20397 proteins
"isoform" XML tags: 32925 (though it seems that perhaps this includes the canonical sequences?)
"modified residue" tags: 52891
"sequence variant" tags: 79669
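For anyone who wants to reproduce this kind of tally, here is a minimal Python sketch. It is not how the numbers above were produced and it is not MetaMorpheus code. It assumes the usual UniProt XML layout: `<entry>` elements in the `http://uniprot.org/uniprot` namespace, `<isoform>` elements under the "alternative products" comment, and `<feature type="...">` elements for modified residues and sequence variants. The function name `count_uniprot_annotations`, the accession `EXAMPLE1`, and the whole `SAMPLE` entry are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_uniprot_annotations(source):
    """Count entries, isoform elements, and feature types in a UniProt XML stream.

    `source` is a file path or a binary file-like object. (Hypothetical helper.)
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "isoform":
            counts["isoform"] += 1
        elif name == "feature":
            # e.g. "modified residue", "sequence variant", "splice variant"
            counts[elem.get("type", "unknown")] += 1
        elif name == "entry":
            counts["entry"] += 1
            elem.clear()  # children are already counted; release the memory
    return counts


# Made-up single-entry document, only to exercise the function.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>EXAMPLE1</accession>
    <comment type="alternative products">
      <isoform><id>EXAMPLE1-1</id><name>1</name><sequence type="displayed"/></isoform>
      <isoform><id>EXAMPLE1-2</id><name>2</name><sequence type="described"/></isoform>
    </comment>
    <feature type="modified residue" description="Phosphoserine">
      <location><position position="5"/></location>
    </feature>
    <feature type="sequence variant">
      <original>A</original><variation>T</variation>
      <location><position position="9"/></location>
    </feature>
  </entry>
</uniprot>"""

counts = count_uniprot_annotations(io.BytesIO(SAMPLE))
for key in ("entry", "isoform", "modified residue", "sequence variant"):
    print(key, counts[key])
```

In the made-up entry the canonical sequence is listed as its own `<isoform>` element (the one marked "displayed"). If real files follow that pattern, it would explain why the "isoform" tag count includes canonical sequences, but that should be checked against an actual download.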

acesnik commented 4 years ago

I agree this could be interesting functionality. However, I believe UniProt's sequence variant tags won't be very useful for searches because they're collected from all sorts of samples; Gloria Sheynkman showed something similar with dbGaP in her 2014 article.

For isoforms, one can search the canonical & isoform FASTA database and use GPTMD to discover modifications.