New component: Retrieve MS-based information (Peptide Atlas)

yvandenb commented 6 years ago

Specification: Goal: to allow end-users to add MS-based info (from PeptideAtlas) to their protein list and to check whether their proteins have been experimentally observed at the protein level in a given human tissue/sample

Proteomics datasets from which MS-based information will be extracted

Source file: home-made (by YV) for rapid prototyping purposes (see comment at the end of this issue):
• retrieve the protein set for each tissue ("build") via https://db.systemsbiology.net/sbeams/cgi/PeptideAtlas/buildInfo?_subtab=2
• retrieve the canonical proteins for a given human build (i.e. a given sample/tissue) by clicking on the "Canonical proteins" column in the row of the right build (one build per row); for example, for human plasma non-glyco: https://db.systemsbiology.net/sbeams/cgi/PeptideAtlas/GetProteins?atlas_build_id=465&presence_level_constraint=1&redundancy_constraint=4&biosequence_name_constraint=!DECOY%%3B!CONTAM%&apply_action=QUERY
• download the txt file generated by the query (via the PeptideAtlas GUI)
• gather the files and deposit them on bioproj

=> Source file location: https://bioproj.extra.cea.fr/redmine/projects/proteore/repository/changes/Use_Cases_ProteoRE/UseCase2_HumanBiomarkerSelection/

Presently, 6 tissue/sample proteomes are of interest and have been retrieved manually, namely:
1. Human plasma non glycosylated - Human_Plasma_NonGlyco_201803_PeptideAtlas.txt
2. Human urine - Human_Urine_201803_PeptideAtlas.txt
3. Human brain - Human_Brain_201803_PeptideAtlas.txt
4. Human heart - Human_Heart_201803_PeptideAtlas.txt
5. Human kidney - Human_Kidney_201803_PeptideAtlas.txt
6. Human liver - Human_Liver_201803_PeptideAtlas.txt

For each table (PA source file in tab-delimited format deposited on bioproj), the information to be used (for now) is:
• Col. A: the Uniprot accession number (“biosequence_name” in the PA source file)
• Col. F: an integer (“n_observations” in the PA source file)
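A minimal sketch of how one of these tab-delimited source files could be read, assuming the file has a header row containing the `biosequence_name` and `n_observations` column names quoted above. The function name `load_observations` is hypothetical and the counts in the example are made up for illustration only.

```python
import csv
import io


def load_observations(handle):
    """Map each Uniprot accession to its n_observations value.

    Assumes a tab-separated file whose header row contains the
    'biosequence_name' and 'n_observations' columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    observations = {}
    for row in reader:
        accession = row["biosequence_name"].strip()
        observations[accession] = int(row["n_observations"])
    return observations


# Made-up excerpt (not real PeptideAtlas values), used only to show the call.
example = io.StringIO(
    "biosequence_name\tn_observations\n"
    "Q12860\t12\n"
    "P02768\t3\n"
)
print(load_observations(example))
```

Looking columns up by name rather than by position (A, F) keeps the sketch working if the column order of the export changes.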

Submission form:
• Input
  • Copy/paste protein IDs (Uniprot accession numbers) or a tabular file (with a column number option indicating which column holds the Uniprot accession numbers required as IDs + header yes/no)
• Options*: select proteomics dataset (sample) (names below, organized using a radio-button menu)

Human plasma non glycosylated => (corresponding file name: Human_Plasma_NonGlyco_201803_PeptideAtlas.txt)
Human urine => (Human_Urine_201803_PeptideAtlas.txt)
Human brain => (Human_Brain_201803_PeptideAtlas.txt)
Human heart => (Human_Heart_201803_PeptideAtlas.txt)
Human kidney => (Human_Kidney_201803_PeptideAtlas.txt)
Human liver => (Human_Liver_201803_PeptideAtlas.txt)
One to all options can be selected by the end-user.
• Output
  • Columns 1..x: should contain the whole content of the input file used (or the original ID list in col. 1 in copy/paste mode)
  • Additional columns:
Name “Nb of times peptide Observed_THE_FIRST_TISSUE_NAME_SELECTED”: integer (corresponds to col. F “n_observations” in the PA source file)
Name “Nb of times peptide Observed_THE_SECOND_TISSUE_NAME_SELECTED”: integer
Etc…
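The output layout specified above could be sketched as follows. This is only an illustration under assumptions: `annotate` and `column_names` are hypothetical names, each selected tissue is represented by a plain dict mapping accession to n_observations, and the "NA" placeholder for proteins absent from a tissue table is a choice made for this sketch.

```python
def column_names(tissue_tables):
    """Build one additional column name per selected tissue, in selection order."""
    return [f"Nb of times peptide Observed_{tissue}" for tissue in tissue_tables]


def annotate(rows, id_column, tissue_tables, missing="NA"):
    """Append one n_observations value per selected tissue to every input row.

    rows          : list of lists (the whole input content, kept unchanged)
    id_column     : 0-based index of the column holding the Uniprot accession
    tissue_tables : dict {tissue name: {accession: n_observations}}
    """
    annotated = []
    for row in rows:
        protein_id = row[id_column]
        counts = [
            str(table.get(protein_id, missing))
            for table in tissue_tables.values()
        ]
        annotated.append(list(row) + counts)
    return annotated


# Made-up values, for illustration only.
tables = {"Human_Plasma": {"Q12860": 7}, "Human_Urine": {}}
print(column_names(tables))
print(annotate([["Q12860", "some annotation"]], 0, tables))
```

In copy/paste mode the same function applies with single-element rows and `id_column=0`.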

User doc section: will follow... For any further details, feel free to call me

N.B.: source files from Peptide Atlas (PA) are usually XML files called "builds", each of which is assigned an id (build_id); see http://www.peptideatlas.org/builds/ for a complete picture of what is available. As each XML file is (very) large, the current idea would be either to post-process the XML once downloaded from PeptideAtlas or to retrieve the info with a query via the API of nextprot (which also gathers info from PeptideAtlas). I suggest discussing this aspect afterwards, as the only thing we need at the moment is to prototype the behavior and the GUI to better figure out what should be improved with Use Case 2 (see issue #84)

NguyenLien commented 6 years ago

@yvandenb Extracting the info from the source files (home-made by YV) is not complicated. But based on your query to NP in #84, we can get the entry for each ID, which would avoid downloading the whole Peptide Atlas. However, I haven't understood how to extract the information from the resulting entry. Do you want me to first build a component based on your home-made source files and then investigate the NP query, or to investigate the NP query directly?

yvandenb commented 6 years ago

A very good question that you raised, Lien... Btw, I had a discussion about this matter with Lydie Lane (NP's PI) last Monday; obviously it would be easier and advantageous to work with information from NP for many reasons: curated data, high content, data richness, advanced queries using SPARQL via the API... and a very good relationship! This is actually what we did with Lisa when she prototyped the "Protein features", and it is still of interest for updating the NP info we need; BUT in the case of the MS-based information needed for UC2 (i.e. "nbr of psm observed" in which tissue (in fact "build")), Lydie confirmed that NP does not integrate this info in their RDF model. This is why we still need to consider info from PA, and the simplest way to retrieve it. I sent a message to the PA manager yesterday and got an answer (that I am going to forward to you). Thus, at the moment, my suggestion would be to first build a tool based on my home-made source files...

NguyenLien commented 6 years ago

The first version of this component is now available in dev instance !

yvandenb commented 6 years ago

Let's have a look :+1:

yvandenb commented 6 years ago

Btw, find below the emails I exchanged with the Peptide Atlas staff

Hi Yves, What you can do is do the query for each tissue type you are interested in. Below link is for Brain.

https://db.systemsbiology.net/sbeams/cgi/PeptideAtlas/GetPeptides?atlas_build_id=472&display_options=ShowMappings&organism_id=2&sample_category_id=2&QUERY_NAME=AT_GetPeptides&output_mode=tsv&apply_action=QUERY

The brain is specified as sample_category_id=2 in the link. You can get full list of sample_category_id here:

https://db.systemsbiology.net/sbeams/cgi/PeptideAtlas/ManageTable.cgi?TABLE_NAME=AT_sample_category

Zhi

-----Original Message-----
From: VANDENBROUCK Yves 206108 [mailto:yves.vandenbrouck@cea.fr]
Sent: Wednesday, March 14, 2018 9:38 AM
To: Zhi Sun
Cc: Eric Deutsch
Subject: RE: Human PeptideAtlas download

Dear Zhi, Dear Eric,

Thank you for your answer. I need to retrieve MS-based information related to a list of human proteins, such as the "nbr of psm observed" in a given tissue/sample of interest. I actually did it by parsing the available .xml files corresponding to older builds (as reported), and I would now like to update this information using the most recent version of the human PA build. I am not sure it would be feasible via the query interface in batch mode, would it?
Regards, Yves

Yves Vandenbrouck, PhD
Etude de la Dynamique des Protéomes (EDyP), Laboratoire Biologie à Grande Echelle (BGE), U1038 INSERM/CEA/UGA
Biosciences and Biotechnology Institute of Grenoble (BIG), CEA/Grenoble

-----Original Message-----
From: Zhi Sun [mailto:zsun@systemsbiology.org]
Sent: Wednesday, March 14, 2018 17:22
To: VANDENBROUCK Yves 206108 yves.vandenbrouck@cea.fr
Cc: Eric Deutsch edeutsch@systemsbiology.org
Subject: RE: Human PeptideAtlas download

Hi Yves, The xml file is not generated. Can you let me know what you need? Maybe we can get information through PeptideAtlas query interface.

Thanks, Zhi

-----Original Message-----
From: Yves VANDENBROUCK yves.vandenbrouck@cea.fr

Dear colleagues, I tried to download the latest version of the Human build (Jan 2018, XML file) via this web page: http://www.peptideatlas.org/builds/ and was redirected to this web page http://www.peptideatlas.org/builds/human/201712/atlas_build_472.xml.gz with the following error message: "Not Found The requested URL /builds/human/201712/atlas_build_472.xml.gz was not found on this server." Could you please help me with that and provide me with the right link?
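The per-tissue query suggested by Zhi in the exchange above could be assembled programmatically along these lines. This is a sketch, not a tested client: the parameter names and values are copied from the Brain link in the email, `build_peptide_query` is a hypothetical helper, and only `sample_category_id=2` (brain) is known from this thread; the ids for other tissues would have to be looked up in the AT_sample_category table linked above. No request is sent here.

```python
from urllib.parse import urlencode

BASE_URL = "https://db.systemsbiology.net/sbeams/cgi/PeptideAtlas/GetPeptides"


def build_peptide_query(atlas_build_id, sample_category_id):
    """Return a GetPeptides URL asking for TSV output for one sample category."""
    params = {
        "atlas_build_id": atlas_build_id,
        "display_options": "ShowMappings",
        "organism_id": 2,
        "sample_category_id": sample_category_id,
        "QUERY_NAME": "AT_GetPeptides",
        "output_mode": "tsv",
        "apply_action": "QUERY",
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Build 472 and sample category 2 (brain) are the values given in the email.
print(build_peptide_query(472, 2))
```

Calling the helper once per tissue of interest would give one URL per sample, which is the batch-like behaviour asked about in the email.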

yvandenb commented 6 years ago

Ok Lien, this new tool works fine! Bravo... Just two points now need to be improved:

[ ] Following my exchanges with the PA staff, we have to update the source files (they are a bit different from those you used for this first version, and consequently this is going to impact the submission form, see point 2 below). As we have now agreed on what we are manipulating in terms of PA data (and how to retrieve them via a PA query), I created a new issue describing the procedure to create them: #93
[ ] The submission form for this tool needs a few enhancements, just to keep track: specify the ID type required (Uniprot accession number) and add the user doc section (assigned to me as usual ;-))

yvandenb commented 6 years ago

User doc for: "Retrieve MS-based information at the peptide level add MS-based annotation to your protein list from Peptide Atlas"
New title => "Retrieve MS-based information at the peptide level (from Peptide Atlas)"
Given a list of Uniprot accession numbers, the tool retrieves MS-based information for each peptide identified for a given protein. This could be of interest to people who wish to select peptides for further targeted MS-based experiments (i.e. if the protein is detectable in the sample, it will be detected via that peptide).

Input required: A list of Uniprot accession numbers (e.g. Q12860) provided either as a file (in which case you must specify the column containing your Uniprot accession numbers) or in copy/paste mode. If your input file or list contains other types of IDs, please use the ID_Converter tool to convert them into Uniprot accession numbers.
Output: One output is returned for each selected proteomics sample (indicated by the name of the output in the history panel); it contains the list of peptides identified for each requested protein, with the following additional information:

peptide_accession: peptide accession number assigned by Peptide Atlas
peptide_sequence: amino acid sequence of the peptide
n_observations: number of times this peptide has been observed in the sample
empirical_proteotypic_score: the likelihood for a peptide to be proteotypic (from 0 to 1; the higher the score, the higher the proteotypic propensity)
SSRCalc_relative_hydrophobicity: predicted hydrophobicity computed using the sequence-specific retention calculator (SSRCalc) algorithm. A low score indicates low hydrophobicity.

Data were retrieved from Peptide Atlas release (Jan 2018)
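As a rough illustration of the peptide-level output described above, the selection step could look like the sketch below. Several things are assumptions: the peptide table is taken to be a list of dicts already loaded from one sample's file, the key holding the protein accession is given the hypothetical name `protein_accession`, and `peptides_for` is not the tool's actual function. The example row is made up.

```python
OUTPUT_COLUMNS = [
    "peptide_accession",
    "peptide_sequence",
    "n_observations",
    "empirical_proteotypic_score",
    "SSRCalc_relative_hydrophobicity",
]


def peptides_for(accessions, peptide_rows, protein_key="protein_accession"):
    """Keep the peptides mapped to the requested proteins.

    protein_key is a hypothetical column name for the protein accession;
    the real column name in the PeptideAtlas export may differ.
    """
    wanted = set(accessions)
    selected = []
    for row in peptide_rows:
        if row[protein_key] in wanted:
            record = {"protein": row[protein_key]}
            record.update({column: row[column] for column in OUTPUT_COLUMNS})
            selected.append(record)
    return selected


# Made-up peptide row, for illustration only.
rows = [{
    "protein_accession": "Q12860",
    "peptide_accession": "PAp_example",
    "peptide_sequence": "EXAMPLEK",
    "n_observations": 4,
    "empirical_proteotypic_score": 0.9,
    "SSRCalc_relative_hydrophobicity": 21.5,
}]
print(peptides_for(["Q12860"], rows))
```

Running the function once per selected sample would yield one output per sample, matching the behaviour described in the user doc.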

next "user doc" (protein-level) coming soon ;-)

yvandenb commented 6 years ago

User doc: Retrieve MS-based information at the protein level add MS-based annotation to your protein list from Peptide Atlas
New title => Number of MS/MS observations in sample (from Peptide Atlas)
Given a list of Uniprot accession numbers, this tool indicates the number of times each protein has been observed in a given sample using an LC-MS/MS proteomics approach. This could be of interest to people who want to know to what extent a protein is detectable (and to roughly estimate its level) in a given sample using proteomics. The available human biological samples are: brain, heart, kidney, liver, plasma, urine and cerebrospinal fluid (CSF). Data were retrieved from Peptide Atlas release (Jan 2018).

Input required: A list of Uniprot accession numbers (e.g. Q12860) provided either as a file (in which case you must specify the column containing your Uniprot accession numbers) or in copy/paste mode. If your input file or list contains other types of IDs, please use the ID_Converter tool to convert them into Uniprot accession numbers.
Output: Additional columns are created for each selected proteomics sample, reporting the number of times all peptides corresponding to a protein have been observed by LC-MS/MS according to Peptide Atlas. “NA” means that no information has been reported, suggesting that this protein has not been observed in the sample of interest.
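One way the protein-level count described above could be derived is by summing the observations of all peptides mapped to a protein, with "NA" for proteins that have no peptide in the sample. The sketch below assumes the peptide data is available as (accession, n_observations) pairs; the function names are hypothetical, the counts are made up, and this is not necessarily how the deployed tool computes the value.

```python
from collections import defaultdict


def protein_level_counts(peptide_observations):
    """Sum n_observations over all peptides of each protein.

    peptide_observations: iterable of (accession, n_observations) pairs,
    one pair per peptide observed in the sample.
    """
    totals = defaultdict(int)
    for accession, n_observations in peptide_observations:
        totals[accession] += n_observations
    return dict(totals)


def report(accessions, totals, missing="NA"):
    """Return one (accession, count-or-NA) pair per requested protein."""
    return [(accession, totals.get(accession, missing)) for accession in accessions]


# Made-up counts, for illustration only.
totals = protein_level_counts([("Q12860", 4), ("Q12860", 6), ("P02768", 1)])
print(report(["Q12860", "P02768", "NOT_OBSERVED"], totals))
```

Keeping "NA" distinct from 0 preserves the meaning given in the user doc: no information reported, rather than a measured count of zero.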

NguyenLien commented 6 years ago

Done !

vloux / ProteoRE

New component: Retrieve MS-based information (Peptide Atlas) #90