vloux / ProteoRE

GNU General Public License v3.0

New component: Retrieve MS-based information (Peptide Atlas) #90

Closed yvandenb closed 6 years ago

yvandenb commented 6 years ago

Specification. Goal: allow end-users to add MS-based information (from PeptideAtlas) to their protein list and to check whether their proteins have been experimentally observed, at the protein level, in a given human tissue/sample.

For each table (PA source file in tab-delimited format, deposited in bioproj), the information to be used (for now) is:
• Col. A: the Uniprot accession number (“biosequence_name” in the PA source file)
• Col. F: an integer (“n_observations” in the PA source file)

Submission form:
• Input: copy/paste protein IDs (Uniprot accession numbers) or a tabular file (with a column-number option indicating which column holds the Uniprot accession numbers used as IDs, plus a header yes/no option)
• Options: select proteomics dataset(s) (sample); names below, organized using a radio-button menu

  1. Human plasma non glycosylated => (corresponding file name: Human_Plasma_NonGlyco_201803_PeptideAtlas.txt)
  2. Human urine => (Human_Urine_201803_PeptideAtlas.txt)
  3. Human brain => (Human_Brain_201803_PeptideAtlas.txt)
  4. Human heart => (Human_Heart_201803_PeptideAtlas.txt)
  5. Human kidney => (Human_Kidney_201803_PeptideAtlas.txt)
  6. Human liver => (Human_Liver_201803_PeptideAtlas.txt)

  One to all options can be selected by the end-user.
• Output: columns 1..x should contain the whole content of the input file used (or the original ID list in col. 1 in copy/paste mode). Additional columns:
  • Named “Nb of times peptide Observed_” followed by the first tissue name selected; integer (corresponds to col. F “n_observations” in the PA source file)
  • Named “Nb of times peptide Observed_” followed by the second tissue name selected; integer
  • Etc…
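As a rough illustration of the lookup described above, the sketch below reads one PA source table and maps each accession to its observation count. It assumes the tab-delimited file has a header row carrying the names “biosequence_name” and “n_observations”, and that each accession appears on a single row; the function name `load_observations` and the counts in the sample data are made up for the example.

```python
import csv
import io


def load_observations(handle):
    """Map each Uniprot accession to its n_observations value.

    Assumes a tab-delimited table whose header row contains the
    columns "biosequence_name" and "n_observations".
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = {}
    for row in reader:
        counts[row["biosequence_name"]] = int(row["n_observations"])
    return counts


# In-memory stand-in for a file such as Human_Brain_201803_PeptideAtlas.txt.
# The counts are invented for illustration only.
sample = io.StringIO(
    "biosequence_name\tn_observations\n"
    "Q12860\t12\n"
    "P04637\t3\n"
)
brain_counts = load_observations(sample)
print(brain_counts)
```

Reading by column name rather than by position (A, F) keeps the sketch independent of any extra columns in between, at the cost of relying on the header row being present.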

User doc section: will follow... For any further details, feel free to call me

N.B.: source files from Peptide Atlas (PA) are usually XML files called "builds", each of which is assigned an id (build_id) - see http://www.peptideatlas.org/builds/ for a complete picture of what is available. As each XML file is (very) large, the current idea would be either to post-process the XML once downloaded from PeptideAtlas or to retrieve the info with a query via the nextprot API (which also gathers info from PeptideAtlas). I suggest we discuss this aspect afterwards, as the only thing we need at the moment is to prototype the behavior and the GUI to better figure out what should be improved for Use Case 2 (see issue #84).

NguyenLien commented 6 years ago

@yvandenb Extracting the info from the source files (home-made by Yv) is not complicated. But based on your query to NP in #84, we can get the entry for each ID, which would avoid downloading the whole Peptide Atlas. However, I haven't yet understood how to extract the information from the resulting entry. Do you want me to first build a component based on your home-made source files and then investigate the NP query, or to investigate the NP query directly?

yvandenb commented 6 years ago

A very good question, Lien... Btw, I discussed this matter with Lydie Lane (NP's PI) last Monday; obviously it would be easier and advantageous to work with information from NP for many reasons: curated data, high content, data richness, advanced querying using SPARQL via the API... and a very good relationship! This is actually what we did with Lisa when she prototyped "Protein features", and it is still of interest for updating the NP info we need. BUT in the case of the MS-based information needed for UC2 (i.e. "nbr of psm observed" in which tissue (in fact "build")), Lydie confirmed that NP does not integrate this info in their RDF model. This is why we still need to consider info from PA, and the simplest way to retrieve it. I sent a msg to the PA manager yesterday and got an answer (which I'm going to forward to you). Thus, at the moment, my suggestion would be to first build a tool based on my home-made source files...

NguyenLien commented 6 years ago

The first version of this component is now available in dev instance !

yvandenb commented 6 years ago

Let's have a look :+1:

yvandenb commented 6 years ago

Btw, find below the emails I exchanged with the Peptide Atlas staff:

Hi Yves, What you can do is do the query for each tissue type you are interested in. Below link is for Brain.

https://db.systemsbiology.net/sbeams/cgi/PeptideAtlas/GetPeptides?atlas_build_id=472&display_options=ShowMappings&organism_id=2&sample_category_id=2&QUERY_NAME=AT_GetPeptides&output_mode=tsv&apply_action=QUERY

The brain is specified as sample_category_id=2 in the link. You can get full list of sample_category_id here:

https://db.systemsbiology.net/sbeams/cgi/PeptideAtlas/ManageTable.cgi?TABLE_NAME=AT_sample_category

Zhi

-----Original Message-----
From: VANDENBROUCK Yves 206108 [mailto:yves.vandenbrouck@cea.fr]
Sent: Wednesday, March 14, 2018 9:38 AM
To: Zhi Sun
Cc: Eric Deutsch
Subject: RE: Human PeptideAtlas download

Dear Zhi, Dear Eric,

Thank you for your answer. I need to retrieve MS-based information for a list of human proteins, such as the "nbr of psm observed" in a given tissue/sample of interest. I actually did this by parsing the available .xml files corresponding to older builds (as reported), and I would now like to update this info using the most recent version of the human PA build. I am not sure this would be feasible in batch mode via the query interface, would it?

Regards, Yves


Yves Vandenbrouck, PhD
Etude de la Dynamique des Protéomes (EDyP)
Laboratoire Biologie à Grande Echelle (BGE) U1038 INSERM/CEA/UGA
Biosciences and Biotechnology Institute of Grenoble (BIG)
CEA/Grenoble

-----Original Message-----
From: Zhi Sun [mailto:zsun@systemsbiology.org]
Sent: Wednesday, March 14, 2018 17:22
To: VANDENBROUCK Yves 206108 yves.vandenbrouck@cea.fr
Cc: Eric Deutsch edeutsch@systemsbiology.org
Subject: RE: Human PeptideAtlas download

Hi Yves, The xml file is not generated. Can you let me know what you need? Maybe we can get information through PeptideAtlas query interface.

Thanks, Zhi

-----Original Message-----
From: Yves VANDENBROUCK yves.vandenbrouck@cea.fr

Dear colleagues, I tried to download the latest version of the Human build (Jan 2018 - XML file) via this web page: http://www.peptideatlas.org/builds/ and was redirected to http://www.peptideatlas.org/builds/human/201712/atlas_build_472.xml.gz with the following error msg: "Not Found The requested URL /builds/human/201712/atlas_build_472.xml.gz was not found on this server." Could you please help me with that and provide me with the right link?
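The per-tissue query that Zhi describes in the emails above could be assembled programmatically, one URL per sample_category_id. This is only a sketch: the parameter names and values are copied from the Brain link in the email, while the function name `build_query_url` is hypothetical and nothing here is confirmed as the method the component actually uses. No request is sent; the code only builds the URL.

```python
from urllib.parse import urlencode

# Endpoint taken from the link quoted in the email thread.
BASE_URL = "https://db.systemsbiology.net/sbeams/cgi/PeptideAtlas/GetPeptides"


def build_query_url(sample_category_id, atlas_build_id=472):
    """Build a GetPeptides query URL for one sample category.

    Parameter names mirror the Brain example (sample_category_id=2);
    other category ids must be looked up in the AT_sample_category table.
    """
    params = {
        "atlas_build_id": atlas_build_id,
        "display_options": "ShowMappings",
        "organism_id": 2,
        "sample_category_id": sample_category_id,
        "QUERY_NAME": "AT_GetPeptides",
        "output_mode": "tsv",
        "apply_action": "QUERY",
    }
    return f"{BASE_URL}?{urlencode(params)}"


brain_url = build_query_url(2)
print(brain_url)
```

Calling the function once per tissue of interest would give one TSV download link per sample, which matches the "do the query for each tissue type" advice.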

yvandenb commented 6 years ago

Ok Lien, this new tool works fine! Bravo... just two points now need to be improved:

yvandenb commented 6 years ago

User doc for: "Retrieve MS-based information at the peptide level add MS-based annotation to your protein list from Peptide Atlas"

New title => "Retrieve MS-based information at the peptide level (from Peptide Atlas)"

Given a list of Uniprot accession numbers, the tool retrieves MS-based information for each peptide identified for a given protein. This could be of interest for people who wish to select peptides for further targeted MS-based experiments (i.e. if the protein is detectable in the sample, it will be detected via that peptide).

Input required: a list of Uniprot accession numbers (e.g. Q12860), provided either as a file (if you choose a file, you must specify the column that contains your Uniprot accession numbers) or in copy/paste mode. If your input file or list contains other types of IDs, please use the ID_Converter tool to convert them into Uniprot accession numbers.

Output: one output is returned for each selected proteomics sample (indicated by the name of the output in the history panel), containing the list of peptides identified for each requested protein with the following additional information:

Data were retrieved from Peptide Atlas release (Jan 2018)

next "user doc" (protein-level) coming soon ;-)

yvandenb commented 6 years ago

User doc: "Retrieve MS-based information at the protein level add MS-based annotation to your protein list from Peptide Atlas"

New title => "Number of MS/MS observations in sample (from Peptide Atlas)"

Given a list of Uniprot accession numbers, this tool indicates the number of times each protein has been observed in a given sample using an LC-MS/MS proteomics approach. This could be of interest for people who want to know to what extent a protein is detectable (and to roughly estimate its level) in a given sample using proteomics. The available human biological samples are: brain, heart, kidney, liver, plasma, urine and cerebrospinal fluid (CSF). Data were retrieved from Peptide Atlas release (Jan 2018).

Input required: a list of Uniprot accession numbers (e.g. Q12860), provided either as a file (if you choose a file, you must specify the column that contains your Uniprot accession numbers) or in copy/paste mode. If your input file or list contains other types of IDs, please use the ID_Converter tool to convert them into Uniprot accession numbers.

Output: one additional column is created for each selected proteomics sample, reporting the number of times all peptides corresponding to a protein have been observed by LC-MS/MS according to Peptide Atlas. “NA” means that no information has been reported, suggesting that this protein has not been observed in the sample of interest.
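The output described above (one extra column per selected sample, “NA” when nothing is reported) could look roughly like the following sketch. It is not the component's actual code: the function name `annotate`, the “Uniprot_AC” header, the sample labels and all counts are assumptions made for illustration; only the “Nb of times peptide Observed_” column prefix comes from the specification in this issue.

```python
def annotate(accessions, counts_by_sample):
    """Return rows with one observation-count column per selected sample.

    counts_by_sample maps a sample label to a dict of
    {Uniprot accession: number of observations}.
    Accessions absent from a sample get "NA".
    """
    header = ["Uniprot_AC"] + [
        f"Nb of times peptide Observed_{name}" for name in counts_by_sample
    ]
    rows = [header]
    for acc in accessions:
        row = [acc]
        for counts in counts_by_sample.values():
            row.append(str(counts.get(acc, "NA")))
        rows.append(row)
    return rows


# Invented counts for two selected samples; "O00000" is a placeholder ID
# standing in for a protein that is reported in neither sample.
counts_by_sample = {
    "Human_Brain": {"Q12860": 12},
    "Human_Liver": {"Q12860": 4, "P04637": 7},
}
rows = annotate(["Q12860", "P04637", "O00000"], counts_by_sample)
for row in rows:
    print("\t".join(row))
```

Keeping the input order and emitting “NA” rather than dropping rows means the output always has as many lines as the submitted list, which makes it easy to paste back next to the original table.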

NguyenLien commented 6 years ago

Done !