medbioinf / pia

:books: :microscope: PIA - Protein Inference Algorithms
https://github.com/medbioinf/pia

Question about the -proteinExport function #108

Closed cksakura closed 6 years ago

cksakura commented 6 years ago

Hello, I tried the -proteinExport function on the sample data and found problems in the result. I used the command line `-infile yeast-gold-015-filtered.pia.xml -paramFile parameter.xml -proteinExport yeast-gold-015-filtered.csv csv`. The parameter file is the sample provided at https://github.com/mpc-bioinformatics/pia/wiki/parameters-XML-file, in which I changed the score names.

`<?xml version="1.0" encoding="UTF-8" standalone="yes"?>`

`This file will contains a pipeline execution for PIA`

In the generated result file, the score column is "NaN":

| accessions | score | #peptides | #PSMs | #spectra |
| --- | --- | --- | --- | --- |
| P36071 | NaN | 1 | 1 | 1 |
| P25294 | NaN | 3 | 12 | 12 |
| P48415 | NaN | 1 | 1 | 1 |
| P38249 | NaN | 6 | 12 | 12 |

However, the sample result file yeast-gold-015-filtered-proteins.csv contains score values, and the three columns "isDecoy", "FDR" and "q-value" are missing from my result. I'm not sure at which step of the process I went wrong.

Thank you for your help.

Kai Cheng

julianu commented 6 years ago

Hello Kai Cheng, I will look into the problem when I am back from the Christmas holidays. Please stand by...

cksakura commented 6 years ago

Hello Uszkoreit,

Thank you very much, please take your time. Happy holidays!

Best regards, Kai


julianu commented 6 years ago

Hej Kai,

Sorry for the long delay. Here are the answers to your problem:

1. In the PSMAddPreferredFDRScore tags, the "short versions" of the score names must be used. An (incomplete) list of these is at https://github.com/mpc-bioinformatics/pia/wiki/Score-Names. So, if you wanted to use the pre-calculated FDR q-values, psm_q_value would be correct, and xtandem_expect for the X!Tandem expect values. For this file, however, I would rather suggest using xtandem_expect and msgf_specevalue (MSGF:Spectrum e-value) for the calculation of the FDR scores.
2. In the ProteinInfereProteins tag, psm_combined_fdr_score must be used instead of combined_fdr_score. The Wiki page had this wrong (a leftover from older versions), but it is corrected now. Sorry for this; the command line documentation in particular is not as up to date as it should be.
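As a rough illustration of point 2, the following sketch scans the text of a parameter file for `combined_fdr_score` written without the `psm_` prefix. It works on plain text only and makes no assumption about the actual layout of the PIA parameter XML; the function name `find_outdated_score_names` and the sample lines are made up for this example.

```python
import re

# Matches "combined_fdr_score" only when it is NOT part of a longer name
# such as "psm_combined_fdr_score" (the preceding character must not be
# a letter, digit or underscore).
OUTDATED = re.compile(r"(?<![A-Za-z0-9_])combined_fdr_score")


def find_outdated_score_names(xml_text):
    """Return (line_number, stripped_line) for each line using the outdated name."""
    hits = []
    for number, line in enumerate(xml_text.splitlines(), start=1):
        if OUTDATED.search(line):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical lines, NOT the real PIA parameter file format.
    sample = (
        'first value="combined_fdr_score"\n'
        'second value="psm_combined_fdr_score"\n'
    )
    for number, line in find_outdated_score_names(sample):
        print(f"line {number}: {line}")
```

Only the first sample line is reported, because the second one already carries the `psm_` prefix.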

I attached the modified workflow XML file. Please let me know whether this works for you.

Best, Julian

pia_workflow-changed.zip https://github.com/mpc-bioinformatics/pia/files/1636449/pia_workflow-changed.zip
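Whether a protein export shows the symptoms reported in this issue (NaN scores and missing "isDecoy", "FDR" and "q-value" columns) can be checked with a small script. This is only a sketch: the column names are taken from the excerpt in this thread, while the comma delimiter and the helper name `check_protein_export` are assumptions, so the delimiter may need adjusting for a real export file.

```python
import csv
import io
import math

# Column names as they appear in this thread; the delimiter is an assumption.
EXPECTED_COLUMNS = ["accessions", "score", "isDecoy", "FDR", "q-value"]


def check_protein_export(csv_text, delimiter=","):
    """Return (missing_columns, accessions_with_nan_score) for an export."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    if not rows:
        return list(EXPECTED_COLUMNS), []

    header = [name.strip() for name in rows[0]]
    missing = [name for name in EXPECTED_COLUMNS if name not in header]

    nan_accessions = []
    if "score" in header and "accessions" in header:
        score_idx = header.index("score")
        acc_idx = header.index("accessions")
        for row in rows[1:]:
            if len(row) <= max(score_idx, acc_idx):
                continue  # skip blank or truncated lines
            try:
                score = float(row[score_idx])
            except ValueError:
                continue  # skip non-numeric entries
            if math.isnan(score):
                nan_accessions.append(row[acc_idx].strip())
    return missing, nan_accessions


if __name__ == "__main__":
    sample = (
        "accessions,score,#peptides,#PSMs,#spectra\n"
        "P36071,NaN,1,1,1\n"
        "P25294,NaN,3,12,12\n"
    )
    missing, nan_accessions = check_protein_export(sample)
    print("missing columns:", missing)
    print("NaN scores for:", nan_accessions)
```

An export produced with correct score names should yield two empty lists.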

cksakura commented 6 years ago

Hello Julian,

I really appreciate your help, it works well now. May I ask a question? In the PIA web server, I see that the ProteinInfereProteins tag has two options for "used spectra", "best" and "all" (I'm not sure whether more options exist). Which one do you recommend? More generally, is there a set of "recommended parameters" for this task?

Thank you again for your response.

Best regards, Kai


julianu commented 6 years ago

Hej Kai,

There are only two options, namely "best" and "all". I would recommend "best", as otherwise you will get a heavy bias towards proteins that were found by many spectra but maybe only a few peptides. If that is what you need, you can use "all".
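The bias described above can be illustrated with a toy calculation. This is not PIA's scoring code: it assumes a simple additive protein score over made-up, higher-is-better PSM scores, and the function `protein_score` and the peptide names are hypothetical.

```python
def protein_score(psm_scores_by_peptide, used_spectra="best"):
    """Toy additive protein score.

    "best": each peptide contributes only its best-scoring PSM.
    "all":  every PSM of every peptide contributes.
    """
    if used_spectra not in ("best", "all"):
        raise ValueError("used_spectra must be 'best' or 'all'")
    total = 0.0
    for scores in psm_scores_by_peptide.values():
        if not scores:
            continue
        total += max(scores) if used_spectra == "best" else sum(scores)
    return total


# One peptide identified by ten spectra.
few_peptides_many_spectra = {"PEPTIDEA": [2.0] * 10}
# Three peptides identified by one spectrum each.
many_peptides_few_spectra = {
    "PEPTIDEB": [2.0],
    "PEPTIDEC": [2.0],
    "PEPTIDED": [2.0],
}

if __name__ == "__main__":
    for mode in ("best", "all"):
        print(
            mode,
            protein_score(few_peptides_many_spectra, mode),
            protein_score(many_peptides_few_spectra, mode),
        )
```

In this toy setup, "all" ranks the single-peptide protein far above the three-peptide protein purely because of its spectrum count, while "best" ranks the protein with more distinct peptides higher.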

For "recommended parameters", there are some, which are basically the ones set in the KNIME nodes (and also in the slightly outdated web application). The tutorial (https://github.com/mpc-bioinformatics/pia#tutorial) also explains the most relevant parameters. I can imagine they are not that easy to understand right away, so feel free to ask here if anything is unclear.

Best regards, Julian

cksakura commented 6 years ago

Hello Julian,

Thanks a lot!

Best regards, Kai
