vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

Question about the filter steps between the main report and the matrix in DIANN 1.9 #1056

Open momo-0521 opened 1 week ago

momo-0521 commented 1 week ago

Hi Vadim

Thanks for your work on DIA-NN 1.9. When analyzing results from version 1.9, I noticed a discrepancy between the number of Protein.Group entries I get by filtering the main report in R and the number reported in report.pg_matrix. Are additional filtering steps applied to the matrix? I suspect that the "Additional 5% run-specific protein-level FDR filter applied to the protein matrices, use --matrix-spec-q to adjust it" message might explain the difference, but I'm unsure how to address it.

report_pg <- diann_load("report.pg_matrix.tsv")
length(unique(report_pg$Protein.Group))
[1] 13121
df <- read_parquet("report.parquet")
length(unique(df$Protein.Group[df$Lib.Q.Value <= 0.01 & df$Lib.PG.Q.Value <= 0.01]))
[1] 14126

Thank you in advance

vdemichev commented 1 week ago

Hi,

Please try:

df <- read_parquet("report.parquet")
length(unique(df$Protein.Group[df$Lib.Q.Value <= 0.01 & df$Lib.PG.Q.Value <= 0.01 & df$PG.Q.Value <= 0.05]))

Best, Vadim

momo-0521 commented 1 week ago

Thank you for your advice.

I have tried this, but it does not work. It changed the number of precursors but had no effect on the number of Protein.Group entries.

df <- read_parquet("report.parquet")
length(unique(df$Protein.Group[df$Lib.Q.Value <= 0.01 & df$Lib.PG.Q.Value <= 0.01 & df$PG.Q.Value <= 0.05]))
[1] 14126

Thank you again!

vdemichev commented 1 week ago

Is this MBR output?

momo-0521 commented 1 week ago

Yes, it is MBR output.

vdemichev commented 1 week ago

Can you please share both the .parquet and pg_matrix? A quick check: do the timestamps (date modified) on those files match?

Best, Vadim

momo-0521 commented 1 week ago

Thank you! Please find the files on Google Drive:
https://drive.google.com/file/d/1TAU2fQ1pnf4PXOqAlVVFMu4zM3Vg4L-Q/view?usp=sharing
https://drive.google.com/file/d/1jd-vLFXjsfTy4_dgd-ztEwd8RzqsXoD_/view?usp=sharing

vdemichev commented 1 week ago

length(unique(df$Protein.Group[df$Lib.PG.Q.Value <= 0.01 & df$PG.Q.Value <= 0.05 & df$PG.MaxLFQ > 0]))
[1] 13121

It works if you also filter for non-zero quantities :)
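
To illustrate why the extra PG.MaxLFQ > 0 condition changes the count, here is a minimal sketch in base R on a toy data frame. The column names follow the report columns used in this thread; the values are invented for illustration and do not come from the shared files.

```r
# Toy long-format report: column names follow the thread, values are invented.
df <- data.frame(
  Run            = c("run1", "run2", "run1", "run2", "run1"),
  Protein.Group  = c("P1", "P1", "P2", "P2", "P3"),
  Lib.PG.Q.Value = c(0.001, 0.001, 0.002, 0.002, 0.003),
  PG.Q.Value     = c(0.01, 0.02, 0.01, 0.20, 0.03),
  PG.MaxLFQ      = c(1500, 1800, 900, 0, 0),
  stringsAsFactors = FALSE
)

# Q-value filters only (library-level 1%, run-specific 5%).
q_filter <- df$Lib.PG.Q.Value <= 0.01 & df$PG.Q.Value <= 0.05
groups_q_only <- unique(df$Protein.Group[q_filter])

# Same filters plus the requirement of a non-zero protein group quantity.
groups_q_and_quant <- unique(df$Protein.Group[q_filter & df$PG.MaxLFQ > 0])

n_q_only      <- length(groups_q_only)
n_q_and_quant <- length(groups_q_and_quant)

# Protein groups that pass the q-value filters but have no non-zero quantity.
dropped <- setdiff(groups_q_only, groups_q_and_quant)
```

In this toy example, `dropped` lists the protein groups that pass the q-value filters but never have a non-zero quantity, which is the kind of entry that accounts for the gap between the two counts above.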

momo-0521 commented 1 week ago

Thank you very much for your great help.

Best wishes!

momo-0521 commented 1 week ago

Hi, Vadim

Thanks for your help yesterday. I have a new question. When I use 'diann_maxlfq' to estimate protein group quantities, the results differ significantly from those in 'pg_matrix' and in the 'PG.MaxLFQ' column. Below is the code I used; it worked correctly with DIANN 1.8 but has raised some concerns with DIANN 1.9. Do you have any suggestions or advice on this issue?

protein.groups <- diann_maxlfq(
  df[df$Lib.PG.Q.Value <= 0.01 & df$PG.Q.Value <= 0.05 & df$PG.MaxLFQ > 0, ],
  sample.header = "Run",
  group.header = "Protein.Group",
  id.header = "Precursor.Id",
  quantity.header = "Precursor.Normalised"
)

Thank you in advance!

vdemichev commented 1 week ago

diann_maxlfq implements a simple MaxLFQ algorithm, different from what DIA-NN uses internally. The results will therefore always differ.
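
If the goal is to work with DIA-NN's own quantities rather than recompute them, one option is to pivot the PG.MaxLFQ column of the long report into a run-by-protein-group matrix. The sketch below uses base R on invented toy data and assumes that PG.MaxLFQ holds a single value per run and protein group, repeated on every precursor row; it is not guaranteed to reproduce pg_matrix exactly.

```r
# Toy long-format report: one row per precursor per run, values are invented.
df <- data.frame(
  Run            = c("run1", "run1", "run2", "run1", "run2"),
  Protein.Group  = c("P1", "P1", "P1", "P2", "P2"),
  Precursor.Id   = c("AAK2", "CCR2", "AAK2", "DDK2", "DDK2"),
  Lib.PG.Q.Value = c(0.001, 0.001, 0.001, 0.002, 0.002),
  PG.Q.Value     = c(0.01, 0.01, 0.02, 0.01, 0.30),
  PG.MaxLFQ      = c(1500, 1500, 1800, 900, 950),
  stringsAsFactors = FALSE
)

# Filters taken from the thread above.
keep <- df$Lib.PG.Q.Value <= 0.01 & df$PG.Q.Value <= 0.05 & df$PG.MaxLFQ > 0

# Assumption: PG.MaxLFQ is constant within each (Run, Protein.Group) pair,
# so de-duplicating leaves one row per pair.
pg_long <- unique(df[keep, c("Run", "Protein.Group", "PG.MaxLFQ")])

# Pivot to a wide matrix: rows are protein groups, columns are runs.
# Combinations that did not pass the filters become NA.
pg_wide <- tapply(pg_long$PG.MaxLFQ,
                  list(pg_long$Protein.Group, pg_long$Run),
                  FUN = function(x) x[1])
```

Here `pg_wide` has one row per protein group and one column per run, with NA where a protein group did not pass the filters in that run.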

momo-0521 commented 1 week ago

Thank you. I understand.

Another question is about species-specific precursors. Our samples contain a mixture of human and mouse proteins. When running DIANN 1.9, we used both human and mouse FASTA files and added the options '--species-genes' and '--species-ids'. We would like to exclude precursors that are specific to mouse or shared between both species, and use only human-specific precursors to quantify their associated proteins. With these settings, is the 'PG.MaxLFQ' value calculated from both human-specific and mouse-specific precursors?

Best wishes!

vdemichev commented 1 week ago

It's calculated using all precursors matched to the protein group (Protein.Group column). So in this case you'd want to just discard all entries in the .parquet report with Protein.Ids column string containing 'MOUSE'.
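
A minimal sketch of that filtering step in base R is shown below. The identifiers in the toy Protein.Ids column are made up; the real format depends on the FASTA files and DIA-NN settings, so the 'MOUSE' pattern should be checked against the actual report.

```r
# Toy report rows: identifiers are invented placeholders, not real accessions.
df <- data.frame(
  Precursor.Id = c("PEPA2", "PEPB2", "PEPC3", "PEPD2"),
  Protein.Ids  = c("PROT1_HUMAN",
                   "PROT2_MOUSE",
                   "PROT3_HUMAN;PROT3_MOUSE",
                   "PROT4_HUMAN"),
  stringsAsFactors = FALSE
)

# Flag every row whose Protein.Ids string mentions MOUSE anywhere.
# This catches both mouse-specific and shared human/mouse entries.
is_mouse <- grepl("MOUSE", df$Protein.Ids, fixed = TRUE)

# Keep only the rows with no mouse identifier.
human_only <- df[!is_mouse, ]
```

This only removes rows from the report. Any PG.MaxLFQ values already present were computed before the filtering, from all precursors matched to the protein group.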