Proteogenomics // PG.MaxLFQ vs Genes.MaxLFQ

NicoDrou commented 3 months ago

Dear Vadim,

I have an issue with Genes.MaxLFQ in DIA-NN v1.9. In my data set, the gene IGHG2 corresponds to only one Protein.Group. However, the quantities for PG.MaxLFQ and Genes.MaxLFQ are completely different and I can't explain why.

I have also processed this data the old-fashioned way, with MaxLFQ based on the "Genes" column, and this is what I obtained:

Can you shed some light on what's going on with Genes.MaxLFQ? From my understanding, if the Protein.Group and the Genes are identical, then PG.MaxLFQ and Genes.MaxLFQ should also be the same.

Many thanks in advance. Regards, Nicolas

vdemichev commented 3 months ago

Hi Nicolas,

Can you please share the log and the .parquet report?

Best, Vadim

NicoDrou commented 3 months ago

The parquet is too big to be shared (>400 MB, >600 samples). Instead, I have extracted an example, but I need to give you some extra information :) P01859.csv

We built a custom proteogenomic database from a limited set of serum proteins and the most common single amino acid variants (SAAVs). Therefore, I have multiple entries in my database for the protein P01859 (GN=IGHG2):

ca|NX_P01859-1-CAN|IGHG2_HUMAN Immunoglobulin heavy constant gamma 2 OS=Homo sapiens OX=9606 GN=IGHG2
cv|NX_P01859-1-S257A|IGHG2_HUMAN Immunoglobulin heavy constant gamma 2 OS=Homo sapiens OX=9606 GN=IGHG2
cv|NX_P01859-1-V161M|IGHG2_HUMAN Immunoglobulin heavy constant gamma 2 OS=Homo sapiens OX=9606 GN=IGHG2
cv|NX_P01859-1-P72T|IGHG2_HUMAN Immunoglobulin heavy constant gamma 2 OS=Homo sapiens OX=9606 GN=IGHG2
cv|NX_P01859-1-P72T|IGHG2_HUMAN Immunoglobulin heavy constant gamma 2 OS=Homo sapiens OX=9606 GN=IGHG2
cv|NX_P01859-1-K96E|IGHG2_HUMAN Immunoglobulin heavy constant gamma 2 OS=Homo sapiens OX=9606 GN=IGHG2
cv|NX_P01859-1-V161M|IGHG2_HUMAN Immunoglobulin heavy constant gamma 2 OS=Homo sapiens OX=9606 GN=IGHG2
cv|NX_P01859-1-V188M|IGHG2_HUMAN Immunoglobulin heavy constant gamma 2 OS=Homo sapiens OX=9606 GN=IGHG2
cv|NX_P01859-1-A257S|IGHG2_HUMAN Immunoglobulin heavy constant gamma 2 OS=Homo sapiens OX=9606 GN=IGHG2
cv|NX_P01859-1-V301I|IGHG2_HUMAN Immunoglobulin heavy constant gamma 2 OS=Homo sapiens OX=9606 GN=IGHG2

As a consequence, the Protein.Ids and Protein.Group are built from all these entries. So when a Protein.Ids value contains the -CAN tag, it means the precursor is from the canonical sequence. If the -CAN tag is absent, it means the precursor carries the mutation(s) named in Protein.Ids. The idea is therefore to use the Genes column to group all the different variants and quantify the protein by taking into account all precursors, including the mutant ones.
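To make the tagging rule above concrete, here is a minimal Python sketch of how a Protein.Ids value could be classified. It assumes that the value is a semicolon-separated list and that each entry is either a bare accession such as NX_P01859-1-S257A or a full pipe-delimited header; the function names `variant_labels` and `is_canonical` are made up for illustration and are not part of DIA-NN.

```python
def variant_labels(protein_ids: str) -> list[str]:
    """Return the trailing tag of each entry, e.g. 'CAN' or 'S257A'.

    Assumption: entries are separated by ';' and look like either
    'NX_P01859-1-S257A' or 'cv|NX_P01859-1-S257A|IGHG2_HUMAN'.
    """
    labels = []
    for entry in protein_ids.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # If the full header was kept, the accession is the second field.
        if "|" in entry:
            entry = entry.split("|")[1]
        # The tag is whatever follows the last '-' in the accession.
        labels.append(entry.rsplit("-", 1)[-1])
    return labels


def is_canonical(protein_ids: str) -> bool:
    """True if at least one entry carries the CAN tag."""
    return "CAN" in variant_labels(protein_ids)
```

A precursor shared between the canonical entry and a variant entry would count as canonical here, since the -CAN tag is present.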

My cohort is a longitudinal cohort, and the figures I showed earlier are from Subject 54. For convenience, I have added the columns Subject and Time to the extract of the parquet.

NicoDrou commented 3 months ago

240620_TRP_blood7.3-extended_report.log.zip

vdemichev commented 3 months ago

Thanks for sharing the log, it looks OK (except that mass accuracies need to be fixed, but this would not be the reason for what you observe here).

About the report, what you describe should not happen. Would be great if you could upload .parquet somewhere (or just its subset for the runs you used to produce the column plot with intensities you were showing), without it I am not sure how to approach looking into this issue.

NicoDrou commented 3 months ago

I did share the subset. In case it didn't work, I am dropping it here again: P01859.csv

What do you mean by the mass accuracies needing to be fixed? Where do you see that in the log? I used 10 ppm for MS1 and 20 ppm for MS2. From the PDF report, I am at ~5 ppm. A few runs are between 5 and 10 ppm. Do you suggest I should lower the ppm error?

vdemichev commented 3 months ago

Sorry, not mass accuracies, scan window needs to be fixed. In this case it's inferred separately for different runs which might affect quantification.

Yes, I guess P01859 is sufficient, just looked into it. So what's happening, Genes.MaxLFQ in there is actually equal to Genes.Normalised, and is calculated with Top 1 method basically, not MaxLFQ. So for whatever reason DIA-NN reverted to Top 1 when calculating the quantities for this specific gene. I will try to see why this could have happened.
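For anyone who wants to check their own report for the same pattern, a rough Python sketch follows. It assumes the report rows have been loaded as dictionaries keyed by the column names Genes, Genes.MaxLFQ and Genes.Normalised; the helper name `genes_where_maxlfq_equals_normalised` and the numbers in the example are made up. Equal values only suggest a Top 1 fallback, they do not prove it.

```python
import math
from collections import defaultdict


def genes_where_maxlfq_equals_normalised(rows, rel_tol=1e-6):
    """List genes whose Genes.MaxLFQ matches Genes.Normalised in every row.

    `rows` is an iterable of dicts, e.g. from csv.DictReader. Rows with
    missing or non-numeric values are skipped.
    """
    per_gene = defaultdict(list)
    for row in rows:
        try:
            maxlfq = float(row["Genes.MaxLFQ"])
            normalised = float(row["Genes.Normalised"])
        except (KeyError, TypeError, ValueError):
            continue
        per_gene[row["Genes"]].append((maxlfq, normalised))
    return sorted(
        gene
        for gene, pairs in per_gene.items()
        if all(math.isclose(a, b, rel_tol=rel_tol) for a, b in pairs)
    )


# Made-up numbers, only to show the shape of the input.
example_rows = [
    {"Genes": "GENE_A", "Genes.MaxLFQ": "1000.0", "Genes.Normalised": "1000.0"},
    {"Genes": "GENE_A", "Genes.MaxLFQ": "2500.0", "Genes.Normalised": "2500.0"},
    {"Genes": "GENE_B", "Genes.MaxLFQ": "800.0", "Genes.Normalised": "1200.0"},
]
flagged = genes_where_maxlfq_equals_normalised(example_rows)
```

With a CSV extract, the rows could come from `csv.DictReader(open(path, newline=""))`, provided the extract kept those three columns.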

NicoDrou commented 3 months ago

Thank you for having a look :)

I will try to find a way to share the complete parquet file with you. I also have another question: is there a quality metric dedicated to Genes.MaxLFQ, or should we use PG.MaxLFQ.Quality? I am not much in favor of using the latter because, in my very particular case, the Protein.Group values differ from the Genes. Mutant precursors often have low Quantity.Quality and PG.MaxLFQ.Quality scores. I guess it is because they are found in only some samples. So in my case, a Gene.MaxLFQ.Quality would be ideal :)

vdemichev commented 3 months ago

I guess no need to share the complete one. I am a bit puzzled why this is happening. In general it's kind of OK, but no idea why.

No, no metric for Genes.MaxLFQ.

Best, Vadim

NicoDrou commented 3 months ago

Dear Vadim,

I have created a link: https://surfdrive.surf.nl/files/index.php/s/ai0sdHHfQ3LdoRR I have noticed that Genes.Normalised has not been copied into Genes.MaxLFQ for all proteins, only for some. I am reprocessing the data to see if it changes anything.

NicoDrou commented 3 months ago

Dear Vadim, I reprocessed the data the same way and, of course, got the same results. Have you made any progress in understanding what could have gone wrong?

Many thanks again for your help. Nicolas

vdemichev commented 3 months ago

Hi Nicolas,

This is on my todo list, to take a look at what's happening. In general it's not a bug, DIA-NN is expected to switch to Top 1 instead of MaxLFQ in certain cases, I am just not sure why it is doing it here.

Best, Vadim

vdemichev / DiaNN

Proteogenomics // PG.MaxLFQ vs Genes.MaxLFQ #1125