vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.
Other
283 stars 53 forks source link

How to use Protein.ids in downstream analysis #1224

Open Gambrian opened 1 month ago

Gambrian commented 1 month ago

Dear Vadim

In the diann results, most of the Protein.ids are uniprot accession ids (from the fasta file I used), but some are accession ids combined by ";" . Some are combined with cRAP-***, I think this is because they are searched in both the reference fasta file and camprotR_240512_cRAP_20190401_full_tags.fasta. Can I delete these results? Others are just two or more accession ids combined with ";". Why is this? I think these proteins cannot find the master protein because they have the same matching peptides? Can I use the first id instead in subsequent analysis?

Best

vdemichev commented 4 weeks ago

Hi Gambrian,

cRAP is the contaminant proteins ('Contaminant' option was selected in DIA-NN). Yes, it's OK to delete anything with cRAP in Protein.Ids.

Why is this?

Protein.Ids column, in contrast to Protein.Group, lists all proteins from which the peptide can originate.

Can I use the first id instead in subsequent analysis?

I suggest to use Protein.Group (proteins inferred using the maximum parsimony principle) or Genes (the corresponding genes) columns. In each case, if still multiple ids listed with ;, then can best is (i) either to work with such 'groups' as is, or, (ii) if not possible for particular analysis, just discard such (rare) ids.

Best, Vadim

Gambrian commented 4 weeks ago

Dear Vadim

You mean Protein.Ids are all members of Protein.Group? Maybe I understood it correctly , but in fact, maybe in some projects this phenomenon is not "rare" but often happens, this is a result of a rat lung astral project, I used all uniport fasta (Swiss-Prot + TrEMBL) as reference fasta, 308334 out of 893821 direct results (just diann_load("report.tsv")) and 3720/9108 abundance matrix row names (diann_maxlfq(df, group.header="Protein.Group", id.header = "Precursor.Id", quantify.header = "Precursor.Normalised")) have ";",which makes me very confused when annotating them (GO KEGG) functions . I think the main reason for this is that there are indeed many similar proteins in the reference fasta. In another human data (only using Swiss-Prot fasta), this situation is indeed very rare, but it is indeed unavoidable in the study of some species. What do you recommend in this case?

In addition, in 1.9.2, I saw that you have made improvements to the normalization part. Does this mean that the results of diann 1.9.2 no longer need to be normalized to the total amount or use other normalization methods, and differential expression analysis can be performed directly?

image

By the way, the diann version I used is 1.9.1, I will test the performance of 1.9.2. Thanks again for the great software and your quick response

Best

vdemichev commented 4 weeks ago

but often happens

Are you using the default 'heuristic' protein inference?

What do you recommend in this case?

Make sure 'heuristic' option is used, also perform the analysis on the genes level.

In another human data (only using Swiss-Prot fasta), this situation is indeed very rare

Yes, this is expected. But on genes level it's the same.

Does this mean that the results of diann 1.9.2 no longer need to be normalized to the total amount

DIA-NN normalisation has always been quite good, if you re-normalise by the total amount, it will in most cases get worse. In 1.9.2 the normalisation is 'more precise'.

Gambrian commented 4 weeks ago

I used "heuristic" protein inference . Since I am still new to proteomics, I read through diann's user guide carefully and used most of the parameters you recommended, only the precursor parameters were suitable for our experiment. Here are the specific parameters for my two search processes. Did I make any mistakes in setting any parameters?

Thanks image

vdemichev commented 3 weeks ago

Looks good, but please fix the mass accuracies and the scan window. Precursor charge range 1 - 6 is usually a bad idea, 5+ precursors are so rare they are not worth searching, even 4 is often OK to ignore, same applies to 1.

Gambrian commented 3 weeks ago

Yes, in most cases, the distribution of charges is like this figure image But if my parameters look good, and the results of some studies show that there are many Protein.Group id with ";", when I want to do some functional annotation and differential expression, what do you suggest in this case?

vdemichev commented 3 weeks ago

when I want to do some functional annotation and differential expression, what do you suggest in this case?

You specifically want to do this at sequence ID level and not gene level? For example, GO enrichment is inherently gene-level.

Gambrian commented 3 weeks ago

Maybe I'm wrong but I thought most of the time the sequence IDs match the gene IDs so i think they will have many gene id like "gene_id1;gene_id2" and how do I determine which is the unique gene ID for a certain the Protein.Group with ";" Do you think these id in one Protein.Group will have same gene id (this is the real meaning of Protein inference:Genes)?

vdemichev commented 3 weeks ago

most of the time the sequence IDs match the gene IDs

Canonical human uniprot proteome is ~20k genes and ~70k sequences.

they will have many gene id like "gene_id1;gene_id2"

Very rarely.

Gambrian commented 3 weeks ago

Oh thanks, I will check and report back, the reason I annotated this by sequence id is that I downloaded fasta via uniport and got the annotation information via uniport, and all uniport results are based on accession id. Would it be a good idea to use entrez id if I merged them by gene id?

vdemichev commented 3 weeks ago

There's already 'Genes' and 'Genes.MaxLFQ' columns in DIA-NN report, so can use as is :)

Gambrian commented 3 weeks ago

If I still want to do functional annotation at sequence ID level (I don't really want to do that, just for understanding and discussion), should I set Protein Inference: Protein Name (from Fasta)?

vdemichev commented 3 weeks ago

In this case, best to set to isoform IDs.

dh2305 commented 3 weeks ago

I hope it is ok to hijack this thread but as Vadim pointed out to set the mass accuracies and scan windows. Why is this important if the software sets it itself or seems to readjust? Also with our TIMStof data, those values stay quite sameish but are not the same dataset to dataset and sample to sample. Shall we just use median numbers based on experience? are there good numbers for mass acc. and scanwindows on a bruker timstofHT?

also on this matter. Why is it important to do the lib generation and main search in samples in two steps? I've seen you mentioning that peptidoform scoring is off if not but that is only relevant for PTM or isoform searches and apparently this has been fixed with 1.9.2. Why is it still recommended to search in two steps?

And I have not observed big differences, or hardly any at all, if I separate the lib gen from fasta and the main search and in this case either put protein inference in the second step for the search in samples either again on gene (as for the fasta lib generation) or off (based on a comment in the documentation here). What is recommended here?

Thank you!

vdemichev commented 3 weeks ago

Please see https://github.com/vdemichev/DiaNN?tab=readme-ov-file#changing-default-settings for an overview.

Why is this important if the software sets it itself or seems to readjust?

These values must be identical for all runs for best quantification.

Also with our TIMStof data, those values stay quite sameish but are not the same dataset to dataset and sample to sample

On any timsTOF, mass accuracies can be safely set to 15ppm (if you set outside 10ppm-15ppm range, DIA-NN will print a warning). Scan window, on any instrument, should approx match the peak width (the whole, not FWHM) - but here it's very simple, you just see what DIA-NN sets automatically and then set this number.

Why is it important to do the lib generation and main search in samples in two steps?

The mode with combining these steps is simply not supported. Meaning we don't validate it and there are therefore no guarantees that it works correctly. So there's a 'strongly not recommended' warning in DIA-NN about this.

dh2305 commented 3 weeks ago

Thank you for your reply Vadim. I believe I speak for all user that this active conversation is truly great and such support is not seen in all MS software solutions.

1) Regarding mass accuracies and scan window. As the automatically set numbers vary between samples a little bit (always in the 10 to 15 range) I just take a mean and set this from now on until forever for this machine to fix it for quantification?

2) If I set protein inference to gene in the first step when I generate the library from the fasta file, do I set it to off in the search in the samples (step 2) or again to gene (or whatever was used in step 1 for lib gen)?

3) I have observed in 1.9.2 that in step 2 for phospho I need to provide a fast otherwise I will not get a site 90 and 99 output table? Is this by design? Shouldn't the information in the lib generated from the fasta suffice?

4) on the topic of phospho 4.1) Is there any reason not to set the search to ST via --var-mod UniMod:21,79.966331,ST? what is your stance on including Y (FDR/Y abundance situation and diff. in ID from MS2)? 4.2) We have seen with DIANN and other DDA and DIA software that many phosphopeptides are oxidised (>=10%). I acknowledge the search space and speed issue, but is there no loss of sites/peptides if M(Ox) is not used? I know M(Ox) is not recommended if I do not want to know spec. what the M(Ox) is doing biologically but I fear losses here. 4.3) A 2023 benchmark study utilizing synth. phospho peptides that has been referred to before has proposed that 0.75/0.99 site probability filters in Spectronaut (as well as MQ or Fragpipe) are comparable to 0.01/0.51 in DIA-NN. I know that there is no consensus on site prob. cut offs and confidence either on peptide or site level (sample specific or overall) is no trivial and nearly a philosophical question but what is your thoughts on this? 4.4) I have measured the same phospho samples twice each, same vial replicate, once in DDA and DIA. Fragpipe/MSFragger for DDA with the LFQ phospho workflow and Msboosts, Perculator and MBR on for 3 tech. replicates gives me around 12000 sites >75% site loc prob. (9900 no MBR and 9000 if 3/3 samples and site loc. pro >0.9 ) and 18000 phospho peptides for 100 µg (µphos workflow) of peptide input. DIA-NN at 90% for the same samples measured in DIA-PASEF (optimized with pydiAID) results in 6000 sites with MBR (4400 sites found in 3/3, 99% cutoff) and 11500 phosphopeptides (interestingly the same amount of overall peptides are identified in DIA-NN and MS fragger for DIA and DDA respectively). One would naively suppose that DIA would yield more IDs. Are those numbers reflecting the thorough, conservative and thought-through filtering and FDR-control approach of DIA-NN and your lab to report only confident sites, peptides, proteins etc.?

Thank you!

vdemichev commented 3 weeks ago
  1. That's OK. On timsTOF can just always set to 15ppm (we do that).
  2. Keep it same for both steps.
  3. Yes, because those matrices are specific to [sites in a protein], i.e. protein annotation is absolutely necessary. Normally, FASTA should always be specified. 4.1. ST is quite OK indeed I think. But 1.9.2 in ultra-fast mode is basically fine with large search spaces, so STY works just fine too. 4.2. You can try both and then compare the numbers of unique stripped sequences you get at peptidoform q-value < 1% - that would be the measure of the gain from enabling M(ox). About losses: we lose A LOT more just because sensitivity is just not enough, losses due to oxidation are negligible in comparison. 4.3. That was before 1.9, which has a different localisation algorithm. But in general it might be that DIA-NN is still quite conservative. On most good datasets and with MBR you can keep it quite high, that's why DIA-NN generates the phospho matrices at 0.9 and 0.99 cutoffs. 4.4. Which DIA-NN version? 1.9.2 should be very good at searching phosho.

100 µg (µphos workflow) of peptide input

This can overload timsTOF and yield bad ID numbers. What was the estimated amount of peptides (after enrichment) per injection and what timsTOF model it is?

dh2305 commented 3 weeks ago
  1. Thank you. Then we will just set mass ACC on MS1 on MS2 to 15. And the scan window radius to 8 as this is what we saw so far most often. This is fine?

4.1 We are choosing ST only as Y are quite different in their MS2 behaviour and so low abundant (naturally and/or regarding enrichment). We supposed that we run into a FDR problem if Y is included for phospho and I think this has been discussed in the phospho community. Do you have a preference for "most solid" phospho data if to include Y in the searches or not?

4.2 Yes, I will try with and without M(Ox). In one dataset with DIA-NN 1.9.1 and M(Ox) on we have observed that 10% of the resulting phosphopeptides have an M(Ox). 4.2.1 On this matter: N-term M-Exc. should always be turned on, also for phospho?

4.3 We only have so far measured techn. replicates of cell lysates. I know from experience with Fragpipe that the restoring and validation algorithms, and MBR, highly benefit from replicate numbers but also from biological variability. I guess this is the same for any machine learning/matrice size depending software such as DIA-NN? So I am excited to see what DIA-NN site ID numbers I will get in actual datasets with treatments/timepoints etc.

4.4. The numbers are with DIA-NN 1.9.2. I am just wondering how those numbers compare to your experience as DDAPasef with MSfragger seem to yield 2x more phosphopeptides/sites than DIAPasef and DIA-NN. I know comparison at different loc. prob. cut-offs is difficult and I have no idea what the benchmark or real number is or if the sites reported for DDA are any (more) real. Naively I would have guessed DIA and DIA-NN should yield more based on missing values/low abundant phospho peptides. Unfortunately with the site loc. prob. I have been unable so far to extract custom, site specific (and not overall lib. PTM confidence) filtered phospho tables at 0.75 etc. I have also not been able to successfully utilize the MSproteomics site reports python package to extract custom site and phosphopeptide tables from the raw data. Anyway, ultimately I am happy that both MSfragger for DDA and DIA-NN for DIA is extremely fast and I don't have to rely on MaxQuant. How 10000 sites in DDA and fragger as well as 5000 sites with DIA and DIA-NN compare I don't know and frankly don't care number speaking wise. I am interested if either way is better to catch biology.

Regarding peptides - this was used for the DDA acquisition as well (enriched twice and pooled after elution in the same vial so I can inject the same sample twice - once for DIA and once for DDA for comparison). 100 µg of starting protein material were digested and subsequently enriched for phosphopeptides with Ti-MOAC (µphos). I have no data on peptide concentration after enrichment and before injection but both the BPC and TIC are absolutely fine. An old version of this protocol (EasyPhos) utilizes 1 mg of starting material with the same amount of beads. Other recent ZrIMAC protocols (MagResyn HP) are also based upon 100-200 µg of digest input. We are using the HT upgraded TIMStof.

Thank you!

vdemichev commented 3 weeks ago
  1. Yes. But you can always adjust scan window depending on experiment (like if you use very different acquisition settings, makes sense to adjust). 4.1. There will be no FDR problem with DIA-NN. If it says it's Y with confidence, then it's Y :) I don't have specific preference. 4.2.

we have observed that 10% of the resulting phosphopeptides have an M(Ox).

More important is how many more unique sequences it yields. 4.2.1. Yes. 4.3. Yes! Btw, you can try diaTracer for phospho, wonder how it compares, i.e. if numbers also seem on low side with it? 4.4. TICs fine means probably OK.

dh2305 commented 3 weeks ago

Thank you again for your fast and insightful responses.

  1. I will explore diaTracer and report back. Do you have any preferences or recommended workflows inside MSfragger to allow for a "fair" representative comparison?

  2. Do you have any further comment on your experience comparing the performance of DDA/Fragpipe with DIA/DIA-NN for phospho?

vdemichev commented 3 weeks ago
  1. Don't know, I am yet to try it :)
  2. DIA-NN should be good :) If diaTracer shows much higher numbers than 1.9.2, I would then be grateful if you could share the .d folder (single run, zipped).
Gambrian commented 1 week ago

Dear Vadim

I spent some time testing the performance of 1.9.2 and the impact on protein inference( on 1.9.1), and wanted to report back and get some advice.

  1. I tested 1.9.2 and 1.9.1 on 6 studies (5 astral and 1 dia-4d), and in most studies, 1.9.2 identified slightly more proteins and took slightly less time (by the way, the spectral library size was reduced a lot, half of that of 1.9.1)
  2. I compared the quantification of 1.9.2 and 1.9.1, and the correlation of most proteins was higher than 0.8, but in some samples with relatively low protein content, such as paraffin sections and swab samples, and in some studies without good reference proteome data, the correlation of proteins may be lower. To further explore whether low-quality protein quantification results have lower correlation, I picked proteins with CV in the top 25%, and they did have higher correlation. But ① I still want to know how can I be sure that I have quantified my proteins well?
  3. Regarding the problem I mentioned before( https://github.com/vdemichev/DiaNN/issues/1224#issuecomment-2434174497 ), in some studies, about 1/3 of the proteome IDs are accession IDs composed of ";". And this is what I get by setting protein inference to "Protein Name (from fasta)", "Isoform ID", and "Gene"
diann_result_gene <- diann_load(file.path(work_dir_gene,"report.tsv"))
abundance_df_gene <- diann_maxlfq(diann_result_gene, group.header="Protein.Group", id.header = "Precursor.Id", quantity.header = "Precursor.Normalised")

diann_result_isoform <- diann_load(file.path(work_dir_isoform,"report.tsv"))
abundance_df_isoform <- diann_maxlfq(diann_result_isoform, group.header="Protein.Group", id.header = "Precursor.Id", quantity.header = "Precursor.Normalised")

diann_result_protein <- diann_load(file.path(work_dir_protein,"report.tsv"))
abundance_df_protein <- diann_maxlfq(diann_result_protein, group.header="Protein.Group", id.header = "Precursor.Id", quantity.header = "Precursor.Normalised")

> ## gene
> str_detect(diann_result_gene$Protein.Group,";") %>% sum() 
[1] 408760
> str_detect(diann_result_gene$Protein.Group,"cRAP-") %>% sum() 
[1] 1546
> nrow(diann_result_gene)
[1] 1297780
> 
> str_detect(abundance_df_gene %>% rownames(),";") %>% sum() 
[1] 3130
> str_detect(rownames(abundance_df_gene),"cRAP-") %>% sum()
[1] 25
> nrow(abundance_df_gene)
[1] 8062
> 
> str_detect(diann_result_gene$Genes,";") %>% sum()
[1] 24190
> 
> ## isoform
> str_detect(diann_result_isoform$Protein.Group,";") %>% sum() 
[1] 376122
> str_detect(diann_result_isoform$Protein.Group,"cRAP-") %>% sum() 
[1] 3345
> nrow(diann_result_isoform)
[1] 1297780
> 
> str_detect(abundance_df_isoform %>% rownames(),";") %>% sum() 
[1] 3091
> str_detect(rownames(abundance_df_isoform),"cRAP-") %>% sum()
[1] 29
> nrow(abundance_df_isoform)
[1] 8170
> 
> str_detect(diann_result_isoform$Genes,";") %>% sum()
[1] 24574
> 
> ## protein name
> str_detect(diann_result_protein$Protein.Group,";") %>% sum() 
[1] 376122
> str_detect(diann_result_protein$Protein.Group,"cRAP-") %>% sum() 
[1] 3345
> nrow(diann_result_protein)
[1] 1297780
> 
> str_detect(abundance_df_protein %>% rownames(),";") %>% sum() 
[1] 3091
> str_detect(rownames(abundance_df_protein),"cRAP-") %>% sum()
[1] 29
> nrow(abundance_df_protein)
[1] 8170
> 
> str_detect(diann_result_protein$Genes,";") %>% sum()
[1] 24574

So setting protein inference to "Isoform ID" does reduce this, but the effect is not significant. As you predicted, most results do not have ";" in the gene column, so when there is no ";" in the gene column of this row, I want to keep the first access ID, and delete the row when there is ";" in the gene column to ensure that the results are annotated as much as possible using the uniprot functional annotation. ② Do you think this is the right way? ③ At the same time, there is another small question. Since diann has normalized the results, should I not use diann_maxlfq to obtain the expression matrix, but use diann_matrix to obtain the results directly?

Best