vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

Custom spectral library from skyline #1214

Open cutleraging opened 1 month ago

cutleraging commented 1 month ago

Hi Vadim and team,

I have generated a spectral library in Skyline and exported it as a report to be used in DIA-NN. I am analyzing histone PTMs and am interested in specific histone peptidoforms, so this is essentially an effort to reduce the search space. Because I want to quantify the peptidoforms, I have set the proteins to be the peptidoforms. I have attached the report generated from Skyline, which I use as the input for DIA-NN, as well as the log, which contains the command used to run DIA-NN, and the results. When identifying histone peptidoforms, it is quite important that retention times be taken into account, as well as ion mobility (especially for isobaric forms). Some questions have come up during this analysis that I would appreciate your help with.

skyline spectral library.csv

report.log.txt

report.csv

1) When I search using this custom spectral library, I want to make sure that the retention times are being taken into account when searching for the histone peptidoforms. However, in the results, I can see that the identifications are not at the expected retention time (comparing to retention times in spectral library). I see that in the spectral library that DiaNN built, there are values in the Tr_recalibrated column. How can I ensure that DiaNN takes into account the RT in the spectral library? Is there a setting for an RT window or something like this?

2) Even though I include IonMobility column in the spectral library, in the spectral library that DiaNN built, all the values are 0. Why is this?

3) The LibraryIntensity that I provided in the spectral library was the MS/MS peak intensity of the corresponding product. I see these are all values between 0-1. Is this normal?

4) As I mentioned, I am interested in quantifying at the modified peptide level. In my spectral library, I provide the ModifiedPeptide and then in the ProteinName column, I provide a corresponding name for the modified peptide (e.g. H1.5-K33[un];K45[un];K51[un]). However, in the results, the Protein.Names column only contains a single name. How can I have it so that the Protein.Names column in the results corresponds to the ProteinName column in the spectral library? I want to do this so that I can use the values in the PG.Quantity column.

5) I have been varying the q-value threshold of my searches with the spectral library (e.g. 50%) to see how this affects the results. I find that when relaxing it, I get many more of the IDs that I expect, although at a cost: some of these new IDs are not at the expected retention time. Is there a way for me to determine the optimal q-value for my situation? Also, what's the difference between the precursor-level q-value and the run-specific protein q-value?

6) Some of the histone peptidoforms are isobaric, and don't contain a unique fragment (although they do differ if looking at combinations of fragments). Should "No shared spectra" be enabled?

7) My samples are both label free single cells and bulk. Should the bulk samples be included in the searches with MBR enabled to boost the IDs in the single cells?

Thanks a lot! Ronnie

vdemichev commented 1 month ago

Hi Ronnie,

DIA-NN calculates FDR roughly as the number of decoy hits divided by the number of target hits. Since you have just 70 target precursors searched, the estimated FDR normally will not go below 1/70 (about 1.4%), i.e. it will never reach 1%. So this is not going to work as is.
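To make the arithmetic concrete, here is a minimal sketch of the rough decoy/target ratio described above. The function name `estimate_fdr` and the 50,000-target background size are hypothetical, and DIA-NN's actual q-value computation is more involved; this only illustrates why a 70-precursor library cannot resolve a 1% threshold.

```python
def estimate_fdr(decoy_hits: int, target_hits: int) -> float:
    """Rough FDR estimate: decoy hits divided by target hits."""
    if target_hits <= 0:
        raise ValueError("target_hits must be positive")
    return decoy_hits / target_hits

# Smallest nonzero estimate for a 70-precursor library: a single decoy hit.
floor_70 = estimate_fdr(1, 70)
print(f"{floor_70:.4f}")  # 0.0143, i.e. above the 1% threshold

# With a background proteome (hypothetical 50,000 targets) the steps are far finer.
floor_bg = estimate_fdr(1, 50_000)
print(f"{floor_bg:.5f}")  # 0.00002
```

Under this simplification, one decoy hit is already enough to push a 70-target search past 1%, which is why searching a background proteome alongside the histone peptidoforms helps.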

Suggestion:

  1. "I can see that the identifications are not at the expected retention time (comparing to retention times in spectral library)." DIA-NN always automatically aligns the library RTs to the RTs in the specific DIA experiment. The scales do not need to match, i.e. you can use a library based on a 120-min gradient to analyse data acquired with a 3-min gradient.
  2. All calibrations simply fail because too few peptides are detected, i.e. you need to search the background proteome too.
  3. Can be any scale here, does not matter for DIA-NN.
  4. Please add a Protein.Ids column to the spectral library and the info there will correspond to the Protein.Ids column in the DIA-NN report.
  5. Any value in the range 0.01 - 0.05 is fine for an MBR search, provided the final output is filtered at Q.Value <= 0.01 and, in your case, Peptidoform.Q.Value <= 0.01 and potentially PEP <= 0.01. Protein q-values indicate the proportion of falsely identified proteins (not precursors), i.e. filtering just based on a precursor q-value will result in a higher FDR for proteins than the filter threshold used, hence the need for a specific protein q-value.
  6. No shared spectra should always be enabled. If there are no specific fragments, then how do you (or Skyline) distinguish between them for the inclusion in the library?
  7. Yes, definitely.
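
The output filtering from point 5 can be sketched as follows. This is a minimal illustration using only the Python standard library; the column names `Q.Value`, `Peptidoform.Q.Value` and `PEP` are the ones named above, while the function name `filter_report`, the tab-separated layout and the sample rows are assumptions made up for the example.

```python
import csv
import io

def filter_report(rows, q=0.01, peptidoform_q=0.01, pep=0.01):
    """Keep rows that pass the thresholds; pass pep=None to skip the PEP filter."""
    kept = []
    for row in rows:
        if float(row["Q.Value"]) > q:
            continue
        if float(row["Peptidoform.Q.Value"]) > peptidoform_q:
            continue
        if pep is not None and float(row["PEP"]) > pep:
            continue
        kept.append(row)
    return kept

# Made-up rows for illustration only; a real report has many more columns.
sample = (
    "Precursor.Id\tQ.Value\tPeptidoform.Q.Value\tPEP\n"
    "PEPTIDEA2\t0.001\t0.002\t0.003\n"
    "PEPTIDEB2\t0.02\t0.002\t0.003\n"
    "PEPTIDEC3\t0.001\t0.04\t0.003\n"
    "PEPTIDED2\t0.001\t0.002\t0.2\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

passed = filter_report(rows)
print([r["Precursor.Id"] for r in passed])  # ['PEPTIDEA2']
```

To apply this to a real report, replace the `io.StringIO(sample)` argument with an opened report file and adjust the delimiter to match how the report was written.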

Best, Vadim

cutleraging commented 3 weeks ago

Hi Vadim,

I tried to implement your advice to improve the searches. My samples are whole cell lysate so there is a good amount of background. Here is what I did...

  1. Created a spectral library of the background proteome by searching the samples in FragPipe

    • Removed entries corresponding to histone peptides (file: library.csv)
  2. Combined this with my spectral library of histone peptidoforms

    • Changed the retention time values to be on a scale of 0-100 (to match the background proteome spectral library).
    • Changed the peptidoform names to ProteinID
    • Added proteotypic information for histone peptidoforms - assigned as 1 - is this correct to do? (file: histone + background proteome - spectral library.csv)
  3. Searched this with the following settings

    • Now including 2 additional modifications that were found in background proteome
    • FDR=0.01
    • Is the library header correct here? (file: report.log.txt)
  4. Which resulted in this (removed some rows to fit 25MB file size) (file: report-subset.csv)

With this, I was only able to detect 11 of the 69 histone peptidoforms. Any ideas why this could be? Ideally, I want DIA-NN to quantify >90% of the histone peptidoforms, which I have already found manually in Skyline.

Some additional questions

Thanks so much, Ronnie

vdemichev commented 3 weeks ago

Hi Ronnie,

Changed the retention time values to be on a scale of 0-100 (to match the background proteome spectral library)

If this was done incorrectly, it will severely reduce IDs of modified histone peptides.

Added proteotypic information for histone peptidoforms - assigned as 1 - is this correct to do?

Is there any purpose in using the peptidoform ID as protein ID instead of the real protein ID? It makes sense that it causes those warnings you mentioned printed by DIA-NN.

I suspect that some of the histone peptidoforms are not being found because of how the retention time window is being set. Is it possible to manually set this, say to something like 5 min? Is this the --window parameter?

Indeed, that would make sense. Try starting with --im-window 0.1 and --rt-window set to a fifth of the gradient length, and then see if increasing either further helps. Note that these settings will only be necessary for the first pass of MBR, so it is better to do the first pass separately and then just use the refined library with normal settings. But most importantly, please try to align the library RTs and IMs of the histone peptides with those of the rest of the peptides using something like loess, based, for example, on the search with wide IM and RT windows.
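
As a rough illustration of the loess-style alignment mentioned above, the sketch below fits a local linear regression with tricube weights (a basic LOESS variant without robustness iterations) in pure Python. The function name `loess_predict`, the `frac` parameter and all numbers are hypothetical. It assumes the anchors are background peptides with an observed RT in minutes and a library RT on the 0-100 scale, and that the histone peptides' observed RTs are then mapped onto the library scale. The same approach would apply to IM values.

```python
def loess_predict(x_anchor, y_anchor, x_new, frac=0.5):
    """Predict y at x_new by local linear regression with tricube weights."""
    n = len(x_anchor)
    if n < 2 or n != len(y_anchor):
        raise ValueError("need at least two matched anchor points")
    k = max(2, min(n, round(frac * n)))  # anchors used per local fit
    predictions = []
    for x0 in x_new:
        dists = sorted(abs(x - x0) for x in x_anchor)
        # Bandwidth just beyond the k-th nearest anchor, so it keeps a nonzero weight.
        h = dists[k - 1] * (1 + 1e-9) + 1e-12
        w = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in x_anchor]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, x_anchor)) / sw
        my = sum(wi * y for wi, y in zip(w, y_anchor)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, x_anchor))
        sxy = sum(wi * (x - mx) * (y - my)
                  for wi, x, y in zip(w, x_anchor, y_anchor))
        slope = sxy / sxx if sxx > 0 else 0.0  # degenerate case: use the local mean
        predictions.append(my + slope * (x0 - mx))
    return predictions

# Made-up anchors: background peptides found in the wide-window search.
observed_rt_min = [10, 20, 30, 40, 50, 60]   # observed RT, minutes
library_rt = [20, 35, 50, 65, 80, 95]        # library RT, 0-100 scale

# Made-up histone peptide RTs (minutes) to place on the library scale.
histone_rt_min = [25, 35]
aligned = loess_predict(observed_rt_min, library_rt, histone_rt_min)
print([round(v, 2) for v in aligned])  # [42.5, 57.5]
```

With real data the relationship is usually not a straight line, which is where the local fit matters. An established implementation such as the `lowess` function in statsmodels may be preferable to this hand-rolled version.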

I also include negative control (empty wells) in the search. Will this hurt performance in any way?

It might hurt quantification and normalisation 'a tiny bit'.

Best, Vadim