statisticalbiotechnology / quandenser

QUANtification by Distillation for ENhanced Signals with Error Regulation
Apache License 2.0

Understanding feature groups output #8

Closed andrewjmc closed 4 years ago

andrewjmc commented 4 years ago

I couldn't find a simple description of the feature groups output. I see blocks of rows separated by blank lines, like:

0 | 760.2199 | 1 | 7.449329 | 313840.7 | 0;0
0 | 760.2197 | 1 | 7.588816 | 439341.5 | 0;0.0185841322
0 | 760.2199 | 1 | 7.771616 | 174641.2 | 0;0.0204743743
0 | 760.2198 | 1 | 8.263292 | 260254.3 | 0;0.0293297768
0 | 760.2203 | 1 | 8.505181 | 527822.9 | 0;0.0179446936
0 | 760.2204 | 1 | 8.968894 | 105197.4 | 0;0.0318185091
1 | 760.22 | 1 | 7.190035 | 135357.4 | 0;0.00236356258
1 | 760.2203 | 1 | 7.93605 | 131749 | 0;0.0151944757
2 | 760.2197 | 1 | 8.802806 | 96599.77 | 0;0.00802779198
3 | 760.2202 | 1 | 7.846471 | 173482 | 0;0.0148342848
4 | 760.2202 | 1 | 6.97418 | 228263.7 | 0;0.0267062187
4 | 760.22 | 1 | 8.484617 | 273778.5 | 0;0.0189425945

I assume each block is a feature matched between samples. Does each row correspond to a single identification of the feature on MS1?

The first column is clearly the sample number, the second is the m/z value, the fourth is retention time (calibrated or uncalibrated?), and the fifth, I presume, is intensity. I am unsure of the meaning of the third and final columns.

If I've missed documentation, please point me in the right direction!

Thanks,

Andrew

MatthewThe commented 4 years ago

The documentation is indeed lacking at the moment; I'll try to add some soon.

Each block is indeed a group of matched features. Each row is structured as: file_id <tab> prec m/z <tab> charge <tab> rtime_uncalibrated <tab> intensity <tab> MS2_spectrum1;matching_error_probability,MS2_spectrum2;matching_error_probability

If MS2_spectrum=0, as is the case in your example, no MS2 spectrum was available/assigned to this group of MS1 features. Multiple MS2 spectra can be assigned to a single group of features, which often happens when there are other peptide species in the isolation window (that is, when the spectra are potentially chimeric).
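For readers who want to load this output programmatically, here is a minimal Python sketch of a parser for the row layout described above. It assumes tab-separated fields, comma-separated `spectrum;probability` pairs in the last column, and blank lines between groups. The names `FeatureRow`, `parse_row` and `read_feature_groups` are hypothetical and not part of quandenser.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class FeatureRow:
    file_id: int
    prec_mz: float
    charge: int
    rtime_uncalibrated: float
    intensity: float
    # (MS2 spectrum id, matching error probability) pairs; id 0 means no MS2
    ms2_links: List[Tuple[int, float]]


def parse_row(line: str) -> FeatureRow:
    """Parse one tab-separated row (layout assumed from the description above)."""
    file_id, mz, charge, rt, intensity, links = line.rstrip("\n").split("\t")
    ms2_links = []
    for link in links.split(","):
        if not link:
            continue  # tolerate a trailing comma, should one ever appear
        spec_id, prob = link.split(";")
        ms2_links.append((int(spec_id), float(prob)))
    return FeatureRow(int(file_id), float(mz), int(charge),
                      float(rt), float(intensity), ms2_links)


def read_feature_groups(lines: Iterable[str]) -> Iterator[List[FeatureRow]]:
    """Yield one list of rows per blank-line-separated block."""
    group: List[FeatureRow] = []
    for line in lines:
        if line.strip():
            group.append(parse_row(line))
        elif group:
            yield group
            group = []
    if group:  # last block may not be followed by a blank line
        yield group


# Two rows taken from the example in this thread, rewritten with tabs.
example = [
    "0\t760.2199\t1\t7.449329\t313840.7\t0;0\n",
    "1\t760.22\t1\t7.190035\t135357.4\t0;0.00236356258\n",
    "\n",
]
groups = list(read_feature_groups(example))
```

Passing an open file handle instead of `example` should work the same way, since the function only iterates over lines.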

andrewjmc commented 4 years ago

Very helpful, thanks for the quick response. To flesh it out further...

1) There are often multiple rows per feature and RAW file. Do these correspond to multiple (consecutive?) MS1 scans, or to multiple MS2s?

2) If the rows correspond to MS2s, how are MS1 features that were never fragmented handled? (I will still be interested in MS1 features that were unfragmented across all samples, as they could be targeted in future runs.)

3) The MS2 spectrum number(s): is this the number of the corresponding MS2 consensus spectrum?

4) Why is there a matching error probability for each MS2, and also when there is no MS2? I note that where there are two MS2s for a row, the error probabilities are sometimes identical to many decimal places and sometimes distinct. Is this to be expected?

Thanks again. My excitement for this tool is growing!

Andrew

MatthewThe commented 4 years ago
  1. Mostly, this is because the feature finder sometimes breaks up a peak into several MS1 features, for example when the chromatographic shape is not very clear. In some cases, it can also be the result of the clustering algorithm accidentally grouping together multiple peptide species.

  2. The rows actually correspond to MS1 features; the last column just lists the MS2 spectra that are most likely associated with each feature. MS1 features that were never fragmented can be recognized by having only the MS2 spectrum identifier 0, as in your original example.

  3. Yes, that's correct. You should be able to find this identifier in the consensus spectrum file.

  4. This part is rather complicated, but I'll try my best to explain it. The matching error probability listed for an MS2 spectrum actually refers to the feature-feature match error probability. The idea is that if an MS1 feature in run_1 was matched to an MS1 feature in run_2, and only the feature in run_1 had an MS2 spectrum, we associate this MS2 spectrum with the feature in run_2 as well, with an error probability equal to the probability that the two MS1 features were incorrectly matched. At the same time, we associate the MS2 spectrum with the feature in run_1 with error probability 0 (this is technically incorrect, but as of now, we do not have a good way to estimate this error rate). This also explains why some of the probabilities are exactly the same, since they may derive from the same set of matched MS1 features. If this is still unclear, I can draw you a diagram that would hopefully make things a bit clearer.
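Points 2 and 4 above can be illustrated with a small Python sketch. This is a toy model of the behaviour as described in this thread, not quandenser's actual implementation: the function names, the run labels, the spectrum id 42 and the error value 0.015 are all made up for illustration.

```python
from typing import Dict, List, Tuple

Link = Tuple[int, float]  # (MS2 spectrum id, matching error probability)


def is_unfragmented(group_links: List[List[Link]]) -> bool:
    """True if every row in a feature group only carries spectrum id 0.

    group_links holds one list of (spectrum id, probability) pairs per row.
    """
    return all(spec_id == 0 for links in group_links for spec_id, _ in links)


def propagate_ms2(spectrum_id: int, source_run: str,
                  match_error_to_source: Dict[str, float]) -> Dict[str, Link]:
    """Toy version of the association rule described in point 4.

    The run that produced the MS2 spectrum gets error probability 0; every
    other run inherits the spectrum with its feature-feature match error
    probability relative to the source run's feature.
    """
    links = {source_run: (spectrum_id, 0.0)}
    for run, match_error in match_error_to_source.items():
        links[run] = (spectrum_id, match_error)
    return links


# Hypothetical numbers: spectrum 42 was acquired in run_1 only, and the
# run_1/run_2 feature match has an error probability of 0.015.
links = propagate_ms2(42, "run_1", {"run_2": 0.015})

# The group from the original example has only spectrum id 0 in every row.
never_fragmented = is_unfragmented([[(0, 0.0)], [(0, 0.0185841322)]])
```

Under this toy model, two MS2 spectra acquired from the same run_1 feature would both reach run_2 with the same match error, which is consistent with the identical probabilities noted in question 4.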

andrewjmc commented 4 years ago

For the moment this is perfect! I am about to launch a big run on the HPC and look forward to the output!

Thanks for your help,

Andrew