statisticalbiotechnology / quandenser

QUANtification by Distillation for ENhanced Signals with Error Regulation
Apache License 2.0
9 stars 1 forks source link

Direct correspondence of MS2s to features #19

Closed andrewjmc closed 2 years ago

andrewjmc commented 4 years ago

Hello,

I am finding great utility in utilising the output of quandenser, particularly by combining with existing MS2 identifications from other tools.

I note that PEPs are used to quantify confidence in the correspondence between features and individual MS2s. Is there any way from the quandenser output to know the direct MS2 to feature correspondence (i.e. the feature corresponding to the precursor that was selected for fragmentation)?

Thanks,

Andrew

MatthewThe commented 4 years ago

There is no easy way to do this, but it can be done by finding the file+scan number in the cluster output in the clusters_p10 file in the maracluster folder, take the consensus scan number from the third column and then find the features in the feature_groups.tsv file with that consensus scan number. In this file, the consensus scan number will be multiplied by 100 and an index will be added, e.g. for scan number 1234, 123402 will be one of the consensus scan numbers in that file.

andrewjmc commented 4 years ago

Thanks Matthew, I think I've got the hang of that bit already. However, this gives me the correspondences with p values, and these could even be from an MS1 in one sample to an MS2 in another. I'm interested in picking out the (narrower) direct matching from a sample's MS1 feature to that same sample's MS2. I guess this might be harder

MatthewThe commented 4 years ago

That's indeed a bit harder.. If I remember correctly, this information is stored in the binary file maracluster_extra_features/Quandenser.spectrum_to_precursor_map.dat. I might be able to convert that back to something human readable. I'll have a look tomorrow.

MatthewThe commented 4 years ago

I created a python script that can convert the binary file maracluster_extra_features/Quandenser.spectrum_to_precursor_map.dat: convert_spectrum_to_precursor_map.zip

You can run it as:

python convert_spectrum_to_precursor_map.py maracluster_extra_features/Quandenser.spectrum_to_precursor_map.dat maracluster_extra_features/Quandenser.spectrum_to_precursor_map.tsv

The important columns are:

andrewjmc commented 4 years ago

Perfect, thanks. Look forward to trying this out!

Andrew

andrewjmc commented 4 years ago

Just to note I don't have a maracluster_extra_features/Quandenser.spectrum_to_precursor_map.dat only maracluster/Quandenser.spectrum_to_precursor_map.dat. Is this OK?

MatthewThe commented 4 years ago

Ah, that is probably because you ran that part using standalone MaRaCluster. It will work with the maracluster/Quandenser.spectrum_to_precursor_map.dat and will give all the correspondences to features discovered in the initial Dinosaur runs, but not to features that were found by the targeted Dinosaur runs.

If you also want the latter correspondences, you need to run the second MaRaCluster part within Quandenser by removing the maracluster_extra_features folder and re-running Quandenser. Make sure to back up the previous results such as the Quandenser.feature_groups.tsv and the consensus_spectra folder, otherwise Quandenser might try to overwrite them.

andrewjmc commented 4 years ago

OK, make sense.

The script worked fine, thanks. Even before I get to using the data, I'm looking at the PEP-based feature to MS2 correspondence from the feature groups file. Is it a surprise that 55% of MS2 clusters (and 67% of MS2s) were never assigned to a feature?

MatthewThe commented 4 years ago

That seems suspicious, how did you calculate these numbers?

All MS2 clusters should be mapped to at least one feature group. With the Python script, I could map 81% (maracluster) and 84% (maracluster_extra_features) of the MS2 spectra to an MS1 feature.

andrewjmc commented 4 years ago

In my R script, I read in the feature groups in blocks (to save memory). I summarise the feature groups into single rows and aggregate the assigned MS2s into comma separated lists. I then produce a separate lookup table with columns for MS2 cluster, MS2 index and feature group.

I separately read in the MS2 clusters and also aggregate separately into single rows per cluster.

I have 9.4 million MS2s, and 4.9 million clusters. 3.1 million MS2s and 2.2 million clusters are assigned to features according to my analysis.

The spectrum to precursor map (initial dinosaur runs) has 9.5 million rows, but some MS2s are duplicated (putative chimeric spectra?), leaving 6.5 million unique MS2s assigned to features. This is still higher than what I get above.

Perhaps I have done something wrong in my analysis - I will look again.

UPDATE think I've spotted the issue - I may have just been taking the MS2 assignments from the first feature of each feature group. The stats will probably improve when I change this.

andrewjmc commented 4 years ago

I've updated the numbers and have some questions. Using the PEP-based assignments from every feature changed nothing. I now realise that's because it's always the same list of MS2s that are assigned to each feature within a feature group, only with different PEPs. Also, if I simply grep the features from the feature groups file, I find the same numbers, reassuringly!

Looking at the direct assignments from the spectrum to precursor map (maracluster_extra_features), I have 15 million mappings, meaning some MS2s are mapped to multiple features. Some features, unsurprisingly, are mapped to multiple MS2s. There are 6.8 million unique MS2s which are linked to features, which is a much better proportion, but still leaves about 3 million "orphaned" MS2s. What might explain these?

Why might it be that the proportions of MS2s and MS2 clusters assigned by PEP (33 and 45% respectively) are low compared to the proportion of MS2s assigned directly (72%). Is it the case that no more than one MS2 from each cluster is assigned by PEP?

Turning things around and looking at feature groups, 630,000 of 3 million feature groups have at least one MS2 directly linked to a feature, whereas just over 1 million do when using the FDR-controlled linkage. I haven't yet worked out why the FDR-controlled linkage should increase the number of feature groups with an MS2 assigned. Do you have an explanation?

Sorry for the excess of questions, but grateful for your thoughts!

Best wishes.

MatthewThe commented 4 years ago

Sorry for the slow reply, I had to find some time to properly think about this.

There are 6.8 million unique MS2s which are linked to features, which is a much better proportion, but still leaves about 3 million "orphaned" MS2s. What might explain these?

The number of "orphaned" MS2s is rather high in your case, but we do expect some of them from (1) MS2 spectra that are triggered by noise peaks in MS1, (2) MS2 spectra that are not from peptides (as their isotope patterns might differ) and (3) as a result of regions where feature calling is difficult, e.g. when a lot of overlapping isotope patterns confound the feature caller. You could check if there are certain retention time regions or certain runs that have a higher rate of orphaned MS2s.

Why might it be that the proportions of MS2s and MS2 clusters assigned by PEP (33 and 45% respectively) are low compared to the proportion of MS2s assigned directly (72%). Is it the case that no more than one MS2 from each cluster is assigned by PEP?

As I mentioned before, all MS2 clusters should be assigned to at least one feature group. Perhaps I'm missing a step, but non-clustered MS2s are never assigned by PEP to a feature group, only MS2 clusters.

I haven't yet worked out why the FDR-controlled linkage should increase the number of feature groups with an MS2 assigned. Do you have an explanation?

What do you mean exactly with the "FDR-controlled linkage" and how does it differ from the PEP assignments?

If the files are not too big, could you share your Quandenser.feature_groups.tsv, the converted spectrum_to_precursor_map.dat and maracluster_extra_features/Quandenser.clusters_p10.tsv?

andrewjmc commented 4 years ago

Really helpful, thanks.

The number of "orphaned" MS2s is rather high in your case, but we do expect some of them from (1) MS2 spectra that are triggered by noise peaks in MS1, (2) MS2 spectra that are not from peptides (as their isotope patterns might differ) and (3) as a result of regions where feature calling is difficult, e.g. when a lot of overlapping isotope patterns confound the feature caller. You could check if there are certain retention time regions or certain runs that have a higher rate of orphaned MS2s.

I've looked into the orphaned spectra, and they are concentrated in the samples from lab 1 (~50% orphaned vs 20%). Interestingly lab 1 loaded much more sample, so perhaps this causes the issue with overloading the feature caller? Do you have any pointers for investigating this more, and is there any way to improve this?

As I mentioned before, all MS2 clusters should be assigned to at least one feature group. Perhaps I'm missing a step, but non-clustered MS2s are never assigned by PEP to a feature group, only MS2 clusters.

I had neglected this, and thought that singletons were also assigned. Is there a reason for dropping the singletons? That said, looking at the lookup table I made, I find that nearly half of assigned MS2s are singletons, and I have checked (for one example) that this is true in the original files.

Yes, I meant the PEP assignments. I'm surprised that my analysis finds 600,000 feature groups have at least one MS2 directly assigned, but over a million have MS2s assigned by PEP. I can't work out how a feature group without a directly assigned MS2 can have an MS2 assigned by PEP. It is always possible I've done something wrong!

I would be happy to share, but it's 28 Gb total! I'm not sure if the college's Box service will accept files of that size.

MatthewThe commented 4 years ago

Do you have any pointers for investigating this more, and is there any way to improve this?

One thing would be to check if we can actually identify these orphaned MS2 spectra. If we can, then perhaps we could try a different feature calling software.

I had neglected this, and thought that singletons were also assigned. Is there a reason for dropping the singletons? That said, looking at the lookup table I made, I find that nearly half of assigned MS2s are singletons, and I have checked (for one example) that this is true in the original files.

Sorry, I did not formulate this well, singleton clusters are definitely assigned. What I meant was that the original MS2 spectra identifiers are never assigned to feature groups, so I was confused that you assigned these using PEPs rather than directly from the spectrum_to_precursor_map output (which does not contain PEPs), but maybe I just misunderstood what you did.

I can't work out how a feature group without a directly assigned MS2 can have an MS2 assigned by PEP. It is always possible I've done something wrong!

This should indeed not happen, in fact this number should be the same. Could you perhaps find an example where this happens.

I would be happy to share, but it's 28 Gb total! I'm not sure if the college's Box service will accept files of that size.

Ah, I see. As these are text files, compressing them should reduce the size quite drastically. How big are each of the files by themselves?

andrewjmc commented 4 years ago

Thanks for clarifying - I now understand that the linkage is solely to MS2 clusters. I will ignore the spectrum assignment, which I presume is spurious.

The statistics now almost agree. 45.8% of clusters have at least one MS2 directly assigned to a feature. Only a few clusters are missed by PEP by my calculation (65,365, 1.3%). I have confirmed by searching the feature groups TSV for one example (just to make sure it's not an error in my program).

Thanks!

andrewjmc commented 4 years ago

I've returned to being a little puzzled by the discrepancy in feature groups with MS2s assigned by PEP, and features with MS2s linked directly by the spectrum_to_precursor_map. I presently have about 800,000 feature groups with MS2 clusters assigned by PEP and ~1.3 million with MS2s assigned directly. I tested the assumption that each individual MS2 should only be assigned to one feature in a sample, but this is not true.

For example I have an MS2 which is directly assigned to 14 MS1 features from the sample with similar retention times, and precMz between 1110.000 and 1111.467, a difference of 1300 ppm. I would understand multiple assignment if the precMz were within a very narrow tolerance and it was unclear which feature the MS2 corresponded to. However, these are FTIT orbitrap data, and the default tolerance of quandenser is 20 ppm.

If I take the MS2 cluster to which this MS2 belongs, it is assigned to only a single of the 14 feature groups corresponding to the direct assignment.

Grateful for your advice.

MatthewThe commented 4 years ago

This is likely the effect of the "intensity score filter", which selects the "best" feature group per MS2 cluster. The intensity score filter tries to select the feature group(s) with the fewest missing values and highest intensity values.

One MS2 spectrum can indeed be assigned to many MS1 features, this is done based on the isolation window as recorded in the mzML file. This isolation window is usually 1-2Da, which is what explains the 1300 ppm difference you see. By also taking into account other MS1 features in the isolation window, we can identify chimeric spectra and assign them to the correct MS1 feature.

Hope that helps!

andrewjmc commented 2 years ago

I realise I never responded to your very useful comment here. I had never quite appreciated how wide the isolation window can be on an Orbitrap. No wonder chimeric spectra are common!