statisticalbiotechnology / maracluster

Matthew The's implementation of MaRaCluster
Apache License 2.0

More aggressive clustering #15

Closed: bittremieux closed this issue 5 years ago

bittremieux commented 5 years ago

I have some data generated with dynamic exclusion disabled, so a lot of spectra are very similar to each other.

To condense this data I tried running MaRaCluster. However, a lot of the different clusters still have extremely similar consensus spectra and seem to correspond to the same peptide. Yet, those spectra have not been merged into a single cluster.

See for example: [image]. Each node is a consensus spectrum, with the width of the edges corresponding to the dot product similarity. You can see four cliques that should probably be represented by a single consensus spectrum, yet MaRaCluster hasn't merged these spectra.
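For readers who want to reproduce this kind of comparison, here is a minimal sketch of a normalized dot product between two peak lists. The thread does not say how the similarity was actually computed, so the binning scheme, the `bin_width` default, and the function names are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def bin_peaks(peaks, bin_width=1.0005):
    """Sum intensities of (mz, intensity) peaks that fall in the same m/z bin.

    bin_width is an assumed value, not one taken from the thread.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def dot_product_similarity(peaks_a, peaks_b, bin_width=1.0005):
    """Normalized dot product (cosine) of two binned spectra, in [0, 1]."""
    a = bin_peaks(peaks_a, bin_width)
    b = bin_peaks(peaks_b, bin_width)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum is treated as dissimilar to everything
    shared = sum(v * b[k] for k, v in a.items() if k in b)
    return shared / (norm_a * norm_b)
```

Pairs of consensus spectra whose similarity exceeds some chosen cutoff could then be connected by an edge to obtain a graph like the one described.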

How can I set MaRaCluster to be more aggressive and combine more spectra into clusters?

MatthewThe commented 5 years ago

Which p-value threshold are you using now (this is the number after "_p" in the name of the cluster file you're using)? For more aggressive clustering, you could try the file with the most aggressive clustering, i.e. "_p5" by default. If this is already the setting you're using, you can try higher threshold values with the -t option, e.g. -4 or even -3, which would then also have to be added to the -c option.
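The relationship between the two options can be sketched in a few lines of Python. This is only an illustration under assumptions: it treats the thresholds as log10 p-values (inferred from "_p5" corresponding to 10^-5 elsewhere in this thread), and `threshold_args` is a hypothetical helper, not part of MaRaCluster.

```python
def threshold_args(log10_thresholds):
    """Build -t / -c argument values from log10 p-value thresholds.

    Hypothetical helper: only the -t and -c flags themselves come from the thread.
    """
    ordered = sorted(set(log10_thresholds))
    # Per the advice above, the value passed to -t must also appear in -c,
    # so the loosest (highest) threshold is used for -t.
    loosest = ordered[-1]
    return ["-t", f"{loosest:.1f}", "-c", ",".join(f"{t:.1f}" for t in ordered)]


args = threshold_args([-5.0, -4.0, -3.0])
print(" ".join(args))
for t in (-5.0, -4.0, -3.0):
    # Assumed interpretation: a threshold of -3.0 means a p-value of 10^-3.
    print(f"log10 threshold {t:.1f} -> p-value {10 ** t:g}")
```

The resulting list could be appended to the rest of a `maracluster batch` invocation; the remaining arguments are left out here because they depend on the data set.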

There could still be other reasons why the spectra in question don't cluster such as variation in precursor m/z or predicted charge state. If you want I can have a closer look at the data myself to figure out why these spectra don't cluster.

bittremieux commented 5 years ago

I was indeed already using 10^-5 as p-value.

I've tried using higher p-values, running MaRaCluster like this:

maracluster batch -b files.txt -f . -c -5.0,-4.0,-3.0 -p 20ppm -C 1 -t -3.0 -o consensus.ms2 -M 3

Unfortunately, the consensus spectra are now no longer generated, only the cluster files. I've tried to explicitly set the consensus output file using -o, but that doesn't seem to change anything. When I was running MaRaCluster with the default arguments I got consensus MS2 files as well, i.e. MaRaCluster.consensus_p5.part1.ms2, etc. for the other p-value thresholds. How can I get those consensus spectra back?

Also, despite setting the minimum number of spectra per cluster to 3, I still see clusters containing only one or two spectra in the cluster output file. This is not a big deal because I'm filtering those clusters myself afterwards, but it probably confirms I don't really know what's going on with the consensus cluster generation.
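The kind of post-hoc filtering mentioned here might look like the sketch below. The thread does not show the cluster file layout, so this assumes a tab-separated file with one spectrum per row, the cluster identifier in the third column, and possibly blank lines between clusters; the function names and the `cluster_col` default are hypothetical.

```python
import csv
from collections import defaultdict


def read_cluster_rows(path):
    """Yield rows from a tab-separated cluster file (assumed layout)."""
    with open(path, newline="") as handle:
        yield from csv.reader(handle, delimiter="\t")


def filter_small_clusters(rows, min_size=3, cluster_col=2):
    """Group rows by cluster id and keep clusters with at least min_size rows."""
    clusters = defaultdict(list)
    for row in rows:
        if not row:  # skip blank separator lines
            continue
        clusters[row[cluster_col]].append(row)
    return {cid: members for cid, members in clusters.items()
            if len(members) >= min_size}


# In-memory example; the file names and scan numbers are made up.
example_rows = [
    ["run1.ms2", "101", "1"],
    ["run1.ms2", "102", "1"],
    ["run2.ms2", "57", "1"],
    [],
    ["run1.ms2", "340", "2"],
]
kept = filter_small_clusters(example_rows, min_size=3)
print(sorted(kept))
```

If the real file uses a different column order, only `cluster_col` needs to change.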

MatthewThe commented 5 years ago

Actually, the recommended way of generating consensus spectra is with the "maracluster consensus ..." command (instead of "maracluster batch ..."), where the cluster file is used as input. I'm surprised that it initially generated consensus files at all with the "batch" command.

bittremieux commented 5 years ago

Woops, my bad. Indeed, I forgot I'd used the consensus command to create those consensus spectra. I did the MaRaCluster analysis on another computer last week and I hadn't properly kept track of the different commands, so I was briefly confused.

The higher p-value thresholds seem to help, as clusters that contain more spectra are created. There are still a few separate consensus spectra that I'd expect to be merged, but I'm taking care of those in a post-processing step.

Thanks for the help.