statisticalbiotechnology / representative-spectra-benchmark

Analysis of different consensus spectrum construction methods
Apache License 2.0

Discussion about results #56

Open ypriverol opened 3 years ago

ypriverol commented 3 years ago

Hi @all: We have assigned a student to analyze the data and finish the project 🚀. We have made some major advances in the data analysis, included in PR #55 in the results folder (Excel files and figures). Here are some details about what we want to do in terms of analysis:

We have found some issues related to clustering that we need to solve first; @jgriss can probably comment on them.

Some preliminary results show:

ypriverol commented 3 years ago

We have found the issue with the cluster generation in PRIDE Cluster. We will post an update on this later this week.

jgriss commented 3 years ago

@ypriverol Thanks for the update! I was just starting to set up the tests!

Anything I can fix in the spectra-cluster code?

jgriss commented 3 years ago

@ypriverol The results are in line with what's in the literature. If I'm not mistaken, we even had this in our first paper in 2012.

The theory back then was that a consensus spectrum always contains some noise and will therefore never be as good as the best measured spectrum in the dataset.

The question is whether we can use this to create an even better consensus algorithm.

ypriverol commented 3 years ago

@jgriss we have found the right PRIDE Cluster threshold to produce the same number of clusters for MaRaCluster and PRIDE Cluster. I will post an update in this issue this week.

ypriverol commented 3 years ago

@jgriss @percolator :

We have the first results of the data analysis. The idea of this research is to compare different consensus spectrum construction methods through peptide identifications.

We found small differences between PRIDE Cluster and MaRaCluster. I would really like to remove one of them, @percolator, because any reviewer will focus on the clustering results rather than the consensus spectra generation. I don't mind making the comparison with MaRaCluster rather than PRIDE Cluster, but I really don't want to make the comparison about clustering algorithms. The original results can be seen here: https://github.com/ypriverol/specpride/blob/dev/results/discussion.adoc

What do you think? @percolator @jgriss

jgriss commented 3 years ago

Hi @ypriverol ,

I'm just looking through the clustering Nextflow workflow: in the current version in the repo, no arguments are passed to MaRaCluster.

The default precursor tolerance, for example, is 20 ppm for MaRaCluster while you set 10 ppm for spectra-cluster.

Also, I failed to find the step where you merge the clustering results with the MGF files. Which MaRaCluster threshold are you using?
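For illustration, here is a minimal sketch of that missing merging step, assuming the MaRaCluster cluster output is a tab-separated file with raw-file path, scan number, and cluster ID per line (blank lines separating clusters) and that the MGF spectra carry a SCANS field; the file names below are placeholders, not the ones used in the workflow.

```python
# Sketch only: attach MaRaCluster cluster IDs to MGF spectra so clusters can be
# grouped for consensus spectrum generation. The output/field formats are
# assumptions and may need adjusting to the actual pipeline.
from collections import defaultdict
from pyteomics import mgf

def read_maracluster_clusters(path):
    """Map (raw_file, scan_number) -> cluster_id from a MaRaCluster output file
    (assumed layout: raw_file<TAB>scan<TAB>cluster_id, blank line between clusters)."""
    assignments = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:  # blank lines separate clusters
                continue
            raw_file, scan, cluster_id = line.split("\t")[:3]
            assignments[(raw_file, int(scan))] = int(cluster_id)
    return assignments

def group_spectra_by_cluster(mgf_path, raw_file, assignments):
    """Group the spectra of one MGF file by their assigned cluster ID
    (assumes each spectrum has a SCANS entry in its parameters)."""
    clusters = defaultdict(list)
    with mgf.read(mgf_path) as reader:
        for spectrum in reader:
            scan = int(spectrum["params"]["scans"])
            cluster_id = assignments.get((raw_file, scan))
            if cluster_id is not None:
                clusters[cluster_id].append(spectrum)
    return clusters

# Hypothetical usage (file names are placeholders):
# assignments = read_maracluster_clusters("MaRaCluster.clusters_p10.tsv")
# clusters = group_spectra_by_cluster("run01.mgf", "run01.raw", assignments)
```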

Finally, both MaRaCluster and spectra-cluster have built-in consensus spectrum algorithms. How about evaluating them as well?

Kind regards, Johannes

ypriverol commented 3 years ago

After a brief discussion today with @percolator and @jgriss, we decided on the following:

bittremieux commented 3 years ago

I have recently compared a few clustering tools, including spectra-cluster and MaRaCluster: https://www.biorxiv.org/content/10.1101/2021.02.05.429957v2

[figure from the preprint above comparing the clustering tools]

Both of these can generate a very high number of small clusters compared to other tools (Figure 2). This is an important aspect to keep in mind. For example, spectra-cluster and MaRaCluster might split large clusters corresponding to the same peptide into several small clusters more often than other clustering tools. As a result, searching the clustered data will take longer while the number of unique peptides that can be identified should be similar.

Rather than trying to get the same number of clusters out of each tool, I think it could be relevant to get a clustering result from each tool with a comparable number of incorrectly clustered spectra, as in my evaluation. Next, as you already suggest, I think it's a good idea not to approach this by contrasting different tools with each other, but rather to evaluate which representative strategy works best for each tool.
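For concreteness, a rough sketch (not the benchmark code itself) of one common way to count incorrectly clustered spectra: every identified spectrum whose peptide differs from the majority peptide of its cluster, considering only clusters with at least two identified spectra.

```python
# Rough sketch, not the actual benchmark implementation: fraction of identified
# spectra whose peptide disagrees with the majority peptide of their cluster.
from collections import Counter, defaultdict

def incorrectly_clustered_fraction(cluster_ids, peptides):
    """cluster_ids: one cluster label per spectrum;
    peptides: one peptide per spectrum, None if unidentified."""
    members = defaultdict(list)
    for cluster_id, peptide in zip(cluster_ids, peptides):
        if peptide is not None:
            members[cluster_id].append(peptide)
    incorrect = total = 0
    for cluster_peptides in members.values():
        if len(cluster_peptides) < 2:  # skip clusters with <2 identified spectra
            continue
        majority = Counter(cluster_peptides).most_common(1)[0][1]
        incorrect += len(cluster_peptides) - majority
        total += len(cluster_peptides)
    return incorrect / total if total else 0.0

# Example: one of five identified spectra disagrees with its cluster's majority peptide.
# incorrectly_clustered_fraction([1, 1, 1, 2, 2],
#                                ["PEPA", "PEPA", "PEPB", "PEPC", "PEPC"])  # -> 0.2
```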

I still think it's valuable to include multiple tools, not only MaRaCluster. A single tool might do something funky or have some special properties. By getting a trend for multiple tools we'll have less "overfitting" and get a more generalizable insight.

Different tools produce clusters with different characteristics that may be suited for alternative downstream applications. There likely won't be a single best answer. Instead, learning general trends of what works well and what doesn't will be valuable, without having to explicitly compare different tools to each other.

bittremieux commented 3 years ago

The analysis above was performed on the Kim draft human proteome, which consists of ~25M spectra. I already have clustering results from all these different tools on that dataset, so we might be able to reuse those if a bigger dataset fits the plan.

The clustering data is actually also available here: https://zenodo.org/record/4721496

jgriss commented 3 years ago

@bittremieux Thanks a lot for sharing! That's a very nice benchmark!

@ypriverol I very much like @bittremieux's idea to use these results as a basis for the consensus spectrum test. The dataset is well known and sufficiently large.

Would it be possible to adapt the pipeline to use these results as well?

percolator commented 3 years ago

I agree that these are really nice plots, Wout!

My worry is just that it might be hard to get a clear message across if we discuss results from several clustering methods. I discussed this some time ago with Yasset, and we came to the conclusion that we might include multiple clustering tools, but that we should strive to keep such results in the supplement unless the results are very consistent.

We used the Zolg et al. set in our benchmark mostly so that we know which peptide is behind each cluster. This is not the case for the Kim et al. set. Is there a constructive way to define what we would consider a correct result for Kim et al.?

In all honesty, the approach with the Zolg et al. set has an inherent problem in that Yasset's tests produce lower unique peptide identification rates with clustering than without clustering. That is likely not the case for a larger set, like the Kim et al. set.

Yours --Lukas


jgriss commented 3 years ago

Hi Lukas,

One aspect that should maybe be discussed is that, based on Wout's results, both msCRUSH and falcon create larger clusters than our tools. This could improve the consensus spectrum quality.

How about keeping the dataset but adding falcon and msCRUSH? If the results are very similar, everything can go into the supplementary material. But if they are not, they could point us in the right direction.

Kind regards, Johannes

bittremieux commented 3 years ago

[...] we came to the conclusion that we might include multiple clustering tools, but that we should strive to keep such results in the supplement unless the results are very consistent.

We seem to be pretty much in agreement. It's a valid concern that we don't want a competition between different clustering tools. However, it's my hope that by having results from multiple tools, a general trend will emerge. That would make for a stronger message than just having a single tool, with its particular idiosyncrasies and cluster characteristics. And a key message will probably be that there is no single "best" tool, but that different tools produce different results that might be more or less applicable to different use cases.

I'm not the biggest fan of using the ProteomeTools dataset, because it's not a realistically complex sample. Unfortunately, with the Kim dataset (and other biological datasets) there is of course no ground truth. It's maybe a bit inconvenient, but I think that using peptide identifications as a proxy for the ground truth should be fine. After all, that's what the field has been doing for decades. There will indeed be some noise in the labels. When performing the above benchmark I inspected several clusters manually, and some "incorrectly" clustered spectra had highly unlikely, dissimilar peptide assignments that seemed to be a scoring artifact rather than genuinely different spectra. However, this is likely a similar problem for all clustering tools, and should presumably not favor one tool over another.

Let me see if I can find a bit of time this weekend to export representative spectra for the clustering results with the different tools that I already have. If it works, we might be able to relatively quickly check what the results look like and decide how to frame the story based on that.

In the figure above I only considered clusters with at least two spectra as valid clusters (otherwise you're always able to cluster 100% of the data of course). I guess for this analysis we also want to include singleton clusters, because the goal is to maximize the number of identifications?
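To make the singleton bookkeeping concrete, a toy sketch of the two ways of counting, assuming a flat list of cluster labels (one per spectrum):

```python
# Toy illustration of how including singleton clusters changes the counts.
from collections import Counter

def cluster_counts(cluster_ids):
    sizes = Counter(cluster_ids)                        # cluster -> number of spectra
    multi = {c: n for c, n in sizes.items() if n >= 2}  # non-singleton clusters only
    return {
        "clusters_incl_singletons": len(sizes),
        "clusters_excl_singletons": len(multi),
        "spectra_in_non_singleton_clusters": sum(multi.values()),
        "spectra_total": sum(sizes.values()),
    }
```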

bittremieux commented 3 years ago

@percolator Do you mind giving me direct commit rights to this repository? Thanks.

ypriverol commented 3 years ago

First of all, thanks @bittremieux @jgriss and @percolator for the discussion. Some thoughts here:

1- First, we will use multiple clustering algorithms, as we originally agreed, @bittremieux @jgriss @percolator. However, within the main manuscript we will discuss one clustering algorithm, and all the other tools can be moved to the supplement. The major reason is that we don't want to go toward clustering benchmarking; we have done it already, and multiple papers have done it as well. But I like the idea of looking at which consensus algorithms work better in combination with which clustering algorithms.

2- Currently, the student is testing a 1M MS/MS dataset in addition to the two datasets already tested. While I like the idea of testing the 25M-spectra dataset, in the current manuscript we always compare with the original identification numbers (without clustering); do you have those numbers, @bittremieux?

3- The student is now running all the data again through the identification pipeline because we have some inconsistencies with the identifications from MSGF+. I will post the updated results at the end of the week.

bittremieux commented 3 years ago

Yes, the identifications I'm using are from the MassIVE reanalysis of the draft human proteome dataset (RMSV000000091.3). That's also what I used for the comparisons in the figure above.

jgriss commented 3 years ago

Hi everyone,

I very much agree with @ypriverol and @percolator that we should be careful not to let the focus shift to the clustering algorithms. Nevertheless, the size and purity of clusters may be an important factor in the quality of consensus spectra.

In order to strengthen the focus on the consensus algorithm, why don't we add a completely artificial dataset where the consensus spectrum is calculated from a certain proportion of the spectra that were identified as the same peptide? This fraction could be varied in order to create "clusters" of different sizes. This would allow us to explicitly study the effect of cluster size without using any clustering algorithm.
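A minimal sketch of what such an artificial "clustering" could look like, assuming identified spectra are grouped by peptide and the cluster size is controlled by a sampling fraction; all names here are placeholders, not part of the existing pipeline.

```python
# Sketch only: build artificial "clusters" of tunable size from identified spectra,
# with no clustering algorithm involved.
import random
from collections import defaultdict

def artificial_clusters(identified_spectra, fraction, seed=0):
    """identified_spectra: iterable of (spectrum, peptide) pairs, peptide None if unidentified;
    fraction: proportion of each peptide's spectra used to build its consensus spectrum."""
    rng = random.Random(seed)
    by_peptide = defaultdict(list)
    for spectrum, peptide in identified_spectra:
        if peptide is not None:
            by_peptide[peptide].append(spectrum)
    clusters = {}
    for peptide, members in by_peptide.items():
        k = max(1, round(fraction * len(members)))  # at least one spectrum per "cluster"
        clusters[peptide] = rng.sample(members, k)
    return clusters

# Sweeping the fraction yields "clusters" of different sizes for the same peptides:
# for fraction in (0.1, 0.25, 0.5, 1.0):
#     clusters = artificial_clusters(spectra_with_ids, fraction)
```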

I don't suggest this should be the only comparison, but an additional one to what's already planned.

Kind regards, Johannes