New package for LC-MS/metabolomics feature grouping?

jorainer commented 4 years ago

We've implemented some functionality that allows grouping LC-MS features (i.e. features defined by their m/z and retention time range) into feature groups, where ideally each feature group collects all features that come from the same original compound (i.e. are adducts, isotopes, etc. of it). The question now is where this functionality should live. Currently 3 main grouping functions are available:

group features based on similar retention time
group features based on high correlation of feature abundances across samples
group features based on similar peak shape of the EIC
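
To make the first of these approaches concrete, here is a minimal Python sketch of grouping by similar retention time. It is an illustration only, not the R code discussed in this thread: the function name `group_by_rtime`, the `max_diff` tolerance and the `FG.001` label format are all assumptions made for the example. Features are sorted by retention time and a new group starts whenever the gap to the previous feature exceeds the tolerance.

```python
def group_by_rtime(rtimes, max_diff=1.0):
    """Assign a group label to each feature from its retention time (seconds).

    Features are sorted by retention time; a new group starts whenever the
    gap to the previous feature exceeds `max_diff` (hypothetical tolerance).
    """
    order = sorted(range(len(rtimes)), key=lambda i: rtimes[i])
    labels = [None] * len(rtimes)
    group = 0
    previous = None
    for i in order:
        if previous is None or rtimes[i] - previous > max_diff:
            group += 1
        labels[i] = f"FG.{group:03d}"  # illustrative label format
        previous = rtimes[i]
    return labels


rtimes = [30.2, 30.5, 75.0, 31.1, 75.4, 120.0]
print(group_by_rtime(rtimes, max_diff=1.0))
# ['FG.001', 'FG.001', 'FG.002', 'FG.001', 'FG.002', 'FG.003']
```

Because only neighbouring gaps are checked, a long chain of closely spaced features ends up in one group even if its first and last members are far apart; whether that is desirable depends on the chromatography.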

The LC-MS feature grouping functions above re-use functionality from xcms (especially for the correlation based on peak shape of the EIC).

The general workflow would be the following: xcms (peak picking, feature definitions) -> ??? (initial feature grouping based on properties of the LC-MS) -> Features (optional further feature grouping, independent of LC-MS).

The question now is where to put this functionality. Possible options are:

1) put it into xcms
2) put it into Features
3) put it into a new package MetaboFeatures or xcmsFeatures.

Would be nice to get some feedback @lgatto @sneumann @stanstrup @michaelwitting @sgibb

michaelwitting commented 4 years ago

I would vote for number 3: xcmsFeatures.

lgatto commented 4 years ago

It would be useful to have a look at the package.

jorainer commented 4 years ago

The package is not there yet - I've implemented the respective functionality here at the moment: https://github.com/EuracBiomedicalResearch/CompMetaboTools/blob/master/R/group_feature_methods.R

We could also make a dev call, but if so, it has to be either this week or end of July.

lgatto commented 4 years ago

Thanks - so this is what you showed me some time ago during a call, isn't it?

It doesn't look like it fits in Features at all (for now at least).
I wouldn't mind a call, but not this week.
I would prefer to not put it in xcms because I also see an application in proteomics.
I think option 3 seems the right approach. I would prefer not xcmsFeatures (see above, and it starts with a lower case ;-), and I'm only lukewarm about MetabFeatures (see above). What about MSFeatures? And maybe the current Features could be renamed QFeatures (for Quantitative/Quantitation), to keep away from the ambiguous term "features".
Are there any plans to eventually convert the mass spec features into quantitative tables using Features?

jorainer commented 4 years ago

> Thanks - so this is what you showed me some time ago during a call, isn't it?

Yes, exactly.

> It doesn't look like it fits in Features at all (for now at least).

That's how I also see it at the moment - but its results should then be exported/converted to a Features object.

> I wouldn't mind a call, but not this week.

Maybe anyway better to do a call once I've also finished a vignette.

> I would prefer to not put it in xcms because I also see an application in proteomics.

OK for me.

Regarding the name, I agree, Metabo might not be the correct thing here - actually, the most appropriate name for the features we're dealing with would be LC-MS features - but LCMSFeatures looks terrible. I would be OK with MsFeatures.

> Are there any plans to eventually convert the mass spec features into quantitative tables using Features?

Absolutely. The plan is xcms -> ??Features (do feature grouping based on LC-MS properties) -> Features (do some further grouping based on difference in m/z, or MS2 spectra or ...). I see the ??Features more as a package that depends on Features, reuses its classes and functionality, but adds additional functionality that we need for untargeted LC-MS(/MS) (which might also, at least partially, be interesting for proteomics).

lgatto commented 4 years ago

It looks like we are on the same page. Let's schedule a call after our respective holidays.

stanstrup commented 4 years ago

I am not exactly following @lgatto's reason for not liking xcmsFeatures and how it relates to proteomics, but that is probably my ignorance of proteomics.

But to me it is also illogical. An "xcms Features", i.e. thought of as either a peak in two dimensions (as we generally do in metabolomics, right?) or as a feature generated by the package xcms - wouldn't that be simply a feature?

If the package is about grouping features/"Features" shouldn't that be reflected in the name? So... wouldn't FeatureGroups or something similar feel more natural?

I am curious about the motivation for re-implementing this from scratch. The functionality is very much like CAMERA, right? Looking through your documentation it was not clear to me how the correlation network is cut in the end. It does the refining of the groups one correlation at a time, right? As opposed to CAMERA, which adds the correlations.

If I was emperor with supreme powers I would suggest a package that wraps the now several packages that do something similar with a unified API. I guess I will wait for my infinite emperor grant to come through.

jorainer commented 4 years ago

Re package name, in RforMassSpec all packages follow the same naming conventions and start with a capital letter, that's also one reason against xcmsFeatures. Also I would prefer MsFeatures over FeatureGroups because the former is a little more generic. The package does something with MS features - grouping them might just be one thing.

Re CAMERA, yes, this is somewhat re-implementing part of its functionality. Honestly, even by looking through the code of CAMERA I did not exactly get how it performs all the correlations and grouping and how you can control that. This was when I decided I wanted to re-implement it and split the functionality into different calls that can all be called separately or combined in any order (and the core functions are also independent of any class, so they could be re-used by other packages or for other stuff). I base all correlation groupings on a complete pairwise correlation matrix between all members (features) of a group. Then I start with the pair with the highest correlation, put them into a group, and iteratively walk through all pairwise correlations (ordered by correlation coefficient), putting features into an existing group if their correlation is higher than the threshold. This approach creates feature groups in which all members have a correlation > threshold to each other. Here it would be nice to get your input and knowledge of CAMERA - maybe there's something better implemented that I have overlooked.
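
The grouping procedure described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the actual implementation: the function names, the `FG.001` label format and the default threshold are hypothetical, and the rule for merging two already existing groups (merge only if every cross-pair exceeds the threshold) is a guess, since the comment does not say how that case is handled. Constant abundance vectors (zero variance) are assumed not to occur.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def group_by_correlation(abundances, threshold=0.9):
    """Group features so that all members of a group correlate > threshold."""
    n = len(abundances)
    # Complete pairwise correlation "matrix", stored per feature pair (i < j).
    cor = {(i, j): pearson(abundances[i], abundances[j])
           for i, j in combinations(range(n), 2)}

    def r(i, j):
        return cor[(min(i, j), max(i, j))]

    group_of, members, next_id = {}, {}, 0
    # Walk through all pairs, highest correlation first.
    for (i, j), value in sorted(cor.items(), key=lambda kv: kv[1], reverse=True):
        if value <= threshold:
            break  # all remaining pairs are below the threshold
        gi, gj = group_of.get(i), group_of.get(j)
        if gi is None and gj is None:
            members[next_id] = [i, j]
            group_of[i] = group_of[j] = next_id
            next_id += 1
        elif gi is None or gj is None:
            new, gid = (i, gj) if gi is None else (j, gi)
            # Join only if correlated with every current member.
            if all(r(new, m) > threshold for m in members[gid]):
                members[gid].append(new)
                group_of[new] = gid
        elif gi != gj:
            # Assumed merge rule: every cross-pair must pass the threshold.
            if all(r(a, b) > threshold for a in members[gi] for b in members[gj]):
                for b in members[gj]:
                    group_of[b] = gi
                members[gi].extend(members.pop(gj))

    # Relabel in order of first appearance; ungrouped features become singletons.
    label_of, labels = {}, []
    for i in range(n):
        key = group_of.get(i, ("single", i))
        label_of.setdefault(key, f"FG.{len(label_of) + 1:03d}")
        labels.append(label_of[key])
    return labels


abundances = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10.5],
    [5, 3, 4, 1, 2],
    [5.1, 3.0, 4.2, 0.9, 2.1],
    [3, 1, 4, 1, 5],
]
print(group_by_correlation(abundances, threshold=0.9))
# ['FG.001', 'FG.001', 'FG.002', 'FG.002', 'FG.003']
```

Requiring a new feature to pass the threshold against every existing member is what guarantees the property stated above, that all members of a group correlate above the threshold with each other.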

And yes, there are many packages around now, but I found most (all?) of them quite unusable, because they use their own class which only exists in this one package - and many packages don't even seem to be actively maintained/updated. So, the wrapper package would most of the time have to translate from one object to another and maintenance of this package would be a nightmare. My ideal approach would be to invite all these package maintainers to provide their core functionality and we put all of this in one core package. Something we did with the spectra similarity calculations in the Spectra package. Maybe something for the next metaRbolomics hackathon?

cbroeckl commented 2 years ago

quite some time ago after developing ramclustR i was interested in moving some of the approaches from it into CAMERA. This package, if it is reimplementing CAMERA for feature grouping, could be an opportunity to reinvest in that effort. @stanstrup @jorainer - i would like to work towards this if you think it suitable. There are a few differences (by my understanding) between the ramclustR approach and CAMERA.

Feature Grouping. If my memory serves me (it certainly may be failing me...) CAMERA is pretty sequential in how it is assigning ions to groups. i should probably revisit the source code/approach before making any conclusions, but i recall contrasting this with the ramclustR approach which generates a complete similarity network of all features to each other before pruning the full network into groups using the dynamicTreeCut package.
Feature annotation within groups. CAMERA assumes that there are multiple molecules in each group, ramclustR assumes there is one molecule in each group. The flipside of this: CAMERA assumes that unexplained ions (those which cannot be annotated by the given rules) are second (third, fourth...) molecules, ramclustR assumes they derive from the same compound in some unexplained manner.

I am sure this is oversimplifying at least a bit, but this package will enable us to focus on the common goals of the two packages, and return a common data structure which better integrates with the rest of the R mass spec package family, which i would be excited to try to help with.

jorainer commented 2 years ago

This sounds really great Corey! I would love to integrate ramclustR with the MsFeatures package. The idea of MsFeatures is pretty simple: it takes an input object (XCMSnExp or SummarizedExperiment) and groups the features in it, defining a character vector which represents the grouping (the vector has one element per feature); grouped features will have the same feature group ID.

Limitations:

assignment is binary, i.e. a feature can only be part of one group.

Advantages:

very easy to add new grouping approaches
potential to combine (sequentially) grouping approaches
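
A small Python sketch may help picture the feature-group vector and the sequential combination of grouping approaches. Everything named here is hypothetical (`gap_groups`, `refine_groups`, the `FG.001.002` label format, and the one-dimensional `scores` that stand in for whatever finer criterion a second step would use); it only illustrates the idea that each step refines the groups of the previous one while keeping one label per feature.

```python
def gap_groups(values, max_diff):
    """Label values so that sorted neighbours no more than `max_diff` apart share a label."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [None] * len(values)
    group, previous = 0, None
    for i in order:
        if previous is None or values[i] - previous > max_diff:
            group += 1
        labels[i] = f"{group:03d}"
        previous = values[i]
    return labels


def refine_groups(groups, values, grouper):
    """Split each existing group further by applying `grouper` to its members.

    `groups` has one label per feature; the result has the same length, and
    a feature still belongs to exactly one (now more specific) group.
    """
    refined = list(groups)
    for gid in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == gid]
        sub = grouper([values[i] for i in idx])
        for i, s in zip(idx, sub):
            refined[i] = f"{gid}.{s}"
    return refined


rtimes = [30.2, 30.5, 75.0, 31.1, 75.4]
scores = [0.10, 0.90, 0.50, 0.15, 0.55]  # stand-in for a second grouping criterion

# Step 1: group on retention time. Step 2: refine within each group.
groups = ["FG." + g for g in gap_groups(rtimes, max_diff=1.0)]
groups = refine_groups(groups, scores, lambda v: gap_groups(v, max_diff=0.2))
print(groups)
# ['FG.001.001', 'FG.001.002', 'FG.002.001', 'FG.001.001', 'FG.002.001']
```

Since every step only consumes and returns a label vector, a new grouping approach just needs to supply its own `grouper`.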

If we join forces I think we should be able to add that functionality pretty easily... I hope. Can you point me to some code/documentation I could start digging into to better understand ramclustR?

cbroeckl commented 2 years ago

https://github.com/cbroeckl/RAMClustR/blob/master/R/rc.ramclustr.R

ramclustR adheres to your description - assignment is binary. the premise is to calculate the similarity matrix for all features, based on retention time similarity and quantitative similarity over the sample set (pearson's r). I didn't use peak shape initially, mostly due to a lack of skill in extracting that many EICs, but also because i was also using ramclustR for grouping DIA fragment ions, derived from MSe data, which would have required a bit more code to properly adjust to align with the MS1. to manage memory, the full similarity matrix is calculated in (generally) 2000 feature chunks and the data is stored in an ff object. i explored sparse matrix implementations in the past, but that may be worth revisiting.

After you have a full similarity matrix, you cluster using HCA. memory is again a bit of a concern here, and it would be good to look into the latest and greatest for reducing memory burden.

the dynamicTreeCut package is used to then cut the full HCA dendrogram into clusters, ideally representing a compound per cluster. The output of this is a numeric vector of length = n.features. I generally then assign a character name string to each compound as well: cluster #1 becomes 'C0001' etc.
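
The pipeline described in the last three paragraphs (similarity from retention time and correlation, hierarchical clustering, cutting into clusters, naming) can be sketched in Python as below. Several things are assumptions made for illustration and are not confirmed in this thread: the Gaussian form of the similarity and its parameters `st` and `sr`, the use of average linkage, the cluster numbering order, and above all the fixed-height cut, which is a simple stand-in for dynamicTreeCut. The chunked computation and ff storage are left out.

```python
from math import exp, sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def similarity(rt_a, rt_b, int_a, int_b, st=2.0, sr=0.5):
    """Assumed similarity: product of a retention time and a correlation term."""
    rt_term = exp(-((rt_a - rt_b) ** 2) / (2 * st ** 2))
    cor_term = exp(-((1 - pearson(int_a, int_b)) ** 2) / (2 * sr ** 2))
    return rt_term * cor_term


def cluster_features(rtimes, intensities, cut=0.5, st=2.0, sr=0.5):
    """Average-linkage clustering on 1 - similarity with a fixed-height cut."""
    n = len(rtimes)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = similarity(rtimes[i], rtimes[j], intensities[i], intensities[j], st, sr)
            dist[i][j] = dist[j][i] = 1.0 - sim

    def linkage(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        # Find the closest pair of clusters; stop once it exceeds the cut height.
        d, p, q = min((linkage(clusters[p], clusters[q]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        if d > cut:
            break
        clusters[p].extend(clusters.pop(q))

    # Name clusters 'C0001', 'C0002', ... (numbered by lowest feature index).
    labels = [None] * n
    for number, members in enumerate(sorted(clusters, key=min), start=1):
        for i in members:
            labels[i] = f"C{number:04d}"
    return labels


rtimes = [30.0, 30.4, 30.2, 90.0]
intensities = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10.5],
    [5, 3, 4, 1, 2],
    [1, 2, 3, 4, 5.2],
]
print(cluster_features(rtimes, intensities))
# ['C0001', 'C0001', 'C0002', 'C0003']
```

In the example the third feature co-elutes with the first two but is anti-correlated, and the fourth is well correlated but elutes a minute later, so neither joins the first cluster. This naive loop is cubic or worse in the number of features and only meant to show the logic.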

happy to help, of course, just let me know when you plan to start working in this area. I don't know if my code is intelligible or not ;-).

jorainer commented 2 years ago

Thanks for the update! And sorry for my ignorance, but then it sounds like ramclustR groups based only on the MS1 properties retention time and quantified feature value, and does not also consider MS2 spectra? Is that correct?

cbroeckl commented 2 years ago

Correct. the algorithm does not use MS/MS data at all to assign clusters. The only properties are retention time and quantified feature values.

That said, it was originally developed with MSe in mind, and we can perform centWave signal detection on the MSe data, where MSe is full mass range MS/MS - i.e. the least selective form of DIA imaginable. Since MSe samples all precursors as frequently as we have MS1 level scans, we can use centWave on MSe (MS2) chromatograms as well. If that has been done, RAMClustR can also cluster MS2 fragment ions to generate reconstructed MS2 spectra from MSe data.

In the MSe case, xcms is run on MS1 and MS2 data for each injection, the full xcms set is aligned, and ramclustR splits MS1 from MS2 data. MS1 data is used for quantitative signal intensity, the MS2 data are used for annotation. MSe data is not required for clustering, but if you have MSe data processed with the MS1 data, we can use both to improve annotation.

It is an open question as to whether we should build that part (MSe, AIF) into this package.

rformassspectrometry / MsFeatures