rformassspectrometry / MetaboCoreUtils

Core utilities for metabolomics.
https://rformassspectrometry.github.io/MetaboCoreUtils/index.html

closest for m/z - RT pair #20

Closed michaelwitting closed 9 months ago

michaelwitting commented 4 years ago

One thing I have to do quite often is to search for a specific m/z - RT pair in a data set. A function similar to closest would be great, but one that returns not only the closest match but all matches. Maybe, to be future-proof, it could already include the possibility of matching CCS values?

What do you think @jorainer ?

jorainer commented 3 years ago

hm, could be that I already implemented this somewhere (can't remember if it was in xcms or some other package). That's indeed an important function. Maybe named like closestPair or similar? But it's definitely a tricky one!

michaelwitting commented 3 years ago

I don't know, could be that I missed it so far. But I think this is definitely something for MetaboCoreUtils. Can be used in MS1 annotation, alignment etc...

jorainer commented 10 months ago

Picking that issue up again: I would suggest the following definition:

mclosest <- function(x, table, ppm = 0, tolerance = Inf) {
...
}

where x and table can be two-dimensional arrays (matrix or data.frame) with the same number of columns (doesn't have to be limited to 2). The function should then find, for each row in x, the row in table with the smallest distance considering each pair of columns (i.e. smallest difference between column 1 in both arrays, column 2 in both arrays etc). Other properties:

Implementation suggestion:

The name mclosest should tell that this is a multi closest calculation... not perfect name, so open for alternative suggestions.
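To make the proposed behaviour concrete, here is a minimal sketch in R. It is not the implementation that ended up in MetaboCoreUtils: the name mclosest_sketch is hypothetical, and several details are assumptions that the description above leaves open. In particular, ppm and tolerance are recycled to one value per column, ppm is taken relative to the values in x, a row of table only counts as a candidate if every column is within its allowed difference, and among candidates the smallest difference in column 1 wins, with ties broken by column 2 and so on.

```r
# Hypothetical sketch of the proposed matching; NOT the package implementation.
# Assumptions: per-column `ppm`/`tolerance`, ppm relative to `x`, and
# column-by-column (lexicographic) ranking of the candidate rows.
mclosest_sketch <- function(x, table, ppm = 0, tolerance = Inf) {
    x <- as.matrix(x)
    table <- as.matrix(table)
    stopifnot(ncol(x) == ncol(table))
    nc <- ncol(x)
    ppm <- rep_len(ppm, nc)
    tolerance <- rep_len(tolerance, nc)
    vapply(seq_len(nrow(x)), function(i) {
        # absolute difference of row i of x to every row of table, per column
        d <- abs(sweep(table, 2, x[i, ], "-"))
        # maximal accepted difference per column
        allowed <- tolerance + ppm * abs(x[i, ]) / 1e6
        # candidate rows: within the accepted difference in every column
        ok <- which(rowSums(sweep(d, 2, allowed, "<=")) == nc)
        if (!length(ok))
            return(NA_integer_)
        cand <- d[ok, , drop = FALSE]
        # rank candidates by column 1, then column 2, ...
        ok[do.call(order, unname(as.list(as.data.frame(cand))))[1L]]
    }, integer(1))
}

# Made-up m/z and retention time values, purely for illustration.
x <- data.frame(mz = c(100.001, 200.002), rt = c(60, 120))
table <- data.frame(mz = c(100.000, 100.001, 200.000, 300.000),
                    rt = c(61, 90, 121, 120))
res <- mclosest_sketch(x, table, ppm = c(20, 0), tolerance = c(0, 5))
res
```

With these values the first row of x is matched to row 1 of table rather than row 2: row 2 has the identical m/z, but its retention time differs by 30 and so falls outside the accepted difference of 5. Rows of x without any candidate get NA.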

would that be something you would be OK with @michaelwitting ? I could let Philippine @philouail implement that.

michaelwitting commented 10 months ago

Will this always match columns called mz and then the additional one? I'm just thinking about how this could be used in a flexible manner to match retention times or collisional cross sections. Should the user be allowed to define the name of the column to be used for the additional matching? Of course, it would then have to be present in both input data frames.

jorainer commented 10 months ago

I would require that both x and table have the same number of columns. That would keep this function very generic and could be applied to many different use cases. The user has to ensure that these are provided in the correct order (i.e. first columns being m/z, second columns retention times, third columns ...).
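Because matching is by column position rather than by name, differently named columns can be brought into the same order before the call. A small sketch of that preparation step follows; the data frames and all column names in it are made up for illustration and are not taken from any package.

```r
# Hypothetical inputs: the same quantities under different column names
# and in a different column order.
features <- data.frame(id = c("F1", "F2"),
                       rtime = c(60, 120),
                       mzmed = c(100.001, 200.002),
                       ccs = c(150, 180))
reference <- data.frame(name = c("A", "B", "C"),
                        mz = c(100.000, 200.000, 300.000),
                        ccs = c(151, 179, 200),
                        rt = c(61, 121, 120))

# Select the columns by name, in the same order for both inputs:
# m/z first, retention time second, CCS third.
x <- as.matrix(features[, c("mzmed", "rtime", "ccs")])
table <- as.matrix(reference[, c("mz", "rt", "ccs")])

# Both inputs must have the same number of columns.
stopifnot(ncol(x) == ncol(table))
```

After this step the column names no longer matter; x and table could be passed to a function with the signature suggested above.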

Examples:

does this make sense?

michaelwitting commented 10 months ago

Makes total sense to me.

jorainer commented 9 months ago

@philouail implemented this now (PR #71). It's in the main branch and I'll push to Bioconductor.