closest for m/z - RT pair

michaelwitting commented 4 years ago

One thing I have to do quite often is to search for a specific m/z - RT pair in a data set. A function similar to closest would be great, but one that returns not only the closest match but all matches. Maybe, to be fit for the future, it could already include the possibility of CCS values?

What do you think @jorainer ?

jorainer commented 3 years ago

hm, could be that I already implemented this somewhere (can't remember if it was in xcms or some other package). That's indeed an important function. Maybe named like closestPair or similar? But it's definitely a tricky one!

michaelwitting commented 3 years ago

I don't know, could be that I missed it so far. But I think this is definitely something for MetaboCoreUtils. Can be used in MS1 annotation, alignment etc...

jorainer commented 10 months ago

Picking that issue up again: I would suggest the following definition:

mclosest <- function(x, table, ppm = 0, tolerance = Inf) {
...
}

where x and table can be two dimensional arrays (matrix or data.frame) with the same number of columns (doesn't have to be limited to 2). The function should then find for each row in x the row in table with the smallest distance considering each pair of columns (i.e. smallest difference between column 1 in both arrays, column 2 in both arrays etc). Other properties:

ppm and tolerance should be numeric of length 1 or equal to the number of columns of x.
the result should be an integer of length equal to the number of rows of x, each element being the index (row) in table with the closest values.
I would not use any distance measure (like Euclidean distance or similar) to calculate the similarity, because the columns are expected to contain values with different units (e.g. if x and table are data frames with m/z and retention time values).

Implementation suggestion:

calculate the absolute difference between corresponding columns in x and table (i.e. absolute difference of values in column 1 of x and table, absolute difference of values in column 2 of x and table etc.) - might be that we will need to loop over rows in x - or alternatively do some matrix operation?
replace differences larger than allowed by ppm and tolerance with NA
rank differences (or replace with their order)
return for each row in x the index of the row in table with the lowest rank product
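
The steps above could be sketched roughly as follows. This is only an illustration of the rank-product idea, not the code that was later merged: the name mclosest_sketch is hypothetical, the row-wise loop is the simplest option rather than the fastest, and the allowed difference per column is assumed to be tolerance plus the ppm of the query value.

```r
# Hypothetical sketch of the proposal; not the implementation in MetaboCoreUtils.
mclosest_sketch <- function(x, table, ppm = 0, tolerance = Inf) {
    x <- as.matrix(x)
    table <- as.matrix(table)
    nc <- ncol(x)
    stopifnot(ncol(table) == nc,
              length(ppm) %in% c(1L, nc),
              length(tolerance) %in% c(1L, nc))
    # Recycle length-1 settings to one value per column.
    ppm <- rep_len(ppm, nc)
    tolerance <- rep_len(tolerance, nc)
    res <- rep(NA_integer_, nrow(x))
    if (nrow(table) == 0L)
        return(res)
    for (i in seq_len(nrow(x))) {
        # Step 1: absolute column-wise differences to every row of table.
        d <- abs(sweep(table, 2, x[i, ], "-"))
        # Step 2: set differences above the allowed maximum to NA
        # (assumption: maximum = tolerance + ppm of the query value).
        maxd <- tolerance + abs(x[i, ]) * ppm * 1e-6
        d[sweep(d, 2, maxd, ">")] <- NA
        # Step 3: rank the differences within each column, keeping NA.
        r <- matrix(apply(d, 2, rank, na.last = "keep"), nrow = nrow(table))
        # Step 4: rank product per row of table; NA if any column failed.
        rp <- apply(r, 1, prod)
        if (!all(is.na(rp)))
            res[i] <- which.min(rp)
    }
    res
}

# Made-up example values.
x <- data.frame(mz = c(100.001, 200.02), rt = c(50, 120))
tbl <- data.frame(mz = c(100, 200, 100.5), rt = c(51, 121, 80))
mclosest_sketch(x, tbl)
mclosest_sketch(x, tbl, tolerance = c(0.01, 2))
```

In this sketch a row of x without any acceptable match yields NA, and ties in the rank product are resolved by taking the first row of table; both are choices made for the illustration and would need to be decided for the real function.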

The name mclosest should tell that this is a multi closest calculation... not perfect name, so open for alternative suggestions.

would that be something you would be OK with @michaelwitting ? I could let Philippine @philouail implement that.

michaelwitting commented 10 months ago

Will this always match columns called mz and then the additional one? I'm just thinking how this could be used in a flexible manner to match retention times or collisional cross sections. Should the user be allowed to define the name of the column that is used for the additional matching? Of course it then has to be present in both input data frames.

jorainer commented 10 months ago

I would require that both x and table have the same number of columns. That would keep this function very generic and could be applied to many different use cases. The user has to ensure that these are provided in the correct order (i.e. first columns being m/z, second columns retention times, third columns ...).

Examples:

mclosest(a[, c("mzmed", "rtmed")], b[, c("mz", "rt")]) would return for each row in a the index of the row in b with the best match.
mclosest(a[, c("mzmed", "rtmed")], b[, c("mz", "rt")], ppm = 0, tolerance = c(0.01, 2)) would also return the best match, but only if the difference between the m/z values in a and b is below 0.01 and the difference in retention times is below 2.
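
To illustrate the point about column order: the columns would be compared by position, not by name, so the two inputs only need the same number of columns in the same order. A small example with made-up values (the data frames a and b and their contents are purely illustrative):

```r
# Toy feature tables with different column names and column order.
a <- data.frame(mzmed = c(100.001, 200.02), rtmed = c(50, 120),
                into = c(1e4, 5e3))
b <- data.frame(rt = c(51, 121, 80), mz = c(100, 200, 100.5))

# Select the columns in the same order for both inputs: m/z first, then RT.
qa <- as.matrix(a[, c("mzmed", "rtmed")])
qb <- as.matrix(b[, c("mz", "rt")])

# Only the number and order of columns has to agree, not their names.
stopifnot(ncol(qa) == ncol(qb))

# Absolute differences between the first row of a and every row of b,
# computed column by column (m/z against m/z, RT against RT).
diffs <- abs(sweep(qb, 2, qa[1, ], "-"))
diffs
```

Passing b without reordering its columns would compare m/z against retention times, which is why the order is the user's responsibility here.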

does this make sense?

michaelwitting commented 10 months ago

Makes total sense to me.

jorainer commented 9 months ago

@philouail implemented this now (PR #71). It's in the main branch and I'll push to Bioconductor.

rformassspectrometry / MetaboCoreUtils

closest for m/z - RT pair #20