Closed: michaelwitting closed this issue 9 months ago
Hm, could be that I already implemented this somewhere (can't remember whether it was in xcms or some other package). That's indeed an important function. Maybe named something like closestPair? But it's definitely a tricky one!
I don't know, could be that I missed it so far. But I think this is definitely something for MetaboCoreUtils. It can be used in MS1 annotation, alignment etc.
Picking that issue up again: I would suggest the following definition:
mclosest <- function(x, table, ppm = 0, tolerance = Inf) {
...
}
where x and table can be two dimensional arrays (matrix or data.frame) with the same number of columns (doesn't have to be limited to 2). The function should then find for each row in x the row in table with the smallest distance considering each pair of columns (i.e. smallest difference between column 1 in both arrays, column 2 in both arrays etc). Other properties:
- ppm and tolerance should be numeric of length 1 or equal to the number of columns of x.
- The result is an integer of length equal to the number of rows of x, each element being the index (row) in table with the closest values.
- Example use case: x and table are data frames with m/z and retention time values.

Implementation suggestion:
- Calculate the difference between each row in x and the rows in table (i.e. absolute difference of values in column 1 of x and table, absolute difference of values in column 2 of x and table etc.). It might be that we will need to loop over rows in x, or alternatively do some matrix operation?
- Replace differences larger than ppm and tolerance with NA.
- Rank the differences per column and return for each row in x the index of the row in table with the lowest rank product.

The name mclosest should tell that this is a multi closest calculation... not a perfect name, so I'm open to alternative suggestions.
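The implementation suggestion above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that ended up in MetaboCoreUtils: the name mclosest_sketch is hypothetical, the accepted difference per column is assumed to be tolerance plus ppm of the value in x, and ties in the rank product are resolved by taking the first candidate.

```r
# Hypothetical sketch of the rank-product idea; NOT the MetaboCoreUtils code.
mclosest_sketch <- function(x, table, ppm = 0, tolerance = Inf) {
    x <- as.matrix(x)
    table <- as.matrix(table)
    stopifnot(ncol(x) == ncol(table))
    nc <- ncol(x)
    # Recycle length-1 settings to one value per column.
    ppm <- rep_len(ppm, nc)
    tolerance <- rep_len(tolerance, nc)
    res <- rep(NA_integer_, nrow(x))
    for (i in seq_len(nrow(x))) {
        # Absolute difference between row i of x and every row of table.
        d <- abs(sweep(table, 2, x[i, ], "-"))
        # Assumption: accepted difference = tolerance + ppm of the x value.
        maxdiff <- tolerance + ppm * abs(x[i, ]) / 1e6
        d[sweep(d, 2, maxdiff, ">")] <- NA
        # Keep only rows of table that pass the filter in every column.
        ok <- which(rowSums(is.na(d)) == 0)
        if (!length(ok)) next
        # Rank the differences per column, then multiply ranks per candidate.
        rp <- Reduce(`*`, lapply(seq_len(nc), function(j) rank(d[ok, j])))
        res[i] <- ok[which.min(rp)]
    }
    res
}

# Toy data (made up for illustration): m/z and retention time.
x <- data.frame(mz = c(100.001, 200.02), rt = c(10, 50))
tbl <- data.frame(mz = c(100.004, 100.002, 200, 300), rt = c(30, 11, 49, 80))

best <- mclosest_sketch(x, tbl)
best_tol <- mclosest_sketch(x, tbl, tolerance = c(0.01, 2))
```

Without limits, every row in x gets a match; with tolerance = c(0.01, 2) the second row of x has no candidate left and yields NA. Looping over rows keeps the sketch easy to read; a vectorized version would trade that for memory, since it needs all pairwise differences at once.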
Would that be something you would be OK with, @michaelwitting? I could let Philippine (@philouail) implement that.
Will this always match columns called mz and then the additional one?
I'm just thinking how this could be used in a flexible manner to match retention times or collisional cross sections. Shall the user be allowed to define the name of the column to be used for the additional matching? Of course it then has to be present in both input data frames.
I would require that both x and table have the same number of columns. That would keep this function very generic and could be applied to many different use cases. The user has to ensure that these are provided in the correct order (i.e. first columns being m/z, second columns retention times, third columns ...).
Examples:

mclosest(a[, c("mzmed", "rtmed")], b[, c("mz", "rt")])
would return for each row in a the index of the row in b with the best match.

mclosest(a[, c("mzmed", "rtmed")], b[, c("mz", "rt")], ppm = 0, tolerance = c(0.01, 2))
would also return the best match, but only if the difference between the m/z values in a and b is below 0.01 and the difference in retention times is below 2.

Does this make sense?
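For illustration, preparing the two inputs for such a call could look like the snippet below. The data frames a and b and their values are made up; the point is only that several columns are selected with c(...) and that both selections must have the same number of columns in the same order, whatever the columns are called.

```r
# Toy data frames (made up): column names differ and b has another column order.
a <- data.frame(mzmed = c(100.001, 200.02), rtmed = c(10, 50),
                into = c(1e4, 2e4))
b <- data.frame(rt = c(30, 11, 49), mz = c(100.004, 100.002, 200))

# Select and order the columns explicitly: m/z first, retention time second.
x <- a[, c("mzmed", "rtmed")]
table <- b[, c("mz", "rt")]

# Matching is by column position, so only the number and order of columns matter.
stopifnot(ncol(x) == ncol(table))
```

A third quantity such as CCS would simply be appended as a third column to both selections.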
Makes total sense to me.
@philouail implemented this now (PR #71). It's in the main branch and I'll push to Bioconductor.
One thing I have to do quite often is to search for a specific m/z - RT pair in a data set. A function similar to closest would be great, but one that returns not only the closest match but all matches. Maybe, to be fit for the future, it could already include the possibility of CCS values? What do you think, @jorainer?
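A minimal sketch of what is asked for here, returning all rows within the accepted differences rather than only the closest one, could look like this. The function name all_matches_sketch is hypothetical and not part of any package, and the accepted difference per column is assumed to be tolerance plus ppm of the query value.

```r
# Hypothetical helper; returns the indices of ALL rows in table that match query.
all_matches_sketch <- function(query, table, ppm = 0, tolerance = 0) {
    table <- as.matrix(table)
    query <- unlist(query, use.names = FALSE)
    stopifnot(length(query) == ncol(table))
    nc <- ncol(table)
    # Assumption: accepted difference = tolerance + ppm of the query value.
    maxdiff <- rep_len(tolerance, nc) + rep_len(ppm, nc) * abs(query) / 1e6
    # Absolute difference between the query and every row of table.
    d <- abs(sweep(table, 2, query, "-"))
    # A row matches if no column exceeds its accepted difference.
    which(rowSums(sweep(d, 2, maxdiff, ">")) == 0)
}

# Toy feature table (made up): m/z and retention time.
features <- data.frame(mz = c(100.002, 100.004, 200, 100.003),
                       rt = c(11, 12, 11, 60))

# Search for the m/z - RT pair (100.001, 10).
hits <- all_matches_sketch(c(100.001, 10), features, tolerance = c(0.01, 5))
```

Because the comparison is done per column, a CCS value would only need a third column in the table, a third element in the query and a third tolerance.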