rformassspectrometry / MetaboAnnotation

High level functionality to support and simplify metabolomics data annotation.
https://rformassspectrometry.github.io/MetaboAnnotation/

matchApply function to allow applying any function to each match #84

Closed jorainer closed 1 year ago

jorainer commented 2 years ago

It would be helpful to have a function that iterates over a Matched object and applies a user-provided function to the matches of each query. The function matchApply should take a Matched object as input and return a Matched object, possibly with a changed or reduced @matches slot. The idea is that a user might want to, for example, restrict the matches found for each query based on custom, user-provided criteria.

The definition of the function could be:

matchApply <- function(object, FUN, ...) {}

FUN is a user-defined function that must take the input arguments match, query, target, ... (match being a data.frame with the @matches for one query, query the @query slot, target the @target slot, and ... optional additional arguments). The function must return a data.frame with (at least) the same columns as match, but potentially a different number of rows.

matchApply would basically split the @matches data.frame by $query_idx and lapply over the resulting list, applying FUN to each element. The results would then be combined again with rbind and replace the @matches of the Matched object.
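The split/lapply/rbind idea can be sketched roughly as follows. This is only an illustration, not the package implementation: a plain list stands in for the Matched object (the real class uses the S4 slots @query, @target and @matches), and match_apply_sketch and identity_fun are hypothetical names. The sketch also assumes that there is at least one match.

```r
# Stand-in for a Matched object: a plain list with query, target and matches.
# (Assumption: the real class stores these in S4 slots.)
mtch <- list(
    query = data.frame(mz = c(100.01, 200.02, 300.03)),
    target = data.frame(mz = c(100.00, 100.02, 200.00, 300.10)),
    matches = data.frame(query_idx = c(1L, 1L, 2L, 3L),
                         target_idx = c(1L, 2L, 3L, 4L),
                         score = c(0.01, -0.01, 0.02, -0.07))
)

# Hypothetical helper illustrating the proposed behaviour.
match_apply_sketch <- function(object, FUN, ...) {
    # One data.frame of matches per query index.
    per_query <- split(object$matches, object$matches$query_idx)
    # FUN gets the matches of one query plus the full query and target.
    res <- lapply(per_query, FUN, query = object$query,
                  target = object$target, ...)
    # Combine the per-query results and replace the matches.
    res <- do.call(rbind, res)
    rownames(res) <- NULL
    object$matches <- res
    object
}

# Example FUN: return the matches of each query unchanged.
identity_fun <- function(match, query, target, ...) match
out <- match_apply_sketch(mtch, identity_fun)
```

With a FUN that returns its input unchanged, the matches in out should be the same as in mtch, which is a quick sanity check for the splitting and recombining.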

Maybe we could be even more flexible and not enforce returning a Matched object, e.g. by having a parameter returnMatched that, if set to FALSE, simply returns the result of the lapply without further processing into a Matched result object.
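A possible shape for the returnMatched switch, again only as a sketch under assumptions: the Matched object is mimicked by a plain list, and match_apply_sketch and n_matches are hypothetical names. With returnMatched = FALSE, FUN is free to return something other than a data.frame.

```r
# Stand-in for a Matched object (assumption: a plain list instead of S4 slots).
mtch <- list(
    query = data.frame(mz = c(100.01, 200.02)),
    target = data.frame(mz = c(100.00, 100.02, 200.00)),
    matches = data.frame(query_idx = c(1L, 1L, 2L),
                         target_idx = c(1L, 2L, 3L),
                         score = c(0.01, -0.01, 0.02))
)

# Hypothetical helper with the proposed returnMatched parameter.
match_apply_sketch <- function(object, FUN, ..., returnMatched = TRUE) {
    res <- lapply(split(object$matches, object$matches$query_idx), FUN,
                  query = object$query, target = object$target, ...)
    # Hand back the raw per-query results if no Matched-like object is wanted.
    if (!returnMatched)
        return(res)
    res <- do.call(rbind, res)
    rownames(res) <- NULL
    object$matches <- res
    object
}

# FUN returns a number instead of a data.frame: the match count per query.
n_matches <- function(match, query, target, ...) nrow(match)
counts <- match_apply_sketch(mtch, n_matches, returnMatched = FALSE)
```

Here counts is a list with one element per query index, named after the values of query_idx.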

One use case could be the following:

Given a Matched object mtch with results from a matchValues call in which a more relaxed matching was performed (e.g. with a large tolerance): iterate over all matches and keep only those with a score (difference in m/z) smaller than a stricter threshold.
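As a rough illustration of this use case, the per-query filtering could look like the following. The data.frames are made-up stand-ins for the @matches, @query and @target of a Matched object, and keep_strict and max_diff are hypothetical names; the score is assumed to be a signed m/z difference, hence the abs().

```r
# Hypothetical matches from a relaxed matching; score = difference in m/z.
matches <- data.frame(query_idx = c(1L, 1L, 2L, 3L),
                      target_idx = c(1L, 2L, 3L, 4L),
                      score = c(0.001, -0.02, 0.004, -0.07))
query <- data.frame(mz = c(100.001, 200.004, 300.03))
target <- data.frame(mz = c(100.000, 100.021, 200.000, 300.10))

# FUN: keep the matches of one query whose absolute m/z difference is at
# most max_diff.
keep_strict <- function(match, query, target, max_diff = 0.005, ...) {
    match[abs(match$score) <= max_diff, , drop = FALSE]
}

# Split by query, filter each subset, and combine the results again.
per_query <- lapply(split(matches, matches$query_idx), keep_strict,
                    query = query, target = target, max_diff = 0.005)
strict <- do.call(rbind, per_query)
rownames(strict) <- NULL
```

In this toy data, the third query loses its only match, so it no longer appears in strict.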

A more reasonable use case might be: we have a MatchedSpectra object with results from a query against a full database. The user has a set of compounds and is sure that only these could be measured in the analysed sample. So, iterate over the matches of each query and keep only those against target spectra of one of these compounds.
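A sketch of what such a FUN might look like, under the assumption that the target carries a compound_name variable. Plain data.frames stand in for the query and target spectra here, and keep_compounds and compounds are hypothetical names.

```r
# Hypothetical matches of two query spectra against three target spectra.
matches <- data.frame(query_idx = c(1L, 1L, 2L, 2L),
                      target_idx = c(1L, 2L, 2L, 3L),
                      score = c(0.9, 0.8, 0.7, 0.95))
query <- data.frame(id = c("q1", "q2"))
# Assumption: the target provides a compound_name column.
target <- data.frame(compound_name = c("caffeine", "glucose", "alanine"))

# FUN: keep the matches of one query whose target is an expected compound.
keep_compounds <- function(match, query, target, compounds, ...) {
    keep <- target$compound_name[match$target_idx] %in% compounds
    match[keep, , drop = FALSE]
}

# Split by query, filter each subset, and combine the results again.
per_query <- lapply(split(matches, matches$query_idx), keep_compounds,
                    query = query, target = target,
                    compounds = c("caffeine", "alanine"))
expected <- do.call(rbind, per_query)
rownames(expected) <- NULL
```

This is the kind of FUN that needs the whole target in addition to the current matches, since the compound names are looked up through target_idx.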

Happy to discuss this, @andreavicini, if something is not clear.

andreavicini commented 2 years ago

I think I have more or less implemented the functionality above, but I have one doubt. It seems to me that for the use cases above it might not be necessary to split the @matches data.frame by $query_idx; applying FUN to the whole @matches would be sufficient. Is that right? I suppose splitting would open up more possibilities, though?

jorainer commented 2 years ago

Sorry, maybe I was not clear. The idea was to split the @matches by $query_idx and then loop over this list, passing the whole @query and @target to FUN, but only the current subset of the matches. This would allow applying any function to the matching result of one query, either to subset and filter the matches, to return only one match (e.g. the one with the highest score), or to do other things.
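Returning only the best match per query is an example that really needs the per-query split, because the maximum has to be taken within each query's subset. A minimal sketch, with made-up data.frames as stand-ins for the slots and best_match as a hypothetical name:

```r
# Hypothetical matches; a higher score is assumed to mean a better match.
matches <- data.frame(query_idx = c(1L, 1L, 2L),
                      target_idx = c(1L, 2L, 3L),
                      score = c(0.6, 0.9, 0.8))
query <- data.frame(id = c("q1", "q2"))
target <- data.frame(id = c("t1", "t2", "t3"))

# FUN: keep only the highest-scoring match of one query
# (the first one in case of ties).
best_match <- function(match, query, target, ...) {
    match[which.max(match$score), , drop = FALSE]
}

# Split by query, pick the best match in each subset, and combine again.
per_query <- lapply(split(matches, matches$query_idx), best_match,
                    query = query, target = target)
best <- do.call(rbind, per_query)
rownames(best) <- NULL
```

Applying best_match to the whole matches data.frame instead would return a single row overall, not one row per query.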