Add functions that allow to pre-filter features

jorainer commented 11 months ago

Add functions that help pre-filtering (untargeted) metabolomics data based on (and calculating the):

Dratio
coefficient of variation (CV)
% of samples of a phenotype group in which values are available (e.g. a peak was detected)

These functions should take a feature abundance matrix, and optionally additional information, as input and should return a logical vector of length equal to the number of features (rows) of the abundance matrix. That way, different filters can be combined through boolean operations.
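To illustrate the idea of combinable logical vectors, here is a minimal base R sketch. The helper names `row_cv` and `row_present`, the toy matrix and the thresholds are assumptions made up for this example and are not part of any package.

```r
## Hypothetical helpers for illustration only.
## Row-wise coefficient of variation (sd / mean), ignoring missing values.
row_cv <- function(x) {
    apply(x, 1, sd, na.rm = TRUE) / rowMeans(x, na.rm = TRUE)
}

## Proportion of samples (columns) with a non-missing value, per feature.
row_present <- function(x) {
    rowMeans(!is.na(x))
}

## Toy abundance matrix: 3 features (rows) x 4 samples (columns).
x <- rbind(
    c(10, 10, 10, 10),
    c(1, 100, NA, NA),
    c(5, 6, NA, 5))

## Each filter yields one logical value per feature.
keep_cv <- row_cv(x) <= 0.3
keep_present <- row_present(x) >= 0.75

## Filters are combined with boolean operations before subsetting.
keep <- keep_cv & keep_present
x_filtered <- x[keep, , drop = FALSE]
```

Here only the second feature is dropped, since it fails both criteria; replacing `&` with `|` would keep features that pass either one.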

A more user-friendly/centered function could then also be implemented in the respective packages (xcms? MetaboAnnotation?).

michaelwitting commented 11 months ago

Why logical and not the value itself? The user-friendly functions can then perform the filtering. For the Dratio one would additionally need the grouping information, i.e. whether a column is a QC or a study sample. How would that be supplied?

Another factor could be the QC(or sample)-to-blank ratio.

jorainer commented 11 months ago

Yes, that makes sense. Maybe we can have basic functions to calculate the dratio or one for the CV. The dratio function should take an additional factor as input that defines which columns belong to QC samples and which to study samples, optionally with NAs for columns that should be skipped (e.g. blanks).

Let's start from there and build on top of these very basic functions.

jorainer commented 11 months ago

To get from the value to the logical is then anyway just a simple <= 0.5 or similar.

michaelwitting commented 11 months ago

True, but not always obvious for non-R-experts ;-)

michaelwitting commented 11 months ago

Shall the basic function work on a matrix already or better a vector?

jorainer commented 11 months ago

I'd say it should work on a matrix. Input is expected to be e.g. the featureValues from an xcms result object.

michaelwitting commented 11 months ago

Okay... the second input needs to be a vector with group information then. How do we handle different nomenclature? E.g. "QC" vs "qc".

jorainer commented 11 months ago

wow, you really want to know how that's going to work :)

So, my idea was the following, but I'm open to discussion @philouail @michaelwitting :

## Get feature abundances from the xcms result object.
vals <- featureValues(xdata)

## Extract group assignment from the object - assuming the user 
## has defined the "type of sample" in a variable called e.g. "sample_type" 
f <- sampleData(xdata)$sample_type

## Let's assume f looks now like:
f <- c("blank", "QC", "study", "QC", "study", "QC")

## I would suggest to convert that to a `factor` where the reference level (i.e. the first level) is
## the QC:
f <- factor(f, levels = c("QC", "study"))
f
[1] <NA>  QC    study QC    study QC   
Levels: QC study

## so, only QC and study would be considered. Blank has an NA, thus would not be included
## in the calculation.

So, the whole "trick" is just how to define what is QC and what study, and to allow also exclusion of samples that should not be included (like blanks). The latter is easily possible with NA values in the factor. We would need to ensure that the factor has only 2 levels, and always consider the first level to represent QC samples (or any other sample that represents technical variance).

Should then be simple to split the data using the factor (by column) and perform the calculation of the dratio.

Obviously there are different ways to define the factor. Something like this would also be possible:

f <- as.integer(sampleData(xdata)$sample_type != "QC")

## f would then be 0 for QC samples, 1 for all others. Maybe there would be the need to remove blanks:
f[sampleData(xdata)$sample_type == "blank"] <- NA_integer_

## This f could now be directly passed to the function, or also converted to a factor before
f <- factor(f)

## the first level would then be the 0 (representing QCs)

So, the rowDratio and dratio functions should take a numeric matrix as input as well as a factor or any other vector that will be converted to a factor (parameter f). Inside the function we would need to check

length of f equal to ncol(x)
length of levels(f) == 2

Does this make sense? Again, this is a core function. More user-friendly functions could be implemented for e.g. XcmsExperiment or SummarizedExperiment objects.
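The factor-based proposal above could be sketched roughly as follows. This is only an illustration, assuming the D-ratio is taken as the ratio of the standard deviation in QC samples to the standard deviation in study samples; the function name `row_dratio_f` is hypothetical and not an existing function.

```r
## Hypothetical sketch; name and behaviour are assumptions.
## x: numeric matrix, features in rows and samples in columns.
## f: factor (or a vector coerced to one) with exactly two levels; the
##    first level marks QC samples, NA marks columns to skip (e.g. blanks).
row_dratio_f <- function(x, f) {
    f <- as.factor(f)
    stopifnot(is.matrix(x), is.numeric(x))
    stopifnot(length(f) == ncol(x))
    stopifnot(nlevels(f) == 2L)
    is_qc <- !is.na(f) & as.integer(f) == 1L
    is_study <- !is.na(f) & as.integer(f) == 2L
    sd_qc <- apply(x[, is_qc, drop = FALSE], 1, sd, na.rm = TRUE)
    sd_study <- apply(x[, is_study, drop = FALSE], 1, sd, na.rm = TRUE)
    sd_qc / sd_study
}

## Toy data: 2 features x 6 samples, the first column being a blank.
x <- rbind(
    c(0, 10, 10, 10, 30, 10),
    c(0, 10, 10, 30, 20, 50))
f <- factor(c("blank", "QC", "study", "QC", "study", "QC"),
            levels = c("QC", "study"))

d <- row_dratio_f(x, f)

## Going from the value to a logical is then a simple comparison.
keep <- d <= 0.5
```

The blank column is ignored because its entry in `f` is `NA`, and the two checks on `f` fail early if the grouping does not match the matrix.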

philouail commented 11 months ago

So we don't expect the user to separate their QC and study results ? I was thinking they could just input x = study_sample_matrix and y= qc_sample_matrix

jorainer commented 11 months ago

Makes total sense, yes. I was maybe overcomplicating/engineering here. So, for dratio it could be a definition like:

dratio(x = matrix(), y = matrix())
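A two-matrix variant along these lines might look like the sketch below. It follows the suggestion that `x` holds the study samples and `y` the QC samples, and again assumes the D-ratio is the QC standard deviation divided by the study-sample standard deviation; `sketch_dratio` is a hypothetical name.

```r
## Hypothetical sketch of a two-matrix interface.
## x: abundances in study samples, y: abundances in QC samples;
## both with the same features (rows) in the same order.
sketch_dratio <- function(x, y) {
    stopifnot(is.matrix(x), is.matrix(y), nrow(x) == nrow(y))
    apply(y, 1, sd, na.rm = TRUE) / apply(x, 1, sd, na.rm = TRUE)
}

## Toy data: the user splits the full matrix by sample type beforehand.
vals <- rbind(
    c(10, 10, 10, 30, 10),
    c(10, 10, 30, 20, 50))
sample_type <- c("QC", "study", "QC", "study", "QC")

d <- sketch_dratio(x = vals[, sample_type == "study", drop = FALSE],
                   y = vals[, sample_type == "QC", drop = FALSE])
```

With this interface the function needs no grouping argument at all; excluding blanks simply means not passing their columns.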

philouail commented 11 months ago

Oh btw @michaelwitting and @jorainer, about the sample/blank ratio: I completely agree, but I was thinking that removing features based on that is maybe not ideal? For now what I am doing is "flagging" the features that have an intensity in study samples less than 2x the intensity in blanks as "possible_contaminant": I get a logical vector that I add to my featureDefinitions as a new column. Essentially this would need to be considered if, after differential abundance analysis, one of the significant features has this "flag". Then I would look more into it (peak shape in blank vs sample, annotation and relevance, ...). What do you think? Would it be worth implementing this "flagging" option, or do you think it's better to remove directly?
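A rough base R sketch of such a flag is shown below. The helper `flag_contaminant`, the use of the row mean as the "intensity" and the plain data frame standing in for the feature definitions are all assumptions for illustration.

```r
## Hypothetical helper: flag features whose mean abundance in study
## samples is less than `ratio` times the mean abundance in blanks.
flag_contaminant <- function(study, blank, ratio = 2) {
    m_study <- rowMeans(study, na.rm = TRUE)
    m_blank <- rowMeans(blank, na.rm = TRUE)
    flag <- m_study < ratio * m_blank
    ## Features without any value in blanks (or study samples) are not flagged.
    flag[is.na(flag)] <- FALSE
    flag
}

## Toy data: 3 features, 2 study samples and 2 blanks.
study <- rbind(c(100, 120), c(10, 12), c(50, 60))
blank <- rbind(c(5, 7), c(8, 9), c(NA, NA))

## Stand-in for a feature definitions table; the flag is added as a new
## column instead of removing any rows.
feature_info <- data.frame(feature_id = c("FT1", "FT2", "FT3"))
feature_info$possible_contaminant <- flag_contaminant(study, blank)
```

Only the second feature is flagged here, and all features stay in the table so the flag can be checked later for significant hits.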

michaelwitting commented 11 months ago

I think the function shall return a vector with the same length as the number of features supplied (aka the number of rows) with the values, either the D-ratio, sample/blank ratio, presence ratio in sample group etc. The users can decide on their own what they want to filter. In MetaboCoreUtils it is all about basic functions. In MsFeatures, for example, a featureDefinitions with a new column can be returned.

jorainer commented 11 months ago

For the blanks I agree - better to flag than to remove.

But I would start to first implement the basic core functions to calculate row wise CV and D-ratios.

In a next step I would then implement the filter functions (actually, better a filterFeatures method with param?). These should then remove features based on some criteria from a result object. That could be an XcmsExperiment (then the method should go to xcms) or a SummarizedExperiment (then the method should go where? MetaboAnnotation? MsFeatures?). I would not like to add that to MetaboCoreUtils because of the then-required dependency on the SummarizedExperiment package.

Example for use of a filter method:

lcms_data <- filterFeatures(lcms_data, param = DratioParam(f = sampleData(lcms_data)$sample_type == "QC", threshold = 0.5))

where lcms_data would be an XcmsExperiment. And here we need to give the user the possibility to choose/define the parameter f. He/she should know which samples are QCs and which are study samples, and should provide this information. Various filtering steps (D-ratio, CV, ...) could then be applied to the object consecutively.

Hope that makes sense?

philouail commented 11 months ago

Makes complete sense ! Thanks for the directions :)

rformassspectrometry / MetaboCoreUtils

Add functions that allow to pre-filter features #77