Closed jorainer closed 6 months ago
Why `logical` and not the value itself? The user-friendly functions can then perform the filtering.
For the D-ratio one would additionally need the grouping information, i.e. whether a column is a QC or a sample. How would that be supplied?
Another factor could be the QC (or sample)-to-blank ratio.
Yes, that makes sense. Maybe we can have basic functions to calculate the dratio and one for the CV. The dratio function should take an additional factor as input that defines which columns belong to QC samples and which to study samples (possibly with `NA`s for columns that should be skipped, e.g. blanks).
Let's start from there and build on top of these very basic functions.
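A minimal sketch of what such a basic CV function could look like. The name `row_cv` and its signature are assumptions for illustration only, not an existing API; the CV is taken as standard deviation divided by mean, per row:

```r
## Hypothetical helper (name and signature are assumptions): row-wise
## coefficient of variation, i.e. sd / mean, for a numeric matrix.
row_cv <- function(x, na.rm = TRUE) {
    stopifnot(is.matrix(x), is.numeric(x))
    apply(x, 1, sd, na.rm = na.rm) / rowMeans(x, na.rm = na.rm)
}

## Toy abundance matrix (made-up values): 2 features x 3 samples.
x <- rbind(f1 = c(10, 10, 10),
           f2 = c(1, 2, 3))
cv <- row_cv(x)
cv
```

The result has one value per feature (row), so it can be thresholded or stored alongside the feature definitions later.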
To get from the value to the logical is then anyway just a simple `<= 0.5` or similar.
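For non-R-experts, that step could look roughly like this. The values are made up and the 0.5 cut-off is only the example threshold mentioned above; how to treat missing values is an assumption:

```r
## Toy per-feature values (e.g. D-ratios); the numbers are made up.
vals <- c(f1 = 0.2, f2 = 0.5, f3 = 0.9, f4 = NA)

## Comparison gives one logical per feature; NA stays NA.
keep <- vals <= 0.5

## Features without a value need an explicit decision, here: drop them.
keep[is.na(keep)] <- FALSE

## Names of the features that pass the threshold.
passing <- names(vals)[keep]
passing
```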
True, but not always obvious for non-R-experts ;-)
Shall the basic function work on a matrix already or better a vector?
I'd say it should work on a `matrix`. Input is expected to be e.g. the `featureValues` from an `xcms` result object.
Okay... the second input needs to be a vector with group information then. How do we handle different nomenclature? E.g. "QC" vs "qc".
wow, you really want to know how that's going to work :)
So, here is my idea, but I'm open to discussion @philouail @michaelwitting:
## Get feature abundances from the xcms result object.
vals <- featureValues(xdata)
## Extract group assignment from the object - assuming the user
## has defined the "type of sample" in a variable called e.g. "sample_type"
f <- sampleData(xdata)$sample_type
## Let's assume f looks now like:
f <- c("blank", "QC", "study", "QC", "study", "QC")
## I would suggest to convert that to a `factor` where the reference level (i.e. the first level) is
## the QC:
f <- factor(f, levels = c("QC", "study"))
f
## [1] <NA>  QC    study QC    study QC
## Levels: QC study
## so, only QC and study would be considered. Blank has an NA, thus would not be included
## in the calculation.
So, the whole "trick" is just how to define what is QC and what study, and to allow also exclusion of samples that should not be included (like blanks). The latter is easily possible with `NA` values in the `factor`. We would need to ensure that the factor has only 2 levels, and always consider the first level to represent QC samples (or any other sample that represents technical variance).
Should then be simple to split the data using the factor (by column) and perform the calculation of the dratio.
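A rough sketch of that split-and-calculate step. The function name `dratio_by_factor` is hypothetical, and the D-ratio is assumed here to be the standard deviation of the QC columns divided by the standard deviation of the study columns:

```r
## Hypothetical helper (name is an assumption). First level of `f` = QC,
## second level = study; columns with NA in `f` are ignored.
dratio_by_factor <- function(x, f, na.rm = TRUE) {
    f <- as.factor(f)
    stopifnot(is.matrix(x), length(f) == ncol(x), nlevels(f) == 2L)
    qc <- x[, which(f == levels(f)[1L]), drop = FALSE]
    study <- x[, which(f == levels(f)[2L]), drop = FALSE]
    apply(qc, 1, sd, na.rm = na.rm) / apply(study, 1, sd, na.rm = na.rm)
}

## Toy data (made-up values); column order: blank, QC, study, QC, study, QC.
x <- rbind(f1 = c(0, 9, 4, 10, 8, 11),
           f2 = c(0, 5, 4, 10, 8, 15))
f <- factor(c("blank", "QC", "study", "QC", "study", "QC"),
            levels = c("QC", "study"))
res <- dratio_by_factor(x, f)
res
```

Using `which()` for the column selection drops the `NA` entries of `f`, so the blank column never enters either group.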
Obviously there would be different ways how the factor can be defined. Also something like this would be possible:
f <- as.integer(sampleData(xdata)$sample_type != "QC")
## f would then be 0 for QC samples, 1 for all others. Maybe there would be the need to remove blanks:
f[sampleData(xdata)$sample_type == "blank"] <- NA_integer_
## This f could now be directly passed to the function, or also converted to a factor before
f <- factor(f)
## the first level would then be the 0 (representing QCs)
So, the `rowDratio` and `dratio` functions should take a `numeric` matrix as input as well as a `factor` or any other vector that will be converted to a `factor` (parameter `f`). Inside the function we would need to check:
- `length` of `f` is equal to `ncol(x)`
- `length` of `levels(f)` is `== 2`

Does this make sense? Again, this is a core function. More user-friendly functions could be implemented for e.g. `XcmsExperiment` or `SummarizedExperiment` objects.
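The two checks could be wrapped in a small validator with readable error messages. This is only a sketch; `check_group_factor` is a made-up name and the error wording is an assumption:

```r
## Hypothetical validator (name is an assumption): converts `f` to a factor
## and enforces the two conditions on it.
check_group_factor <- function(x, f) {
    f <- as.factor(f)
    if (length(f) != ncol(x))
        stop("length of 'f' (", length(f), ") has to match ncol(x) (",
             ncol(x), ")")
    if (nlevels(f) != 2L)
        stop("'f' needs exactly 2 levels, got ", nlevels(f))
    f
}

x <- matrix(1:12, nrow = 2)

## Integer coding with NA for a sample to skip: passes, levels are "0", "1".
f <- check_group_factor(x, c(0, 1, NA, 0, 1, 0))
levels(f)

## Three distinct sample types: fails with an informative message.
msg <- tryCatch(
    check_group_factor(x, c("blank", "QC", "study", "QC", "study", "QC")),
    error = function(e) conditionMessage(e))
msg
```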
So we don't expect the user to separate their QC and study results? I was thinking they could just input `x = study_sample_matrix` and `y = qc_sample_matrix`.
Makes total sense, yes. I was maybe overcomplicating/overengineering here. So, for `dratio` it could be a definition like:
dratio(x = matrix(), y = matrix())
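To make the two-matrix idea concrete, here is a minimal sketch. The name `dratio_xy` is hypothetical, `x` is assumed to hold the study samples and `y` the QC samples as suggested above, and the D-ratio is assumed to be sd(QC) / sd(study) per feature:

```r
## Hypothetical two-matrix variant (name and argument roles are assumptions):
## x = study sample matrix, y = QC sample matrix, same features in rows.
dratio_xy <- function(x, y, na.rm = TRUE) {
    stopifnot(is.matrix(x), is.matrix(y), nrow(x) == nrow(y))
    apply(y, 1, sd, na.rm = na.rm) / apply(x, 1, sd, na.rm = na.rm)
}

## Toy data (made-up values).
study <- rbind(f1 = c(4, 8), f2 = c(4, 8))
qc <- rbind(f1 = c(9, 10, 11), f2 = c(5, 10, 15))
res <- dratio_xy(x = study, y = qc)
res
```

With this design the caller does the column subsetting, so the core function needs no grouping vector at all.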
Oh btw @michaelwitting and @jorainer, about the sample/blank ratio:
I completely agree, but I was thinking that removing features based on that is maybe not ideal? For now what I am doing is "flagging" the features that have an intensity in study samples less than 2x the intensity in blank as "possible_contaminant": I get a logical vector that I add to my `featureDefinitions` as a new column.
Essentially this would need to be considered if, after differential abundance analysis, one of the significant features has this "flag". Then I would look more into it (peak shape in blank vs sample, annotation and relevance, ...). What do you think? Would it be worth implementing this "flagging" option, or do you think it's better to remove directly?
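A sketch of what such a flagging step could look like. The function name is made up, the use of the mean to summarise intensities per group is an assumption (the comment above does not say how intensities are summarised), and a plain `data.frame` stands in for the feature definitions:

```r
## Hypothetical helper (name is an assumption): TRUE if the mean study
## intensity is below `ratio` times the mean blank intensity.
flag_possible_contaminant <- function(study, blank, ratio = 2, na.rm = TRUE) {
    stopifnot(is.matrix(study), is.matrix(blank), nrow(study) == nrow(blank))
    rowMeans(study, na.rm = na.rm) < ratio * rowMeans(blank, na.rm = na.rm)
}

## Toy data (made-up values).
study <- rbind(f1 = c(100, 120), f2 = c(30, 34))
blank <- rbind(f1 = c(10, 12), f2 = c(20, 20))

## Stand-in for the feature definitions table.
fd <- data.frame(mzmed = c(150.1, 301.2), row.names = rownames(study))
fd$possible_contaminant <- flag_possible_contaminant(study, blank)
fd
```

Nothing is removed here: the flag only travels with the feature so it can be checked after the differential abundance analysis.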
I think the function shall return a vector with the same length as the number of features supplied (aka number of rows) with the values, either the D-ratio, sample/blank ratio, presence ratio in sample group etc. The users can decide on their own what they want to filter. In `MetaboCoreUtils` it is all about basic functions. In `MsFeatures`, for example, a `featureDefinitions` with a new column can be returned.
For the blanks I agree - better to flag than to remove.
But I would start to first implement the basic core functions to calculate row wise CV and D-ratios.
In a next step I would then implement the filter functions (actually, better a `filterFeatures` method with `param`?). These should then remove features based on some criteria from a result object (could be an `XcmsExperiment` - then the method should go to `xcms`, or a `SummarizedExperiment` - then the method should go - where? `MetaboAnnotation`? `MsFeatures`?). I would not like to add that to `MetaboCoreUtils` because of the then required dependency on the `SummarizedExperiment` package.
Example for use of a filter method:
lcms_data <- filterFeatures(lcms_data, param = DratioParam(f = sampleData(lcms_data)$sample_type == "QC", threshold = 0.5))
where `lcms_data` would be an `XcmsExperiment`. And here we need to give the user the possibility to choose/define the parameter `f`. He/she should know which samples are QCs and which are study samples, and should provide this information. Various filtering steps (D-ratio, CV, ...) could then be applied to the object consecutively.
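To illustrate the method-plus-param pattern without depending on any package, here is a toy S4 sketch on a plain matrix. All names (`DemoDratioParam`, `demoFilterFeatures`) are invented for this example and are not the proposed API; `f` is assumed to be `TRUE` for QC columns, `FALSE` for study columns and `NA` for columns to skip:

```r
library(methods)

## Hypothetical parameter class: grouping vector plus threshold.
setClass("DemoDratioParam",
         representation(f = "logical", threshold = "numeric"))

setGeneric("demoFilterFeatures", function(object, param, ...)
    standardGeneric("demoFilterFeatures"))

## Toy method for a plain matrix: keeps rows with D-ratio <= threshold,
## with the D-ratio assumed to be sd(QC) / sd(study).
setMethod("demoFilterFeatures", signature("matrix", "DemoDratioParam"),
          function(object, param, ...) {
              stopifnot(length(param@f) == ncol(object))
              qc <- object[, which(param@f), drop = FALSE]
              study <- object[, which(!param@f), drop = FALSE]
              d <- apply(qc, 1, sd, na.rm = TRUE) /
                  apply(study, 1, sd, na.rm = TRUE)
              object[which(d <= param@threshold), , drop = FALSE]
          })

## Toy data (made-up values); first column is a blank and gets NA in f.
x <- rbind(f1 = c(0, 9, 4, 10, 8, 11),
           f2 = c(0, 5, 4, 10, 8, 15))
p <- new("DemoDratioParam",
         f = c(NA, TRUE, FALSE, TRUE, FALSE, TRUE), threshold = 0.5)
res <- demoFilterFeatures(x, p)
rownames(res)
```

Because the method returns the same kind of object it received, further filters with other param classes could be chained one after the other.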
Hope that makes sense?
Makes complete sense ! Thanks for the directions :)
Add functions that help pre-filtering (untargeted) metabolomics data based on (and calculating the):
These functions should take a feature abundance matrix, and possibly additional information, as input and should return a `logical` vector of length equal to the number of features (rows) of the abundance matrix. That way, different filters can be combined through boolean operations. A more user-friendly/centered function could then also be implemented in the respective packages (`xcms`? `MetaboAnnotation`?).
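As a small illustration of combining such logical vectors (the filter results below are made up, and the variable names are only placeholders):

```r
## Made-up results of two independent filters, one logical per feature.
cv_ok <- c(f1 = TRUE, f2 = TRUE, f3 = FALSE)
dratio_ok <- c(f1 = TRUE, f2 = FALSE, f3 = FALSE)

## Keep features passing both filters; `|` would keep those passing either.
keep <- cv_ok & dratio_ok

## Toy abundance matrix with one row per feature.
x <- matrix(1:9, nrow = 3, dimnames = list(names(keep), NULL))
filtered <- x[keep, , drop = FALSE]
filtered
```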