Open KristinaGomoryova opened 1 year ago
Hi @KristinaGomoryova,
If I understand correctly, you want a function where you can provide a group (e.g. experimental condition, phenotype, ...) and a threshold. The function then looks at the number of missing values within each group. If at least one of the groups has a value lower than the threshold, you keep that protein, otherwise you discard it. Is that correct?
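To make the criterion concrete, here is a minimal base R sketch of that rule. The helper name keep_by_group_na() and its arguments are hypothetical and not part of QFeatures; it also assumes the threshold is the maximum number of missing values allowed in a group, so a group passes when its count is at or below the threshold.

```r
# Hypothetical helper, not a QFeatures function: returns TRUE for each row
# (protein) with at most `threshold` missing values in at least one group.
keep_by_group_na <- function(x, group, threshold) {
  stopifnot(is.matrix(x), length(group) == ncol(x))
  group <- factor(group)
  # Count NAs per protein within each group (one column per group).
  n_na <- vapply(
    levels(group),
    function(g) rowSums(is.na(x[, group == g, drop = FALSE])),
    numeric(nrow(x))
  )
  # vapply() returns a plain vector when x has a single row; restore the shape.
  n_na <- matrix(n_na, nrow = nrow(x),
                 dimnames = list(rownames(x), levels(group)))
  # Keep the protein if any group satisfies the threshold.
  rowSums(n_na <= threshold) > 0
}

# Toy data: 3 proteins, 2 conditions (A, B) with 3 replicates each.
x <- matrix(
  c( 1,  2, NA, NA, NA, NA,   # p1: 1 NA in A, 3 in B
    NA, NA,  3, NA,  5, NA,   # p2: 2 NAs in A, 2 in B
     1,  2,  3,  4,  5,  6),  # p3: complete
  nrow = 3, byrow = TRUE,
  dimnames = list(c("p1", "p2", "p3"), paste0("s", 1:6))
)
group <- rep(c("A", "B"), each = 3)

keep <- keep_by_group_na(x, group, threshold = 1)
x_kept <- x[keep, , drop = FALSE]
```

With a threshold of 1, p1 and p3 are kept because condition A has at most one missing value for them, while p2 is dropped because both conditions have two.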
Rather than filtering, I think we should aim for a function that tags proteins that match the desired criteria, either by returning a vector of booleans (for a SE) or a list of booleans, or by adding new rowData variables. That way, the user can either use filterFeatures() or use that variable for mixed imputation.
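As a rough illustration of tagging instead of filtering, the base R sketch below stores the result as a logical column next to the data. A plain data.frame stands in for rowData, and the column name completeInOneGroup is made up for this example; none of this is existing QFeatures behavior.

```r
# Toy intensity matrix: 3 proteins x 4 samples, two conditions (A, B).
x <- matrix(
  c(10, 11, NA, NA,
    NA, 12, NA, 13,
    14, 15, 16, 17),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("p1", "p2", "p3"), c("A1", "A2", "B1", "B2"))
)
group <- c("A", "A", "B", "B")

# Stand-in for rowData: one row of annotations per protein.
row_data <- data.frame(protein = rownames(x), row.names = rownames(x))

# Count NAs per protein and per group (proteins in rows, groups in columns).
na_per_group <- sapply(
  split(seq_len(ncol(x)), group),
  function(j) rowSums(is.na(x[, j, drop = FALSE]))
)

# Tag rather than filter: TRUE when a protein is fully observed in some group.
row_data$completeInOneGroup <- rowSums(na_per_group == 0) > 0

# The same tag can then drive a filter ...
x_filtered <- x[row_data$completeInOneGroup, , drop = FALSE]
# ... or be inspected or reused later, e.g. to decide how to impute each protein.
```

Keeping the tag as a variable means the decision to drop, inspect, or impute these proteins stays with the user.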
Two parallel ideas:
- Extend filterNA() by adding a groupBy argument.
- Tag the proteins and leave the filtering to filterFeatures(), but then should this "tagging" behavior also apply to filterNA()?
Yes, filterNA() is also a good suggestion. But personally, I would also want to be able to look at these proteins - these are candidates that could have present/absent patterns, might not be amenable to statistical tests without imputation, and this leads to mixed imputation... hence with important downstream implications.
So we could have a function that identifies these proteins, so that
- they can be used for mixed imputation (randna parameter)
- they can be filtered out, for example by adding an fcol parameter to filterNA(), although filterFeatures() would fit out of the box
Yes, I meant it exactly as you describe it, Chris - the threshold would be the maximum number (or percentage) of missing values allowed per condition, and if at least one condition has a lower value, we want to keep that protein.
And I like Laurent's idea that they would just be labelled, although I am not sure these are the present/absent ones - I think these are rather the ones we do want to filter out of the dataset (e.g. proteins that were identified in only 1 out of 3 replicates in most conditions), but I might be wrong here.
I think we are talking about different things:
- A filterNA() that takes groups into account. Yes, that is indeed a sensible request. This would be addressed by a new groupBy argument to filterNA(). @KristinaGomoryova, feel free to send a PR if you want.
- A function that identifies these proteins, naPatterns(), or something along those lines.
Now I get it, sorry for misunderstanding :)
It would be great to have both of these then!
I was the one misunderstanding your initial request.
Hi,
would it be possible to add a function that sets a threshold on the number of replicates, in at least one condition, in which a protein may be missing? This would be similar to what the filter_proteins() or filter_missval() functions do in the DEP R package.
Thanks for considering!