feature request: conditional filtering of missing values

rformassspectrometry / QFeatures

Quantitative features for mass spectrometry data

https://RforMassSpectrometry.github.io/QFeatures/

feature request: conditional filtering of missing values #182

Open KristinaGomoryova opened 1 year ago

KristinaGomoryova commented 1 year ago

Hi,

Would it be possible to add a function that sets a threshold on the number of replicates in which a protein may be missing, applied to at least one condition? This would be similar to the filter_proteins() and filter_missval() functions in the DEP R package.

Thanks for considering!

cvanderaa commented 1 year ago

Hi @KristinaGomoryova ,

If I understand correctly, you want a function where you can provide a group (e.g. experimental condition, phenotype, ...) and a threshold. The function then counts the missing values within each group. If at least one of the groups has a number of missing values lower than the threshold, you keep that protein; otherwise you discard it. Is that correct?
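To make the proposed rule concrete, here is a minimal base R sketch on a toy matrix. It is only an illustration of the logic described above, not QFeatures code: the helper names `naPerGroup()` and `keepByGroup()` are hypothetical, and the strict "lower than" comparison is an assumption taken from the wording of this comment.

```r
# Toy intensity matrix: 4 proteins (rows) x 6 samples (columns), NA = missing.
x <- matrix(
    c( 1,  2,  3,  4,  5,  6,    # P1: complete
       1,  2,  3, NA, NA, NA,    # P2: complete in A, absent in B
      NA, NA,  3, NA, NA,  6,    # P3: 2 missing values in each group
      NA, NA, NA, NA, NA, NA),   # P4: all missing
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("P", 1:4), paste0("S", 1:6))
)
group <- c("A", "A", "A", "B", "B", "B")

# Hypothetical helper (not part of QFeatures):
# number of missing values per protein within each group.
naPerGroup <- function(x, group) {
    do.call(cbind, lapply(
        split(seq_len(ncol(x)), group),
        function(j) rowSums(is.na(x[, j, drop = FALSE]))
    ))
}

# Hypothetical helper: keep a protein if at least one group has
# fewer missing values than `threshold` (strict comparison, an assumption).
keepByGroup <- function(x, group, threshold) {
    nna <- naPerGroup(x, group)
    rowSums(nna < threshold) > 0
}

keep <- keepByGroup(x, group, threshold = 2)
x[keep, , drop = FALSE]
```

With `threshold = 2`, P1 and P2 are kept (each has at least one group with fewer than 2 missing values), while P3 and P4 are discarded.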

lgatto commented 1 year ago

Rather than filtering, I think we should aim for a function that tags the proteins matching the desired criteria, either by returning a vector of booleans (for a SE) or a list of booleans, or by adding new rowData variables. That way, the user can either use filterFeatures() or use that variable for mixed imputation.
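A rough sketch of the tagging idea in base R, assuming a plain matrix and a data.frame standing in for the rowData (no QFeatures objects are used here). The variable name `keepByGroup`, the proportion-based threshold `pNA = 0.5` and the "at or below" comparison are all assumptions made for illustration.

```r
# Toy data: 3 proteins x 4 samples, two groups of two replicates.
x <- matrix(
    c( 1, NA,  3,  4,    # P1: half missing in A, complete in B
      NA, NA, NA, NA,    # P2: all missing
       1,  2, NA, NA),   # P3: complete in A, absent in B
    nrow = 3, byrow = TRUE,
    dimnames = list(paste0("P", 1:3), paste0("S", 1:4))
)
group <- c("A", "A", "B", "B")

# Stand-in for the rowData: one row of annotations per protein.
rd <- data.frame(protein = rownames(x), row.names = rownames(x))

# Proportion of missing values per protein within each group.
pna <- do.call(cbind, lapply(
    split(seq_len(ncol(x)), group),
    function(j) rowMeans(is.na(x[, j, drop = FALSE]))
))

# Tag instead of filtering: TRUE if at least one group has a proportion
# of missing values at or below pNA (hypothetical variable name).
pNA <- 0.5
rd$keepByGroup <- rowSums(pna <= pNA) > 0

# The tag can then drive either filtering or a closer look at the rest.
x_kept    <- x[rd$keepByGroup, , drop = FALSE]
x_flagged <- x[!rd$keepByGroup, , drop = FALSE]
```

Because the result is stored as a variable rather than applied directly, the same tag could be used to subset the data, to inspect the flagged proteins, or to decide how each protein is imputed.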

cvanderaa commented 1 year ago

Two parallel ideas:

I think what Kristina wants could be implemented in filterNA() by adding a groupBy argument.
I like the idea of adding a tag and then calling filterFeatures(), but should we then apply this "tagging" behavior to filterNA() as well?

lgatto commented 1 year ago

Yes, filterNA() is also a good suggestion. But personally, I would also want to be able to look at these proteins: they are candidates that could have present/absent patterns and might not be amenable to statistical tests without imputation, which leads to mixed imputation and hence has important downstream implications.

So we could have a function that identifies these proteins, so that

we can explore and visualise them (for instance with a heatmap)
we can impute them with mixed imputation (randna parameter)
we can remove them; for this we could consider adding an fcol parameter to filterNA(), although filterFeatures() would fit out of the box

KristinaGomoryova commented 1 year ago

Yes, I meant it exactly as you describe it, Chris - the threshold would be the maximum number (or percentage) of missing values allowed per condition, and if at least one condition has a lower number of missing values, we want to keep that protein.

And I like Laurent's idea that they would just be labelled, although I am not sure these are the present/absent ones. I think these are rather the ones we do want to filter out of the dataset (e.g. proteins that were identified in only 1 out of 3 replicates in most conditions), but I might be wrong here.

lgatto commented 1 year ago

I think we are talking about different things:

Indeed, @KristinaGomoryova wants a better way to specify a threshold in filterNA(), that takes groups into account. Yes, that is indeed a sensible request. This would be addressed by a new groupBy argument to filterNA(). @KristinaGomoryova, feel free to send a PR if you want.
I was referring to something else, based on a similar logic: are there any proteins that are (mostly) present in one or several groups and (mostly) absent in the others? I think this would deserve a new function, for example naPatterns(), or something along those lines.
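As a rough illustration of that second idea, the base R sketch below flags proteins that are mostly observed in at least one group and mostly missing in at least one other. It is not the proposed naPatterns() function: the name `flagPresentAbsent()` and the `presentMin` and `absentMax` cut-offs are hypothetical choices for this example.

```r
# Toy data: 4 proteins x 6 samples, two groups of three replicates.
x <- matrix(
    c(10, 11, 12, NA, NA, NA,    # P1: present in A, absent in B
      10, NA, 12,  9, 11, NA,    # P2: scattered missing values
      NA, NA, NA,  9, 11, 10,    # P3: absent in A, present in B
      10, 11, 12,  9, 11, 10),   # P4: complete
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("P", 1:4), paste0("S", 1:6))
)
group <- rep(c("A", "B"), each = 3)

# Hypothetical helper: TRUE for proteins observed in at least `presentMin`
# of the samples of one group and in at most `absentMax` of the samples
# of another group.
flagPresentAbsent <- function(x, group, presentMin = 0.6, absentMax = 0) {
    # Proportion of observed (non-missing) values per group.
    pObs <- do.call(cbind, lapply(
        split(seq_len(ncol(x)), group),
        function(j) rowMeans(!is.na(x[, j, drop = FALSE]))
    ))
    apply(pObs, 1, max) >= presentMin & apply(pObs, 1, min) <= absentMax
}

flag <- flagPresentAbsent(x, group)
x[flag, , drop = FALSE]
```

Here P1 and P3 are flagged as present/absent candidates, whereas P2 (missing values spread across both groups) and P4 (complete) are not. Relaxing `absentMax` would also capture "mostly absent" groups.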

KristinaGomoryova commented 1 year ago

Now I get it, sorry for the misunderstanding :)

It would be great to have both of these then!

lgatto commented 1 year ago

I was the one misunderstanding your initial request.