rformassspectrometry / QFeatures

Quantitative features for mass spectrometry data
https://RforMassSpectrometry.github.io/QFeatures/

feature request: conditional filtering of missing values #182

Open KristinaGomoryova opened 1 year ago

KristinaGomoryova commented 1 year ago

Hi,

would it be possible to add a function that sets a threshold on how many replicates of at least one condition a protein can be missing in? This would be similar to the filter_proteins() or filter_missval() functions in the DEP R package.

Thanks for considering!

cvanderaa commented 1 year ago

Hi @KristinaGomoryova ,

If I understand correctly, you want a function where you can provide a group (e.g. experimental condition, phenotype, ...) and a threshold. The function then looks at the number of missing values within each group. If at least one of the groups has a number of missing values lower than the threshold, you keep that protein; otherwise you discard it. Is that correct?

lgatto commented 1 year ago

Rather than filtering, I think we should aim for a function that tags proteins that match the desired criteria, either by returning a vector of booleans (for a SE) or a list of booleans, or by adding new rowData variables. That way, the user can either use filterFeatures() or use that variable for mixed imputation.
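To make the tagging idea concrete, here is a minimal base R sketch. It is only an illustration under stated assumptions: the toy matrix, the `rd` data frame standing in for rowData, the column name `absent_in_group`, and the criterion itself (completely missing in at least one condition) are all hypothetical and not part of QFeatures.

```r
# Toy intensity matrix: 3 proteins x 4 samples, two conditions (assumed data)
x <- matrix(c(1, 2, 3, 4,
              NA, NA, 5, 6,
              NA, 7, NA, NA),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("P1", "P2", "P3"), paste0("S", 1:4)))
group <- c("A", "A", "B", "B")

# Stand-in for rowData(): one row of annotation per protein
rd <- data.frame(protein = rownames(x), row.names = rownames(x))

# For each condition, flag proteins with no observed value at all
all_na <- sapply(unique(group), function(g) {
  rowSums(!is.na(x[, group == g, drop = FALSE])) == 0
})

# Tag (rather than remove) proteins missing entirely in at least one condition
rd$absent_in_group <- unname(rowSums(all_na) > 0)

# The tag can then drive filtering ...
x_filtered <- x[!rd$absent_in_group, , drop = FALSE]
# ... or select the proteins to inspect or impute separately
candidates <- rownames(rd)[rd$absent_in_group]
```

With the tag stored as a row-level variable, the same logical column could be used either to filter or to choose which rows get a different imputation method, without recomputing the criterion.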

cvanderaa commented 1 year ago

Two parallel ideas:

lgatto commented 1 year ago

Yes, filterNA() is also a good suggestion. But personally, I would also want to be able to look at these proteins - these are candidates that could have present/absent patterns and might not be amenable to statistical tests without imputation, which could lead to mixed imputation, with important downstream implications.

So we could have a function that identifies these proteins, so that

KristinaGomoryova commented 1 year ago

Yes, I meant it exactly as you describe it, Chris - the threshold would be the maximum number (or percentage) of missing values allowed per condition, and if at least one condition has a lower number of missing values, we want to keep that protein.
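The rule described above can be sketched in base R as follows. This is only a sketch: `keep_by_condition()`, its arguments, and the toy data are hypothetical and not an existing QFeatures function, and it assumes that "allowed" means the number of missing values is less than or equal to the threshold.

```r
# Hypothetical helper (not part of QFeatures): flag features that have at most
# `max_na` missing values in at least one condition.
keep_by_condition <- function(x, group, max_na) {
  stopifnot(is.matrix(x), length(group) == ncol(x))
  # Count NAs per feature (row) within each condition (set of columns)
  n_na <- sapply(split(seq_len(ncol(x)), group), function(j) {
    rowSums(is.na(x[, j, drop = FALSE]))
  })
  # Force a features x conditions matrix, even for a single feature
  n_na <- matrix(n_na, nrow = nrow(x), dimnames = list(rownames(x), NULL))
  # Keep a feature if any condition is within the allowed number of NAs
  rowSums(n_na <= max_na) > 0
}

# Toy data: 3 proteins, 2 conditions (A, B) with 3 replicates each
x <- matrix(
  c(1,  NA, NA, 2,  3,  4,    # P1: 2 NAs in A, 0 in B
    NA, NA, 5,  NA, NA, 6,    # P2: 2 NAs in A, 2 in B
    7,  8,  9,  NA, NA, NA),  # P3: 0 NAs in A, 3 in B
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P1", "P2", "P3"), paste0("S", 1:6))
)
group <- rep(c("A", "B"), each = 3)

# Allow at most 1 missing value in at least one condition
keep <- keep_by_condition(x, group, max_na = 1)
```

Here P1 and P3 are kept because one of their conditions is complete, while P2 is flagged for removal because both conditions have two missing values. A percentage threshold would only require dividing the counts by the number of replicates per condition.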

And I like Laurent's idea that they would just be labelled, although I am not sure these are the present/absent ones - I think these are rather the ones we do want to filter out of the dataset (e.g. proteins that were identified in only 1 out of 3 replicates in most conditions), but I might be wrong here.

lgatto commented 1 year ago

I think we are talking about different things:

KristinaGomoryova commented 1 year ago

Now I get it, sorry for the misunderstanding :)

It would be great to have both of these then!

lgatto commented 1 year ago

I was the one misunderstanding your initial request.