Closed (Charl-Hutchings closed this issue 1 year ago)
Could you elaborate and provide a reproducible example that illustrates your issue?
Here's what I see using se_na2 as an example. Of the 689 features (measured across 16 samples), 652 have less than 20 percent missing values and 37 have 20 percent or more:
> library(QFeatures)
> dim(se_na2)
[1] 689 16
> table(nNA(se_na2)$nNArows$pNA >= 20)
FALSE TRUE
652 37
If I now use filterNA(), I am interested in keeping features with fewer missing values, i.e. 20% or less:
> dim(filterNA(se_na2, pNA = 0.20))
[1] 652 16
and I get the expected 652.
The concept of proportion is very useful for filterNA(), as it gives a direct relation to my number of samples - I want features that have 3 missing values or less (which here corresponds to the integer closest to 20%, 3/16 = 0.1875):
> dim(filterNA(se_na2, pNA = 3/16))
[1] 652 16
My interpretation is that the issue isn't the percentage/proportion distinction, but the fact that filterNA() keeps rows whose proportion of missing values is pNA or less, whereas your example counts rows at or above your percentage.
You would get what you want by counting rows < 20 (I'm simplifying your code with sum() rather than which() |> length()):
> sum(nNA(se_na2)$nNArows$pNA < 20)
[1] 652
> dim(filterNA(se_na2, pNA = 0.2))
[1] 652 16
Note that the filterNA() manual page documents pNA as the percentage, but as a user, I tend to use/think of proportions because it is intuitive and simple to compute: "I want n values out of m samples, so I simply use pNA = n/m", rather than having to derive the percentage.
Or is it the 20 vs 0.2 that annoys you? Am I missing something?
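The "n out of m" reasoning above can be sketched in base R without QFeatures. This is only an illustration on a made-up toy matrix (not se_na2); the names m, p_na, threshold and keep are hypothetical, and the `<=` comparison mirrors the "pNA or less" behaviour described in this comment rather than the actual filterNA() source.

```r
# Toy matrix: 5 features (rows) x 4 samples (columns); NOT the se_na2 data.
m <- matrix(c( 1, NA,  3, 4,
              NA, NA,  3, 4,
               1,  2,  3, 4,
              NA, NA, NA, 4,
               1,  2, NA, 4),
            nrow = 5, byrow = TRUE)

# Proportion of missing values per feature, always between 0 and 1.
p_na <- rowMeans(is.na(m))

# "At most 1 missing value out of 4 samples", written directly as n/m.
threshold <- 1 / 4

# Keep features whose proportion of missing values is threshold or less.
keep <- p_na <= threshold

sum(keep)   # number of features kept
sum(!keep)  # number of features dropped
```

Writing the threshold as n/m avoids any percentage conversion, which is the convenience being argued for here.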
Hi Laurent,
Thanks for getting back to me! I understand the benefit of using proportion. My point was about the inconsistent use of 'pNA' to mean percentage in one case (nNA) and proportion in the other (filterNA).
As you point out, the filterNA() manual page states that pNA is the percentage. If I were to put pNA = 20 (meaning 20%), then I would not filter correctly.
Perhaps you could consider altering the wording in the filterNA() documentation to make clear that users should provide their percentage as a proportion or fraction. Alternatively, the pNA column returned by nNA could be modified to output proportion rather than percentage, as this would make the two consistent.
Whilst it is usually obvious, colleagues sometimes become confused when the number of missing values is very low. In such cases, the percentage may be 0.2 % NA, but if interpreted incorrectly as a proportion, this indicates 20% NA. When switching between the functions, confusion arises.
Does that make sense?
Best, Charlotte
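To make the low-missingness case concrete, here is a small arithmetic sketch in base R. The numbers (1 missing value in 500 samples) are invented for illustration, and all variable names are hypothetical.

```r
# Hypothetical feature: 1 missing value across 500 samples.
n_missing <- 1
n_samples <- 500

prop_na <- n_missing / n_samples  # 0.002 when expressed as a proportion
pct_na  <- 100 * prop_na          # 0.2 when expressed as a percentage

# If the percentage 0.2 is mistaken for a proportion, it reads as 20% NA,
# i.e. 100 times more missing data than is actually present.
misread_pct <- 100 * pct_na
```

The same value, 0.2, therefore means either 0.2% or 20% missing depending on which convention the reader assumes.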
How's this: https://rformassspectrometry.github.io/QFeatures/reference/QFeatures-missing-data.html
nNA()$...$pNA now returns proportions (values aren't multiplied by 100 anymore). filterNA() checks that 0 <= pNA <= 1, otherwise returns an error.
Amazing - thank you!
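For readers curious what such a range check might look like, here is a minimal base R sketch. The helper check_pna() and its error message are hypothetical and are not taken from the QFeatures source; only the condition 0 <= pNA <= 1 comes from the comment above.

```r
# Hypothetical helper, NOT the QFeatures implementation: reject any pNA
# that is not a single number in [0, 1].
check_pna <- function(pNA) {
  ok <- is.numeric(pNA) && length(pNA) == 1 && !is.na(pNA) &&
    pNA >= 0 && pNA <= 1
  if (!ok)
    stop("'pNA' must be a single proportion between 0 and 1.")
  invisible(pNA)
}

check_pna(0.2)  # a valid proportion passes silently

# Passing a percentage such as 20 now fails loudly instead of filtering
# with a meaningless threshold; capture the message for inspection.
res <- tryCatch(check_pna(20), error = function(e) conditionMessage(e))
```

Failing early on pNA = 20 addresses the case raised earlier in the thread, where a percentage would otherwise be accepted and filter incorrectly.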
Hello there,
I would like to request that the term 'pNA' be used consistently throughout all QFeatures functions. When looking at the output of nNA(), the pNA column represents the percentage of NA values. However, within the filterNA() function, pNA refers to a proportion.
This can lead to confusion for users of both functions.
For example, if I were to use the code:
(nNA(qf[["assay_name"]])$nNArows$pNA >= 20) %>% which() %>% length()
I would expect to see the number of features with >= 20% missing values. However, if I then wished to remove these features using filterNA(), I would need to input 20% as a proportion i.e., 0.2:
qf %>% filterNA(pNA = 0.2, i = "assay_name")
Could this be changed to make the functions more user-friendly? Apologies in advance if I am missing something obvious.
Charlotte