rformassspectrometry / QFeatures

Quantitative features for mass spectrometry data
https://RforMassSpectrometry.github.io/QFeatures/

Consistent use of 'pNA' #189

Closed by Charl-Hutchings 1 year ago

Charl-Hutchings commented 1 year ago

Hello there,

I would like to request that the term 'pNA' be used consistently throughout all QFeatures functions. In the output of nNA(), the pNA column represents the percentage of NA values. Within the filterNA() function, however, pNA refers to a proportion.

This can lead to confusion for users of both functions.

For example, if I were to use the code:

(nNA(qf[["assay_name"]])$nNArows$pNA >= 20) %>% which() %>% length()

I would expect to see the number of features with >= 20% missing values. However, if I then wished to remove these features using filterNA(), I would need to input 20% as a proportion i.e., 0.2:

qf %>% filterNA(pNA = 0.2, i = "assay_name")

Could this be changed to make the functions more user-friendly? Apologies in advance if I am missing something obvious.

Charlotte

lgatto commented 1 year ago

Could you elaborate and provide a reproducible example that illustrates your issue?

Here's what I see using se_na2 as an example. Out of the 689 features (quantified across 16 samples), 652 have less than 20 percent missing values and 37 have 20 percent or more.

> library(QFeatures)
> dim(se_na2)
[1] 689  16
> table(nNA(se_na2)$nNArows$pNA >= 20)
FALSE  TRUE 
  652    37 

If I now use filterNA(), I am interested in keeping features with fewer missing values, i.e. 20% or less

> dim(filterNA(se_na2, pNA = 0.20))
[1] 652  16

and I get the expected 652.

The concept of proportion is very useful for filterNA(), as it gives a direct relation to my number of samples - I want features that have 3 missing values or less (which here corresponds to the integer closest to 20%, 3/16 = 0.1875):

> dim(filterNA(se_na2, pNA = 3/16))
[1] 652  16

My interpretation is that the issue isn't the percentage/proportion distinction, but the fact that filterNA() keeps rows whose proportion of missing values is pNA or less, whereas your example counts rows with a percentage greater than or equal to your threshold.

You would get what you want by counting rows < 20 (I'm simplifying your code with sum() rather than which() |> length())

> sum(nNA(se_na2)$nNArows$pNA < 20) 
[1] 652
> dim(filterNA(se_na2, pNA = 0.2))
[1] 652  16

Note that the filterNA() manual page documents pNA as the percentage, but as a user, I tend to use/think of proportions because it is intuitive and simple to compute: "I want n values out of m samples, so I simply use pNA = n/m", rather than having to derive the percentage.
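To make the n/m reasoning concrete, here is a minimal sketch in base R. It does not call QFeatures at all: the toy matrix and the names x, n_na, prop_na, pct_na and keep are made up for illustration. It only shows how a count of missing values, a proportion (0 to 1) and a percentage (0 to 100) relate to each other.

```r
# Toy intensity matrix: 4 features (rows) x 5 samples (columns).
# All values are invented for illustration.
x <- matrix(c( 1, NA,  3,  4,  5,
              NA, NA,  3,  4,  5,
               1,  2,  3,  4,  5,
              NA, NA, NA,  4,  5),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("feat", 1:4), paste0("S", 1:5)))

n_na    <- rowSums(is.na(x))        # number of missing values per feature
prop_na <- n_na / ncol(x)           # proportion, on the 0-1 scale
pct_na  <- 100 * n_na / ncol(x)     # percentage, on the 0-100 scale

# "At most 1 missing value out of 5 samples" written as a proportion n/m.
keep <- prop_na <= 1 / 5
x_filtered <- x[keep, , drop = FALSE]

# The same selection on the percentage scale needs the threshold scaled by 100.
same_selection <- identical(keep, pct_na <= 20)

print(rownames(x_filtered))
print(same_selection)
```

The point of the sketch is that both scales select the same rows, provided the threshold is expressed on the same scale as the values it is compared with.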

Or is it the 20 vs 0.2 that annoys you? Am I missing something?

Charl-Hutchings commented 1 year ago

Hi Laurent,

Thanks for getting back to me! I understand the benefit of using proportion. My point was about the contradicting use of 'pNA' to mean percentage in one case (nNA) and proportion in the other (filterNA).

As you point out, the filterNA() manual page states that pNA is the percentage. If I were to put pNA = 20 (meaning 20%), then I would not filter correctly.

Perhaps you could consider altering the wording in the filterNA() documentation to make clear that users should provide their percentage as a proportion or fraction. Alternatively, the pNA column returned by nNA could be modified to output proportion rather than percentage, as this would make the two consistent.

Whilst the distinction is usually obvious, colleagues sometimes become confused when the number of missing values is very low. In such cases, the percentage may be 0.2% NA, but if this is misread as a proportion, it suggests 20% NA. This is where confusion arises when switching between the functions.
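One way to guard against the 0.2 versus 20 mix-up on the user side is to make the conversion explicit. The helper below is a hypothetical sketch, not part of QFeatures: the name pct_to_prop and its checks are assumptions made here for illustration.

```r
# Hypothetical helper (not a QFeatures function): convert a percentage on the
# 0-100 scale into a proportion on the 0-1 scale, failing on impossible input.
pct_to_prop <- function(pct) {
  stopifnot(is.numeric(pct), length(pct) == 1, !is.na(pct),
            pct >= 0, pct <= 100)
  pct / 100
}

p20  <- pct_to_prop(20)    # twenty percent as a proportion
p0.2 <- pct_to_prop(0.2)   # 0.2 percent as a proportion, a much smaller value

print(p20)
print(p0.2)
```

The result could then be passed on as a proportion, for example filterNA(se_na2, pNA = pct_to_prop(20)), which should behave like the pNA = 0.2 calls shown earlier in this thread.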

Does that make sense?

Best, Charlotte

lgatto commented 1 year ago

How's this: https://rformassspectrometry.github.io/QFeatures/reference/QFeatures-missing-data.html

Charl-Hutchings commented 1 year ago

Amazing - thank you!