microbiome / mia

Microbiome analysis
https://microbiome.github.io/mia/
Artistic License 2.0
46 stars 25 forks source link

Use `pseudocount=TRUE` in `transformAssay` even with missing or zero values #489

Open artur-sannikov opened 5 months ago

artur-sannikov commented 5 months ago

Is your feature request related to a problem? Please describe.

At the moment, transformAssay function requires specifying a numerical value manually if there are negative or missing values in the assay.

Describe the solution you'd like Even if we have missing values or zero values, it's possible to calculate the minimum non-NA value when pseudocount=TRUE. That is required anyway if The assay contains missing or negative values. "'pseudocount' must be specified manually.

My solution is simple:

# Logical vector to for all NAs and 0 values of relative abundance assay
relabundance_no_na_and_zero <- !is.na(assay(tse, "relabundance")) & assay(tse, "relabundance") != 0

# Minimum value of relative abundance assay divided by 2 to be added
# as pseudocount for calculation of centered log-ratio
# Only specifying pseudocount=TRUE for an automatic calculation does not work
# due to the presence of NAs
relabundance_pseudocount <- min(assay(tse, "relabundance")[
  relabundance_no_na_and_zero
]) / 2

The same solution can work also if there are negative values. We just need to add them in the logical vector.

antagomir commented 5 months ago

Do you mean that pseudocount=TRUE does not work if the data contains zero or missing values? I am not sure if I understand whether the proposed solution here is to replace pseudocount=TRUE automated calculation with more lengthy manual calculation?

artur-sannikov commented 5 months ago

Do you mean that pseudocount=TRUE does not work if the data contains zero or missing values?

Sorry, I just realized I wrote 0 or missing values. I meant negative or missing values!

It will tell you that the data contains negative or missing values:

Error: The assay contains missing or negative values. 'pseudocount' must be specified manually.

I then manually calculate the required pseudocount value (which is the minimum value of relabundance / 2). We can skip this step if pseudocount=TRUE calculates the minimum value or sets it to 0 even if we have missing or negative values like it's already done if there are not missing or negative values. That is what I do anyway in the code above.

I understand whether the proposed solution here is to replace pseudocount=TRUE automated calculation with more lengthy manual calculation?

Yes, sort of. The automatic calculation at the moment assumes that there are no negative or missing values and sets the pseudocount to the minimum value:

# If pseudocount TRUE, set it to non-zero minimum value, else set it to zero
pseudocount <- ifelse(pseudocount, min(mat[mat>0]), 0)
antagomir commented 5 months ago

Ok, so let me gather the suggestion:

  1. If there are no negative or missing values, then nothing needs to be changed (ok?)
  2. If there are negative or missing values, then you suggest to modify the calculation so that those would be replaced by 0?

Correct me if I misunderstood it.

The problem with (2) is that there are no common use cases (that I am aware of) where negative values could be safely replaced with zeroes. Missing values perhaps a bit more safely, we could interpret them as missing (hence 0) observations but even that is a potentially risky assumption. In most use cases I expect that the user would want the code fail if pseudocount is added on negative or missing values. At least we should carefully consider what the expected use cases would be.

artur-sannikov commented 5 months ago
  1. Yes
  2. I see. you are saying that in case of some assays, like "clr" where negative values are possible, if we take the minimum positive value, and add it to those negative values, they might still remain negative because the positive value is not big enough. However, in case of relative abundances or counts, having negative values is a bit strange anyway, so maybe in this case there should be some warning specific to "counts" or "relabundance" assays or any other assay where negative numbers are strange? Of course, in this case, we assume the names of the assays.

Also, I would not interpret missing values as 0s because 0 is not a missing value, it has a meaning in assays. A value that is missing in reality can be any number, so I agree that this assumption is risky.

antagomir commented 5 months ago

It is a conscious design decision that we do not assume anything based on (ultimately user-defined) assay names. Hence we cannot give warnings for "counts" or "relabundance" assays based on their name. We would be only looking at the actual values in the data.

1) Pseudocounts are usually only applied on data sets with non-negative values. I am not aware of other applications at least in the microbiome context. Hence my current suggestion is that this function will throw an error if one will try to apply pseudocount on data that has negative values. Or is there an alternative suggestion?

2) Missing data, I think the user should decide what to do with that. We could in principle facilitate that e.g. by providing some imputation functions but they are also otherwise available, not sure about the added value.

artur-sannikov commented 5 months ago

In the case of my data, I tried to transform relative abundances into clr. But because I had some missing values, I got the error from my first message. My next step was to get the minimum non-zero non-missing value to use as pseudocount, so I thought if this can be calculated via pseudocount=TRUE instead of going through manual process like I did.

But now I see, that my case was specific to my relabundance assay and cannot be applicable to every possible situation / assay name.

antagomir commented 5 months ago

Relative abundances are always non-negative and pseudocount=TRUE should work but perhaps it doesn't when you have missing values.

My suggestion here is to check the following (without implying who would do this):

Does that make sense?

artur-sannikov commented 5 months ago
antagomir commented 5 months ago

I agree to leave out imputation step.

So let us:

1) by default ignore NAs when calculating prevalence (NA = not detected w.r.t. prevalence calculation) 2) make sure the roxygen manpage, examples, and unit tests are clear.