Open artur-sannikov opened 5 months ago
Do you mean that pseudocount=TRUE does not work if the data contains zero or missing values? I am not sure if I understand whether the proposed solution here is to replace pseudocount=TRUE automated calculation with more lengthy manual calculation?
Do you mean that pseudocount=TRUE does not work if the data contains zero or missing values?
Sorry, I just realized I wrote 0 or missing values. I meant negative or missing values!
It will tell you that the data contains negative or missing values:
Error: The assay contains missing or negative values. 'pseudocount' must be specified manually.
I then manually calculate the required pseudocount value (which is the minimum value of relabundance / 2). We could skip this step if pseudocount=TRUE calculated the minimum value (or set it to 0) even when there are missing or negative values, as it already does when there are none. That is what I do anyway in the code above.
I am not sure if I understand whether the proposed solution here is to replace pseudocount=TRUE automated calculation with more lengthy manual calculation?
Yes, sort of. The automatic calculation at the moment assumes that there are no negative or missing values and sets the pseudocount to the minimum value:
# If pseudocount TRUE, set it to non-zero minimum value, else set it to zero
pseudocount <- ifelse(pseudocount, min(mat[mat>0]), 0)
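To illustrate why missing values matter for this line, here is a minimal base R sketch. The wrapper name auto_pseudocount and the toy matrices are hypothetical and only serve the illustration; the body is the line quoted above. Negative values are already dropped by the mat > 0 filter, but NA entries pass through the subsetting and make min() return NA.

```r
# Hypothetical wrapper around the quoted line, for illustration only.
auto_pseudocount <- function(mat, pseudocount = TRUE) {
  # If pseudocount TRUE, set it to non-zero minimum value, else set it to zero
  ifelse(pseudocount, min(mat[mat > 0]), 0)
}

# Toy relative abundances (made-up values).
mat_ok <- matrix(c(0, 0.2, 0.3, 0.5), nrow = 2)
mat_na <- matrix(c(0, 0.2, NA, 0.5), nrow = 2)

auto_pseudocount(mat_ok)  # smallest positive value
auto_pseudocount(mat_na)  # NA: the comparison NA > 0 is NA, so NA survives the subsetting
```

So without an explicit NA-handling step, the automatic value would itself be missing whenever the assay has missing entries.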
OK, so let me summarize the suggestion:
Correct me if I misunderstood it.
The problem with (2) is that there are no common use cases (that I am aware of) where negative values could be safely replaced with zeros. Missing values could perhaps be handled a bit more safely: we could interpret them as missing (hence 0) observations, but even that is a potentially risky assumption. In most use cases I expect that the user would want the code to fail if a pseudocount is added to negative or missing values. At least we should carefully consider what the expected use cases would be.
Also, I would not interpret missing values as 0s, because 0 is not a missing value; it has a meaning in assays. A value that is missing could in reality be any number, so I agree that this assumption is risky.
It is a conscious design decision that we do not assume anything based on (ultimately user-defined) assay names. Hence we cannot give warnings for "counts" or "relabundance" assays based on their name. We would only be looking at the actual values in the data.
1) Pseudocounts are usually only applied to data sets with non-negative values. I am not aware of other applications, at least in the microbiome context. Hence my current suggestion is that this function throws an error if one tries to apply a pseudocount to data that has negative values. Or is there an alternative suggestion?
2) For missing data, I think the user should decide what to do. We could in principle facilitate that, e.g. by providing some imputation functions, but those are also available elsewhere, so I am not sure about the added value.
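The two points above could be sketched roughly as follows in base R. The function name check_pseudocount_input, the error text, and the choice to return the number of missing values are all assumptions made for illustration; this is not how the package currently behaves.

```r
# Hypothetical validation helper, for illustration only.
check_pseudocount_input <- function(mat) {
  # 1) Negative values: refuse to continue (NAs are skipped in this check).
  if (any(mat < 0, na.rm = TRUE)) {
    stop("The assay contains negative values; a pseudocount cannot be applied.",
         call. = FALSE)
  }
  # 2) Missing values: do not impute, only report how many there are
  #    so that the caller can decide what to do with them.
  invisible(sum(is.na(mat)))
}

n_missing <- check_pseudocount_input(matrix(c(0, NA, 0.3, 0.5), nrow = 2))
n_missing
```

Keeping the decision about missing values with the caller matches the view that imputation is better handled outside this function.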
In the case of my data, I tried to transform relative abundances with clr. But because I had some missing values, I got the error from my first message. My next step was to get the minimum non-zero, non-missing value to use as the pseudocount, so I wondered whether this could be calculated via pseudocount=TRUE instead of going through the manual process like I did.
But now I see that my case was specific to my relabundance assay and is not applicable to every possible situation / assay name.
Relative abundances are always non-negative, so pseudocount=TRUE should work, but perhaps it doesn't when you have missing values.
My suggestion here is to check the following (without implying who would do this):
Does that make sense?
I agree to leave out the imputation step.
So let us:
1) by default ignore NAs when calculating prevalence (NA = not detected w.r.t. prevalence calculation)
2) make sure the roxygen manpage, examples, and unit tests are clear.
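One possible reading of "NA = not detected" in point 1 is sketched below in base R. The function name prevalence_na_as_undetected, the detection argument, and the toy counts are hypothetical; this is not the package's implementation, and the alternative reading (dropping NAs from the denominator) would give different numbers.

```r
# Hypothetical helper: fraction of samples in which each taxon is detected,
# counting missing values as "not detected".
prevalence_na_as_undetected <- function(mat, detection = 0) {
  # FALSE & NA is FALSE, so NA entries never count as detected.
  detected <- !is.na(mat) & mat > detection
  rowMeans(detected)
}

# Toy counts: rows are taxa, columns are samples (made-up values).
counts <- matrix(
  c(5, 3, NA,
    0, NA, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("taxonA", "taxonB"), c("s1", "s2", "s3"))
)

prev <- prevalence_na_as_undetected(counts)
prev
```

Here the denominator stays at the total number of samples, which is what distinguishes this from computing the mean with na.rm = TRUE.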
Is your feature request related to a problem? Please describe.
At the moment, the transformAssay function requires specifying a numerical value manually if there are negative or missing values in the assay.
Describe the solution you'd like
Even if we have missing values or zero values, it's possible to calculate the minimum non-NA value when pseudocount=TRUE. That is required anyway if you get the error "The assay contains missing or negative values. 'pseudocount' must be specified manually."
My solution is simple:
The same solution also works if there are negative values. We just need to add that condition to the logical vector.
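A minimal base R sketch of the idea described here, not the issue author's original code: build a logical vector that keeps only non-missing, positive entries and take the minimum over those. The helper name min_positive and the toy matrix are hypothetical; halving the minimum follows the manual calculation mentioned earlier in the thread.

```r
# Hypothetical helper: smallest positive, non-missing value in a matrix.
min_positive <- function(mat) {
  # Logical vector: drop NAs first, then zeros and negative values.
  keep <- !is.na(mat) & mat > 0
  if (!any(keep)) {
    stop("No positive, non-missing values found.", call. = FALSE)
  }
  min(mat[keep])
}

# Toy assay with a missing, a zero and a negative entry (made-up values).
relabundance <- matrix(c(0.5, NA, 0.1, 0, 0.4, -0.2), nrow = 2)

smallest <- min_positive(relabundance)
pseudocount <- smallest / 2  # half of the smallest positive value
pseudocount
```

Whether negative values should be silently skipped like this, or should raise an error instead, is exactly the design question debated in the comments.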