isglobal-brge / rexposome

Bioconductor package to characterize, analyze and integrate exposome with omic and disease data
MIT License

Number of rows of matrices must match - standardize() step #3

Closed camilafarias112 closed 3 years ago

camilafarias112 commented 3 years ago

Hello developers,

I have been using rexposome a lot in my projects, so first of all I would like to say how useful your workflow is. However, I recently got stuck at the standardize() step in one of my analyses, and I have not had this issue before. Would you have a clue about what might be happening?

I have the R object exp_down:

[Image: Screen Shot 2021-07-06 at 2 55 29 PM]

With inputs of dimensions:

[Image: Screen Shot 2021-07-06 at 3 05 36 PM]

exp_down is the output of the imputation step, which I have already checked.

When I run exp_std_down <- standardize(exp_down, method = "normal"), I get the following error message:

[Image: Screen Shot 2021-07-06 at 3 07 51 PM]

What are the matrices that the error is referring to?

I would very much appreciate the help. Thank you!

Camila Farias Amorim

ESCRI11 commented 3 years ago

Hello Camila,

To look into this issue and solve it swiftly, we would appreciate a reproducible example. If possible, please anonymize your data, alter all the values, and send it along with the pipeline you ran before this step.

If that is not possible, I will need some help from your side. I have extracted the code that runs inside the standardize() function.

It looks like the error you are getting comes from the last line of this chunk: the cbind() function expects two matrices with the same number of rows, and that appears not to be the case here. Please report back the results of dim(dd), dim(t(assayData(object)[["exp"]][select.no, ])) and length(select.no).

Hopefully this information will shed some light on what is causing the problem.

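library(rexposome)
library(Biobase)

# The object that was passed to standardize()
object <- exp_down
# na.rm and warnings are arguments of standardize(); TRUE is assumed here
# so that the chunk can run on its own
na.rm <- TRUE
warnings <- TRUE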

select <- exposureNames(object)
select.no <- exposureNames(object)[!exposureNames(object) %in% select]

if (sum(fData(object)[select, ".type"] == "factor") != 0) {
    if (warnings) {
        warning("Given categorical exposures.")
    }
    select.no <- c(select.no,
        select[fData(object)[select, ".type"] == "factor"]
    )
    select <- select[
        fData(object)[select, ".type"] != "factor"
    ]
}

dd <- expos(object)[ , select, drop=FALSE]

center <- apply(dd, 2, mean, na.rm = na.rm)
vari   <- apply(dd, 2, sd, na.rm = na.rm)

dd <- apply(dd, 2, function(x) as.numeric(as.character(x)))
dd <- scale(dd, center = center, scale = vari)
dd <- cbind(dd,
            t(assayData(object)[["exp"]][select.no, ]))

Xavier Escribà Montagut

camilafarias112 commented 3 years ago

Hello Xavier,

Thank you for the quick reply. Before I send you a reproducible example, I believe I found what differs between my previous successful analysis and this one. In the exp_down object, 1 out of 761 exposures was classified as a categorical variable (see the picture attached previously). I can see this in the "exposures description" when inspecting the ExposomeSet object.

Do you know how I could find out which one of my 761 exposures was classified as categorical? I am having a hard time finding this anomaly in my exposures input. I know that I should not have any categorical exposures; they should all be continuous.

I see in the error message that categorical variables in the exposure object won't be standardized, so maybe this one exposure does not match the description matrix.

Again: thank you!

Camila

camilafarias112 commented 3 years ago

Hi Xavier,

I found the issue. One of my exposures, "KRT28", was treated as a factor instead of numeric when the ExposomeSet object was created. Below are its values for each of my samples (rows).

[Image: Screen Shot 2021-07-07 at 1 05 26 PM]

And I found this by extracting with: cbind(exp_down@featureData@data[[".type"]],rownames(exp_down@assayData[["exp"]]))

[Image: Screen Shot 2021-07-07 at 1 12 01 PM]
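A slightly more direct way to get the same answer is to filter the feature data on its .type column. The sketch below uses a small hypothetical data frame, fd, as a stand-in for the result of Biobase::fData(exp_down); the family label, the exposure names other than KRT28, and the types assigned to them are made up for illustration.

```r
# Hypothetical stand-in for Biobase::fData(exp_down): one row per exposure,
# with the ".type" column that records how each exposure was classified.
fd <- data.frame(
  Family = c("Proteins", "Proteins", "Proteins"),
  .type  = c("numeric", "factor", "numeric"),
  row.names = c("EXPO_A", "KRT28", "EXPO_B"),
  stringsAsFactors = FALSE
)

# Names of the exposures that were classified as categorical
factor_exposures <- rownames(fd)[fd[[".type"]] == "factor"]
print(factor_exposures)

# Number of exposures per type, to confirm how many are affected
type_counts <- table(fd[[".type"]])
print(type_counts)
```

On the real object, replacing fd with Biobase::fData(exp_down) should give the same kind of result.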

I was expecting this exposure to be continuous, even though the distribution was poor. It was interesting that when investigating the normality with nm_down <- normalityTest(exp_down), only the 760 other exposures were tested. And that's why I couldn't figure out which one of the exposures was the problem.

The standardize() function could not succeed because KRT28 was in my phenotype input but was not tested for normality. This exposure had no TRUE/FALSE normality result, so the function did not work.

[Image: Screen Shot 2021-07-07 at 1 16 34 PM]

= total of 760, not 761.
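The exposure missing from the normality results can also be found by comparing names rather than counts. The sketch below assumes that the result of normalityTest() is a data frame with an exposure column; all_exposures and nm_down are hypothetical stand-ins for exposureNames(exp_down) and normalityTest(exp_down), with made-up values.

```r
# Hypothetical stand-in for exposureNames(exp_down)
all_exposures <- c("EXPO_A", "KRT28", "EXPO_B")

# Hypothetical stand-in for normalityTest(exp_down); the column names are
# an assumption about the shape of its output.
nm_down <- data.frame(
  exposure  = c("EXPO_A", "EXPO_B"),
  normality = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Exposures present in the object but absent from the normality results
not_tested <- setdiff(all_exposures, nm_down$exposure)
print(not_tested)
```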

When I removed KRT28 completely from the pipeline, everything worked.

I hope I explained it clearly enough; please let me know if that helps. Thank you so much for sending the description of the function!

All the best,

Camila

ESCRI11 commented 3 years ago

Hello Camila,

Some remarks regarding your inputs and questions:

  1. You can extract the feature data by using the function Biobase::fData(exp_std_down). This will yield a table such as:
                        Family                                                             Name .fct .trn .std .imp   .type
AbsPM25         Air Pollutants                    Measurement of the blackness of PM2.5 filters                     numeric
As                      Metals                                                           Asenic                     numeric
BDE100                   PBDEs                               Polybrominated diphenyl ether -100                     numeric

There you can see the type of each of your exposures.

  2. It looks like you have a variable with few unique values. You can tune the exposures.asFactor argument of the function rexposome::loadExposome (or rexposome::readExposome). The definition of this argument (taken from the documentation) is: (default 5) The exposures with more than this number of unique items will be considered as "continuous" while the exposures with less or equal number of items will be considered as "factor". Tuning this argument may also solve your problem without the need to remove this exposure.
  3. Having categorical variables in your ExposomeSet does not prevent you from using the function rexposome::standardize; it should work and modify only the numerical exposures (as one may expect!). Here is an example using the test data bundled with the package (4 categorical exposures and 84 continuous exposures):
library(rexposome)

path <- file.path(path.package("rexposome"), "extdata")
description <- file.path(path, "description.csv")
phenotype <- file.path(path, "phenotypes.csv")
exposures <- file.path(path, "exposures.csv")

exp <- readExposome(exposures = exposures, description = description, phenotype = phenotype, 
                    exposures.samCol = "idnum", description.expCol = "Exposure", description.famCol = "Family", 
                    phenotype.samCol = "idnum")

standardize(exp, method = "normal")
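The exposures.asFactor rule quoted above can be illustrated in a few lines of base R. This is only a sketch of the documented rule, not the package's implementation; guess_type and the sample vectors are hypothetical, and how the package treats missing values is not covered here.

```r
# Sketch of the documented rule: more than `as_factor` unique values means
# "continuous" (numeric), otherwise "factor". Not the package's own code.
guess_type <- function(x, as_factor = 5) {
  n_unique <- length(unique(x))
  if (n_unique > as_factor) "numeric" else "factor"
}

# A measurement with many distinct values
many_values <- c(0.12, 0.45, 0.33, 0.90, 1.20, 0.75, 0.08)
# A poorly distributed measurement with only three distinct values
few_values  <- c(0, 0, 0, 1.5, 0, 2.1, 0)

type_many    <- guess_type(many_values)
type_few     <- guess_type(few_values)
# Lowering the threshold keeps the second measurement numeric
type_few_low <- guess_type(few_values, as_factor = 2)

print(c(type_many, type_few, type_few_low))
```

In the real call, the equivalent would be passing a lower exposures.asFactor value to readExposome or loadExposome.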

For that reason, I would really appreciate it if you could take the time to fully anonymize your data and send it to me so I can run a test. There may be a bug I am missing, as you should not have any problem running the pipeline with a categorical variable.

Nevertheless, I will close the issue for now as I see you are able to continue with your analysis.

Thanks for your report,

Xavier.

ESCRI11 commented 3 years ago

@camilafarias112 There is no longer any need for you to send data for testing purposes. I found out that the actual issue is that there is only ONE categorical exposure; it works perfectly when more than one is present. I have already fixed that in the latest commit, 1f5d1ac. I will upload this fix to Bioconductor tomorrow.
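The single-exposure case is consistent with how R subsetting works: selecting one row of a matrix drops the result to a plain vector, and t() of a vector is a 1-row matrix, so cbind() no longer sees matching row counts. The base R sketch below reproduces this with toy data. The drop = FALSE variant is shown as one way to avoid the problem and is an assumption, not necessarily what commit 1f5d1ac does.

```r
# Toy assay matrix: exposures in rows, samples in columns
exp_mat <- matrix(1:12, nrow = 3,
                  dimnames = list(c("E1", "E2", "KRT28"),
                                  paste0("S", 1:4)))

# Stand-in for the scaled numeric exposures: samples in rows (4 x 2)
dd <- t(exp_mat[c("E1", "E2"), ])

# One categorical exposure: the row subset collapses to a vector,
# so t() returns a 1 x 4 matrix instead of 4 x 1
one_dropped <- t(exp_mat["KRT28", ])
print(dim(one_dropped))

# cbind() then fails; capture the error message instead of stopping
res <- tryCatch(cbind(dd, one_dropped),
                error = function(e) conditionMessage(e))
print(res)

# Keeping the matrix shape with drop = FALSE gives a 4 x 1 matrix
one_kept <- t(exp_mat["KRT28", , drop = FALSE])
fixed <- cbind(dd, one_kept)
print(dim(fixed))
```

With two or more rows selected, the subset stays a matrix, which matches the observation that the function works when more than one categorical exposure is present.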

Thanks for pointing out the issue and helping to solve it!

Xavier.

camilafarias112 commented 3 years ago

That is absolutely great! Happy to help, I'm a big fan and user of your package.

Thank you for all the help. All the best, Camila.