kvittingseerup / IsoformSwitchAnalyzeR

An R package to Identify, Annoatate and Visialize Isoform Switches with Functional Consequences (from RNA-seq data)
96 stars 18 forks source link

Issue with importRdata() : the designMatrix is not full rank #247

Closed Tijs-dot closed 3 weeks ago

Tijs-dot commented 1 month ago

Hi,

Thanks a lot for developing this very extensive package! I keep running into this error during importRdata:

Step 5 of 10: Testing for unwanted effects... Error in importRdata(isoformCountMatrix = as.data.frame(counts), : The design matrix suggested by the "designMatrix" is not full rank and hence cannot be analyzed.

I know other people have had this issue, but solutions proposed there did not work for me.

I believe in my case it is because there is a check in your code that converts my covariate columns into factors, even though they should remain numeric. I believe it happens in this check: Line 4645 (and 4855) of https://github.com/kvittingseerup/IsoformSwitchAnalyzeR/blob/master/R/import_data.R#L4645

for(i in 3:ncol(localDesign) ) { # i <- 4
  if( class(localDesign[,i]) %in% c('numeric', 'integer') ) {
    if( uniqueLength( localDesign[,i] ) * 2 <= length(localDesign[,i]) ) {
      localDesign[,i] <- factor(localDesign[,i])
    }
  } else {
    localDesign[,i] <- factor(localDesign[,i])
  }
}
my designMatrix looks like this sampleID condition covariate_1(Sex) covariate_2 covariate_3
Sample1_Condition_A Condition_A F 2 3
Sample1_Condition_B Condition_B F 2 3
Sample2_Condition_A Condition_A M 5 6
Sample2_Condition_B Condition_B M 5 6
Sample3_Condition_A Condition_A F 2 8
Sample3_Condition_B Condition_B F 2 8

Since I have two conditions per sample, I have duplicate rows of covariates, since each sample will always have the same values for those (like age etc), despite there being two conditions from each sample. The covariate for sex I already provide as factor. However, due to the fact that there are some covariates that match across samples (e.g. samples 1 and 3 that have the same value for covariate 2), this:

uniqueLength( localDesign[,i] ) * 2 <= length(localDesign[,i])

Will be TRUE and will cause all my covariate columns to be converted into factors. Could this be the reason that it then says the designMatrix is not full rank?

Best wishes, Tijs

sessionInfo()
R version 4.3.3 (2024-02-29) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.4 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C tzcode source: system (glibc) attached base packages: [1] grid stats4 stats graphics grDevices utils datasets methods [9] base other attached packages: [1] VennDiagram_1.7.3 futile.logger_1.4.3 [3] conflicted_1.2.0 org.Hs.eg.db_3.18.0 [5] stageR_1.24.0 IsoformSwitchAnalyzeR_2.2.0 [7] pfamAnalyzeR_1.2.0 sva_3.50.0 [9] genefilter_1.84.0 mgcv_1.9-1 [11] nlme_3.1-164 satuRn_1.10.0 [13] DEXSeq_1.48.0 RColorBrewer_1.1-3 [15] AnnotationDbi_1.64.1 DESeq2_1.42.1 [17] SummarizedExperiment_1.32.0 GenomicRanges_1.54.1 [19] GenomeInfoDb_1.38.8 IRanges_2.36.0 [21] S4Vectors_0.40.2 MatrixGenerics_1.14.0 [23] matrixStats_1.3.0 Biobase_2.62.0 [25] BiocGenerics_0.48.1 BiocParallel_1.36.0 [27] limma_3.58.1 data.table_1.15.4 [29] lubridate_1.9.3 forcats_1.0.0 [31] stringr_1.5.1 dplyr_1.1.4 [33] purrr_1.0.2 readr_2.1.5 [35] tidyr_1.3.1 tibble_3.2.1 [37] ggplot2_3.5.1 tidyverse_2.0.0 loaded via a namespace (and not attached): [1] rstudioapi_0.16.0 jsonlite_1.8.8 [3] tximport_1.30.0 magrittr_2.0.3 [5] GenomicFeatures_1.54.4 rmarkdown_2.27 [7] BiocIO_1.12.0 zlibbioc_1.48.2 [9] vctrs_0.6.5 locfdr_1.1-8 [11] memoise_2.0.1 Rsamtools_2.18.0 [13] RCurl_1.98-1.14 htmltools_0.5.8.1 [15] S4Arrays_1.2.1 progress_1.2.3 [17] AnnotationHub_3.10.1 lambda.r_1.2.4 [19] curl_5.2.1 SparseArray_1.2.4 [21] sass_0.4.9 bslib_0.7.0 [23] plyr_1.8.9 futile.options_1.0.1 [25] cachem_1.1.0 GenomicAlignments_1.38.2 [27] mime_0.12 lifecycle_1.0.4 [29] pkgconfig_2.0.3 Matrix_1.6-5 [31] R6_2.5.1 fastmap_1.2.0 [33] shiny_1.8.1.1 GenomeInfoDbData_1.2.11 [35] digest_0.6.35 colorspace_2.1-0 [37] tximeta_1.20.3 geneplotter_1.80.0 [39] RSQLite_2.3.7 hwriter_1.3.2.1 [41] filelock_1.0.3 fansi_1.0.6 [43] timechange_0.3.0 httr_1.4.7 [45] abind_1.4-5 compiler_4.3.3 [47] bit64_4.0.5 withr_3.0.0 [49] DBI_1.2.3 biomaRt_2.58.2 [51] rappdirs_0.3.3 DelayedArray_0.28.0 [53] rjson_0.2.21 tools_4.3.3 [55] interactiveDisplayBase_1.40.0 httpuv_1.6.15 [57] glue_1.7.0 restfulr_0.0.15 [59] promises_1.3.0 reshape2_1.4.4 [61] generics_0.1.3 gtable_0.3.5 [63] BSgenome_1.70.2 tzdb_0.4.0 [65] ensembldb_2.26.0 hms_1.1.3 [67] xml2_1.3.6 utf8_1.2.4 [69] XVector_0.42.0 BiocVersion_3.18.1 [71] pillar_1.9.0 later_1.3.2 [73] splines_4.3.3 BiocFileCache_2.10.2 [75] lattice_0.22-6 survival_3.6-4 [77] rtracklayer_1.62.0 bit_4.0.5 [79] annotate_1.80.0 tidyselect_1.2.1 [81] locfit_1.5-9.9 Biostrings_2.70.3 [83] pbapply_1.7-2 knitr_1.47 [85] gridExtra_2.3 ProtGenerics_1.34.0 [87] edgeR_4.0.16 xfun_0.44 [89] statmod_1.5.0 stringi_1.8.4 [91] lazyeval_0.2.2 yaml_2.3.8 [93] boot_1.3-30 evaluate_0.23 [95] codetools_0.2-19 BiocManager_1.30.23 [97] cli_3.6.2 xtable_1.8-4 [99] munsell_0.5.1 jquerylib_0.1.4 [101] Rcpp_1.0.12 dbplyr_2.5.0 [103] png_0.1-8 XML_3.99-0.16.1 [105] parallel_4.3.3 blob_1.2.4 [107] prettyunits_1.2.0 AnnotationFilter_1.26.0 [109] bitops_1.0-7 scales_1.3.0 [111] crayon_1.5.2 rlang_1.1.4 [113] formatR_1.14 KEGGREST_1.42.0
kvittingseerup commented 3 weeks ago

Hi Tijs

Thanks for reaching out - and thanks for the work you have clearly done to investigate this problem.

Yes, you are correct - this is the exact source of the problem. The solution is simply that you convert your co-variate numbers into factors/characters yourself to indicate that they are groups instead of continuous variables. You can do that with as.character() or as.factor().

We realize this R peculiarity can be annoying but there is no way around it since we both want to be able to model discrete and continuous variables 🙂

@chunxubioinfor could you add a bit of text to the vignette and help file for the importRdata() where we talk about co-variates that numeric values should only be used for continuous variables.

Cheers Kristoffer

chunxubioinfor commented 3 weeks ago

@kvittingseerup OKi! I will take care of it! :)

Tijs-dot commented 3 weeks ago

Hi Kristoffer,

Thanks a bunch for the quick reply!

I believe I may have caused some confusion with the table I included. The covariates that I mean to include are actually continuous, and wrongly get converted into factors.

Let's take Age as an example. The problem for me is that they don't all have different values. Age may have the same value for multiple samples. This then also even gets more inflated because each sample is in there twice (which is a somewhat unusual setup). Then, in the check, the length of the unique values for Age * 2 actually is smaller than the full length of that covariate column.

In my understanding, this then causes let's say all samples with Age 50 to be coded as 0 and all with an Age that is not 50 as 1, and this is why my design is not full rank. I would like to avoid this conversion from a covariate that is in fact continuous into a factor.

Or maybe I did not understand your reply correctly.

Best wishes, Tijs