Closed khughitt closed 3 years ago
Overview
When using GEOquery::getGEO() to download a particular dataset (GSE6691), the resulting eset skips the first ~700 lines of the expression matrix and uses a data row as the column names in place of the header row.
To reproduce
library(GEOquery)

accession <- 'GSE6691'
eset <- getGEO(accession, destdir = '/tmp')[[1]]

dim(eset)
# Features  Samples
#    21667       56

head(colnames(eset))
# [1] "5.82592"  "6.799907" "5.388776" "6.630125" "7.093243" "6.426071"

exprs(eset)[1:3, 1:3]
#              5.82592 6.799907 5.388776
# 201089_at   5.743985 5.438670 5.696575
# 201090_x_at 9.340066 9.029688 9.596852
# 201091_s_at 5.290541 5.463729 5.512379
The eset appears to start from this point in the file (line 706 of the series matrix), so the values of the "201088_at" row end up as the column names:
$ zcat GSE6691_series_matrix.txt.gz | sed -n '706,708p' | cut -f1-5
"201088_at"	5.82592	6.799907	5.388776	6.630125
"201089_at"	5.743985	5.43867	5.696575	6.014609
"201090_x_at"	9.340066	9.029688	9.596852	9.548305
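To check where the table really begins, one option is to look for the `!series_matrix_table_begin` and `!series_matrix_table_end` marker lines that GEO series matrix files use to delimit the expression table. The sketch below is only an illustration: it builds a tiny mock file (the contents and the variable names `f`, `begin`, `header_line`, and `n_rows` are assumptions, not taken from GSE6691) so it runs without a download. For the real case, point `f` at `GSE6691_series_matrix.txt.gz` and skip the mock-file step.

```shell
# Mock series matrix file (hypothetical contents) so the sketch is self-contained.
f=$(mktemp)
{
  printf '!Series_title\t"example"\n'
  printf '!series_matrix_table_begin\n'
  printf '"ID_REF"\t"GSM1"\t"GSM2"\n'
  printf '"1007_s_at"\t1.0\t2.0\n'
  printf '"1053_at"\t3.0\t4.0\n'
  printf '!series_matrix_table_end\n'
} | gzip > "$f"

# Line number of the marker that precedes the table.
begin=$(gzip -dc < "$f" | grep -n '^!series_matrix_table_begin' | cut -d: -f1)

# The header row (ID_REF plus sample IDs) is the line right after the marker.
header_line=$((begin + 1))

# Count data rows: lines between the two markers, minus the header row.
n_rows=$(( $(gzip -dc < "$f" \
  | awk '/^!series_matrix_table_begin/{t=1; next} /^!series_matrix_table_end/{t=0} t' \
  | tail -n +2 | wc -l) ))

echo "header at line $header_line, $n_rows data rows"
```

If `header_line` is well before the line where the eset seems to start, or `n_rows` is larger than the feature count reported by `dim(eset)`, rows are being dropped during parsing rather than being absent from the file.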
System info
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS:   /usr/lib/libopenblasp-r0.3.6.so
LAPACK: /usr/lib/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GEOquery_2.51.5     Biobase_2.44.0      BiocGenerics_0.30.0 nvimcom_0.9-82
[5] colorout_1.2-0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1         xml2_1.2.0         magrittr_1.5       hms_0.5.0          tidyselect_0.2.5
 [6] R6_2.4.0           rlang_0.4.0        dplyr_0.8.3        tools_3.6.1        txtplot_1.0-3
[11] assertthat_0.2.1   tibble_2.1.3       crayon_1.3.4       BiocManager_1.30.4 purrr_0.3.2
[16] readr_1.3.1        tidyr_0.8.3        vctrs_0.2.0        curl_4.0           zeallot_0.1.0
[21] glue_1.3.1         limma_3.40.4       compiler_3.6.1     pillar_1.4.2       backports_1.1.4
[26] pkgconfig_2.0.2
https://app.leanboard.io/board/23df0a6c-c0a8-4280-9428-6cf29005870d
Also fixed by #101:
> dim(eset)
Features  Samples
   22283       56