seandavi / GEOquery

The bridge between the NCBI Gene Expression Omnibus and Bioconductor
http://seandavi.github.io/GEOquery/

getGEO() incorrectly parses GSE6691_series_matrix.txt.gz #94

Closed khughitt closed 3 years ago

khughitt commented 5 years ago

Overview

When using GEOquery::getGEO() to download a particular dataset (GSE6691), the resulting eset skips the first ~700 lines of the expression matrix and uses a data row as the header row, so the column names are expression values rather than sample IDs.

To reproduce

library(GEOquery)
accession <- 'GSE6691'
eset <- getGEO(accession, destdir = '/tmp')[[1]]

dim(eset)                                                                                  
# Features  Samples                                                                            
#   21667       56                                                                            

head(colnames(eset))                                                                       
# [1] "5.82592"  "6.799907" "5.388776" "6.630125" "7.093243" "6.426071" 

exprs(eset)[1:3, 1:3]
#             5.82592 6.799907 5.388776    
# 201089_at   5.743985 5.438670 5.696575    
# 201090_x_at 9.340066 9.029688 9.596852    
# 201091_s_at 5.290541 5.463729 5.512379    
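
A quick way to flag this failure mode is to check whether the column names look like GEO sample accessions. The sketch below is only illustrative: `all_gsm_names` is a hypothetical helper, not part of GEOquery, and the `GSM...` values in the example are made up to show the expected shape.

```r
# Hypothetical helper (not part of GEOquery): TRUE when every name has the
# form "GSM" followed by digits, which is what sample columns normally carry.
all_gsm_names <- function(x) {
  length(x) > 0 && all(grepl("^GSM[0-9]+$", x))
}

# Column names copied from the broken result above.
broken <- c("5.82592", "6.799907", "5.388776")

# Made-up accessions, used only to illustrate the expected shape.
expected_shape <- c("GSM0000001", "GSM0000002")

all_gsm_names(broken)          # FALSE
all_gsm_names(expected_shape)  # TRUE
```

With a real object, the same check would be `all_gsm_names(colnames(eset))`.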

The place in the file where the matrix appears to start from:

$ zcat GSE6691_series_matrix.txt.gz | sed -n '706,708p' | cut -f1-5
"201088_at"     5.82592     6.799907    5.388776    6.630125
"201089_at"     5.743985    5.43867     5.696575    6.014609
"201090_x_at"   9.340066    9.029688    9.596852    9.548305
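
To cross-check the dimensions independently of getGEO(), the table can be read directly from the file. This is a minimal sketch, assuming the expression table is delimited by `!series_matrix_table_begin` and `!series_matrix_table_end` marker lines; it is not how GEOquery itself parses the file. `read_series_matrix_table` is a hypothetical helper, and the demo file contents are made up so the example runs without a download.

```r
# Hypothetical helper (not part of GEOquery): read the expression table from a
# series matrix file by locating the assumed begin/end marker lines.
read_series_matrix_table <- function(path) {
  con <- gzfile(path)            # gzfile() also reads uncompressed files
  on.exit(close(con))
  lines <- readLines(con)

  begin <- grep("^!series_matrix_table_begin", lines)
  end   <- grep("^!series_matrix_table_end", lines)
  stopifnot(length(begin) == 1, length(end) == 1, end > begin + 1)

  # The first line after the begin marker is the header (ID_REF plus samples).
  tbl <- read.delim(text = lines[(begin + 1):(end - 1)],
                    row.names = 1, check.names = FALSE)
  as.matrix(tbl)
}

# Tiny made-up file in the same layout, so the sketch is self-contained.
demo <- tempfile(fileext = ".txt.gz")
out <- gzfile(demo, "w")
writeLines(c(
  '!Series_title\t"made-up example"',
  '!series_matrix_table_begin',
  '"ID_REF"\t"GSM0000001"\t"GSM0000002"',
  '"1007_s_at"\t1.5\t2.5',
  '"1053_at"\t3.5\t4.5',
  '!series_matrix_table_end'
), out)
close(out)

m <- read_series_matrix_table(demo)
dim(m)       # 2 rows, 2 columns
colnames(m)  # "GSM0000001" "GSM0000002"
```

Running the helper on the downloaded GSE6691_series_matrix.txt.gz and comparing `dim(m)` with `dim(eset)` would show whether rows were dropped during parsing.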

System info

> sessionInfo()                                                                                                                                                                                
R version 3.6.1 (2019-07-05)                                                                                                                                                                   
Platform: x86_64-pc-linux-gnu (64-bit)                                                                                                                                                         
Running under: Arch Linux                                                                                                                                                                      

Matrix products: default                                                                                                                                                                       
BLAS:   /usr/lib/libopenblasp-r0.3.6.so                                                                                                                                                        
LAPACK: /usr/lib/liblapack.so.3.8.0                                                                                                                                                            

locale:                                                                                                                                                                                        
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8                                                                                                                 
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8                                                                                                             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C                                                                                                                        
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C                                                                                                                 

attached base packages:                                                                                                                                                                        
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base                                                                                                                 

other attached packages:                                                                                                                                                                       
[1] GEOquery_2.51.5     Biobase_2.44.0      BiocGenerics_0.30.0 nvimcom_0.9-82                                                                                                                 
[5] colorout_1.2-0                                                                                                                                                                             

loaded via a namespace (and not attached):                                                                                                                                                     
 [1] Rcpp_1.0.1         xml2_1.2.0         magrittr_1.5       hms_0.5.0          tidyselect_0.2.5                                                                                              
 [6] R6_2.4.0           rlang_0.4.0        dplyr_0.8.3        tools_3.6.1        txtplot_1.0-3                                                                                                 
[11] assertthat_0.2.1   tibble_2.1.3       crayon_1.3.4       BiocManager_1.30.4 purrr_0.3.2                                                                                                   
[16] readr_1.3.1        tidyr_0.8.3        vctrs_0.2.0        curl_4.0           zeallot_0.1.0                                                                                                 
[21] glue_1.3.1         limma_3.40.4       compiler_3.6.1     pillar_1.4.2       backports_1.1.4                                                                                               
[26] pkgconfig_2.0.2        


assaron commented 4 years ago

Also fixed by #101:

> dim(eset) 
Features  Samples 
   22283       56