seandavi / GEOquery

The bridge between the NCBI Gene Expression Omnibus and Bioconductor
http://seandavi.github.io/GEOquery/

After current fixes with readr GEOquery fails to parse RNA-seq series #78

Closed vlakam closed 5 years ago

vlakam commented 5 years ago

Code

getGEO("GSE99709")

Output

getGEO("GSE99709")
Found 1 file(s)
GSE99709_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE99nnn/GSE99709/matrix/GSE99709_series_matrix.txt.gz'
Content type 'application/x-gzip' length 3389 bytes
downloaded 3389 bytes

Parsed with column specification:
cols()
Error in .subset2(x, i) : subscript out of bounds

vlakam commented 5 years ago

Invalid report. The master version actually parses GSE99709 fine.

vlakam commented 5 years ago

Actually, the error is reproducible with getGEO('GSE14308'), which is not an RNA-seq series.

seandavi commented 5 years ago

Thanks, @vlakam. I can reproduce the problem and it looks like readr is not behaving as I had expected. Report filed with readr.

vlakam commented 5 years ago

The current readr fixes make GSE14308 a bit more parseable, but there are still parse errors.

`> GEOquery::getGEO('GSE14308')
Found 1 file(s)
GSE14308_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE14nnn/GSE14308/matrix/GSE14308_series_matrix.txt.gz'
Content type 'application/x-gzip' length 1807565 bytes (1.7 MB)
downloaded 1.7 MB

Parsed with column specification:
cols(
  ID_REF = col_character(),
  GSM357839 = col_double(),
  GSM357841 = col_double(),
  GSM357842 = col_double(),
  GSM357843 = col_double(),
  GSM357844 = col_double(),
  GSM357845 = col_double(),
  GSM357847 = col_double(),
  GSM357848 = col_double(),
  GSM357849 = col_double(),
  GSM357850 = col_double(),
  GSM357852 = col_double(),
  GSM357853 = col_double()
)
File stored at: 
/tmp/RtmpfbOUDh/GPL1261.soft
Warning: 64 parsing failures.
  row     col           expected    actual         file
45038 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
45039 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
45040 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
45041 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
45042 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
..... ....... .................. ......... ............
See problems(...) for more details.

$GSE14308_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 45101 features, 12 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM357839 GSM357841 ... GSM357853 (12 total)
  varLabels: title geo_accession ... relation (33 total)
  varMetadata: labelDescription
featureData
  featureNames: 1415670_at 1415671_at ... AFFX-TrpnX-M_at (45101 total)
  fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
  pubMedIds: 19144320 
Annotation: GPL1261 `

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS: /usr/lib/libblas.so.3.8.0
LAPACK: /usr/lib/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=ru_RU.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=ru_RU.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=ru_RU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2.2      GEOquery_2.51.4     Biobase_2.42.0      BiocGenerics_0.28.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0       tidyr_0.8.2      crayon_1.3.4     dplyr_0.7.8      assertthat_0.2.0 R6_2.3.0         magrittr_1.5    
 [8] pillar_1.3.1     rlang_0.3.0.1    curl_3.2         rstudioapi_0.8   limma_3.38.2     xml2_1.2.0       tools_3.5.1     
[15] readr_1.3.0.9000 glue_1.3.0       purrr_0.2.5      hms_0.4.2        yaml_2.2.0       compiler_3.5.1   pkgconfig_2.0.2 
[22] tidyselect_0.2.5 bindr_0.1.1      tibble_1.4.2

vlakam commented 5 years ago

GSE53986 suffers from the same parse errors:

Warning: 64 parsing failures.
  row     col           expected    actual         file
45038 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
45039 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
45040 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
45041 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
45042 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
..... ....... .................. ......... ............
See problems(...) for more details.

seandavi commented 5 years ago

These are warnings: annoying, but harmless. Fixing them would slow parsing significantly if I stick with readr. I am contemplating switching to fread (from data.table), which may be a better fit, but that won't happen immediately.
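
The `expected 1/0/T/F/TRUE/FALSE` text in the warning indicates that readr treated `SPOT_ID` as a logical column, which can happen when the rows it inspects for type guessing are all empty and a value such as `--Control` only appears much later. The sketch below reproduces that situation on a tiny, hypothetical annotation table (the file contents and the `guess_max = 2` setting are assumptions chosen for illustration, not GEOquery's actual code) and shows that declaring the column type up front avoids the guess. Whether the first read emits parsing failures depends on the installed readr version.

```r
library(readr)

# Hypothetical miniature of a platform annotation table: SPOT_ID is blank
# for ordinary probes and only filled in for a control probe at the end.
path <- tempfile(fileext = ".tsv")
writeLines(c(
  "ID\tSPOT_ID",
  "1415670_at\t",
  "1415671_at\t",
  "1415672_at\t",
  "AFFX-TrpnX-M_at\t--Control"
), path)

# Letting readr guess from only the first rows may type SPOT_ID as logical;
# depending on the readr version this can surface as parsing failures.
guessed <- suppressWarnings(read_tsv(path, guess_max = 2))

# Declaring the column types up front removes the guess entirely.
explicit <- read_tsv(
  path,
  col_types = cols(ID = col_character(), SPOT_ID = col_character())
)

print(explicit$SPOT_ID)
```

Raising `guess_max` is the other obvious option, but it makes readr inspect more rows before parsing, which fits the concern above about slower parsing.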

RichardJActon commented 4 years ago

I'm getting this error again with R version 3.5.2, GEOquery_2.51.5 and readr_1.3.1

getGEO("GSE80672") 
...
#> Parsed with column specification: 
#> cols()
#> Error in .subset2(x, i) : subscript out of bounds

rcastelo commented 4 years ago

Hi, I find the same problem with another GSE (GSE92506), both in release (3.6.2) and devel:

library(GEOquery)

getGEO("GSE92506")
Found 1 file(s)
GSE92506_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE92nnn/GSE92506/matrix/GSE92506_series_matrix.txt.gz'
Content type 'application/x-gzip' length 3140 bytes
==================================================
downloaded 3140 bytes

Parsed with column specification:
cols()
Error: Can't subset columns that don't exist.
✖ The location 1 doesn't exist.
ℹ There are only 0 columns.
Run `rlang::last_error()` to see where the error occurred.

> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] GEOquery_2.54.1     Biobase_2.46.0      BiocGenerics_0.32.0
[4] colorout_1.2-2     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4       xml2_1.2.5       magrittr_1.5     hms_0.5.3       
 [5] tidyselect_1.0.0 R6_2.4.1         rlang_0.4.5      fansi_0.4.1     
 [9] dplyr_0.8.5      tools_3.6.2      cli_2.0.2        ellipsis_0.3.0  
[13] assertthat_0.2.1 tibble_3.0.0     lifecycle_0.2.0  crayon_1.3.4    
[17] purrr_0.3.3      readr_1.3.1      tidyr_1.0.2      vctrs_0.2.4     
[21] curl_4.3         glue_1.3.2       limma_3.42.2     stringi_1.4.6   
[25] compiler_3.6.2   pillar_1.4.3     pkgconfig_2.0.3 
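
Both error messages reported in this thread (`subscript out of bounds` and `Can't subset columns that don't exist`) appear right after an empty `cols()` specification, which is consistent with code indexing the first column of a table that has no columns. The thread does not say where in GEOquery this happens, so the following is only a minimal base-R sketch of that failure mode. `has_columns()` is a hypothetical guard introduced for illustration and is not part of GEOquery.

```r
# A table with zero columns, standing in for the result of parsing a
# series matrix whose column specification came back as an empty cols().
empty <- data.frame()

# Indexing the first column of such a table signals an error.
err <- tryCatch(empty[[1]], error = function(e) e)
print(conditionMessage(err))

# Hypothetical guard (not part of GEOquery): test before indexing.
has_columns <- function(tbl) ncol(tbl) > 0

first_column <- if (has_columns(empty)) empty[[1]] else NULL
print(is.null(first_column))
```

A check of this kind would turn the crash into a condition the caller can handle, for example by reporting that the series matrix carries no expression table.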

ibrahimishag commented 3 years ago

I am experiencing the exact same issue (i.e., Warning: 64 parsing failures).

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=Korean_Korea.949  LC_CTYPE=Korean_Korea.949    LC_MONETARY=Korean_Korea.949
[4] LC_NUMERIC=C                 LC_TIME=Korean_Korea.949

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] limma_3.46.0        GEOquery_2.58.0     Biobase_2.50.0      BiocGenerics_0.36.0

loaded via a namespace (and not attached):
 [1] rstudioapi_0.13  xml2_1.3.2       magrittr_2.0.1   hms_0.5.3        tidyselect_1.1.0 R6_2.5.0
 [7] rlang_0.4.9      fansi_0.4.1      dplyr_1.0.2      tools_4.0.3      xfun_0.19        tinytex_0.28
[13] cli_2.2.0        ellipsis_0.3.1   assertthat_0.2.1 yaml_2.2.1       tibble_3.0.4     lifecycle_0.2.0
[19] crayon_1.3.4     purrr_0.3.4      readr_1.4.0      tidyr_1.1.2      vctrs_0.3.6      curl_4.3
[25] glue_1.4.2       compiler_4.0.3   pillar_1.4.7     generics_0.1.0   pkgconfig_2.0.3

seandavi commented 3 years ago

These are (annoying) warnings that you can ignore, @ibrahimishag.
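
For anyone who prefers to silence only these warnings instead of reading past them, one possible approach is a calling handler that muffles warnings whose message mentions parsing failures and lets every other warning through. This is a sketch under assumptions: `noisy_parse()` is a hypothetical stand-in that merely mimics the warning text seen above (it does not call GEOquery), and `muffle_parse_warnings()` is likewise an illustrative helper.

```r
# Hypothetical stand-in for a call such as getGEO(); it only mimics the
# warning text and returns a placeholder value.
noisy_parse <- function() {
  warning("64 parsing failures.")
  "parsed result"
}

# Illustrative helper: muffle warnings that mention parsing failures and
# leave all other warnings untouched.
muffle_parse_warnings <- function(expr) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      if (grepl("parsing failures", conditionMessage(w), fixed = TRUE)) {
        invokeRestart("muffleWarning")
      }
    }
  )
}

result <- muffle_parse_warnings(noisy_parse())
print(result)
```

Matching on the message text is brittle, since the wording may change between readr versions, so a blanket `suppressWarnings()` may be the simpler choice when no other warnings matter.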