vlakam closed this issue 5 years ago
Invalid. The master version actually parses GSE99709 fine.
Actually, the error is reproducible with
getGEO('GSE14308')
which is not an RNA-seq series.
Thanks, @vlakam. I can reproduce the problem and it looks like readr is not behaving as I had expected. Report filed with readr.
The current readr fixes make GSE14308 a bit more parseable, but there are parse errors.
`> GEOquery::getGEO('GSE14308')
Found 1 file(s)
GSE14308_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE14nnn/GSE14308/matrix/GSE14308_series_matrix.txt.gz'
Content type 'application/x-gzip' length 1807565 bytes (1.7 MB)
downloaded 1.7 MB
Parsed with column specification:
cols(
ID_REF = col_character(),
GSM357839 = col_double(),
GSM357841 = col_double(),
GSM357842 = col_double(),
GSM357843 = col_double(),
GSM357844 = col_double(),
GSM357845 = col_double(),
GSM357847 = col_double(),
GSM357848 = col_double(),
GSM357849 = col_double(),
GSM357850 = col_double(),
GSM357852 = col_double(),
GSM357853 = col_double()
)
File stored at:
/tmp/RtmpfbOUDh/GPL1261.soft
Warning: 64 parsing failures.
row col expected actual file
45038 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
45039 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
45040 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
45041 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
45042 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
..... ....... .................. ......... ............
See problems(...) for more details.
$GSE14308_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 45101 features, 12 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM357839 GSM357841 ... GSM357853 (12 total)
varLabels: title geo_accession ... relation (33 total)
varMetadata: labelDescription
featureData
featureNames: 1415670_at 1415671_at ... AFFX-TrpnX-M_at (45101 total)
fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 19144320
Annotation: GPL1261 `
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux
Matrix products: default
BLAS: /usr/lib/libblas.so.3.8.0
LAPACK: /usr/lib/liblapack.so.3.8.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=ru_RU.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=ru_RU.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=ru_RU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 GEOquery_2.51.4 Biobase_2.42.0 BiocGenerics_0.28.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 tidyr_0.8.2 crayon_1.3.4 dplyr_0.7.8 assertthat_0.2.0 R6_2.3.0 magrittr_1.5
[8] pillar_1.3.1 rlang_0.3.0.1 curl_3.2 rstudioapi_0.8 limma_3.38.2 xml2_1.2.0 tools_3.5.1
[15] readr_1.3.0.9000 glue_1.3.0 purrr_0.2.5 hms_0.4.2 yaml_2.2.0 compiler_3.5.1 pkgconfig_2.0.2
[22] tidyselect_0.2.5 bindr_0.1.1 tibble_1.4.2
GSE53986 is suffering from parse errors too
Warning: 64 parsing failures.
row col expected actual file
45038 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
45039 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
45040 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
45041 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
45042 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
..... ....... .................. ......... ............
See problems(...) for more details.
These are warnings: annoying, but harmless. Fixing them would slow parsing significantly if I stick with readr. I am contemplating switching over to fread, which may be a better fit, but that won't happen immediately.
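For anyone wondering where these warnings come from, here is a minimal sketch on synthetic data, not the real GPL1261 annotation table. It assumes that `SPOT_ID` is empty for the regular probes and only filled in for the trailing control probes. That is a guess, but it matches the `1/0/T/F/TRUE/FALSE` expectation in the warning. readr guesses column types from the first 1000 rows by default, so a column that starts out empty is typically read as logical. The file contents, probe names and row counts below are made up for illustration.

```r
library(readr)

# Synthetic stand-in for a platform annotation table (hypothetical data):
# SPOT_ID is empty for the first 1200 probes and set only for control probes.
n_regular <- 1200
n_control <- 5
lines <- c(
  "ID\tSPOT_ID",
  paste0("probe", seq_len(n_regular), "\t"),
  paste0("ctrl", seq_len(n_control), "\t--Control")
)
path <- tempfile(fileext = ".txt")
writeLines(lines, path)

# Default type guessing only inspects the first 1000 rows, so SPOT_ID is
# typically guessed as logical and the late "--Control" values are reported
# as parsing failures (the exact count may differ between readr versions).
guessed <- suppressMessages(suppressWarnings(read_tsv(path)))
n_problems <- nrow(problems(guessed))

# Declaring the column types up front skips the guess, so the control
# values are kept as text and no failures are reported for them.
explicit <- read_tsv(
  path,
  col_types = cols(ID = col_character(), SPOT_ID = col_character())
)
tail(explicit, 3)
```

Passing `guess_max = Inf` instead of explicit column types should also remove the failures, but it makes readr scan the whole file before parsing. That is presumably the slowdown mentioned above.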
I'm getting this error again with R version 3.5.2, GEOquery_2.51.5 and readr_1.3.1
getGEO("GSE80672")
...
#> Parsed with column specification:
#> cols()
#> Error in .subset2(x, i) : subscript out of bounds
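The empty `cols()` specification suggests that the parser found no data columns at all. One plausible reading, not confirmed in this thread, is that the series matrix file has no expression rows between its `!series_matrix_table_begin` and `!series_matrix_table_end` markers. The sketch below checks for that case before parsing. `count_matrix_rows` is a hypothetical helper written for this example, not a GEOquery function, and the two demo files are synthetic.

```r
# Hypothetical helper (not part of GEOquery): count the data rows between
# the table markers of a series matrix file. gzfile() reads both gzipped
# and plain-text files.
count_matrix_rows <- function(path) {
  con <- gzfile(path, "rt")
  on.exit(close(con))
  lines <- readLines(con)
  begin <- which(lines == "!series_matrix_table_begin")
  end <- which(lines == "!series_matrix_table_end")
  if (length(begin) != 1 || length(end) != 1 || end <= begin) {
    return(NA_integer_)  # markers missing or malformed
  }
  # Lines between the markers, minus the header line (ID_REF, GSM...).
  max(end - begin - 2, 0)
}

# Write a small gzipped demo file and return its path.
write_gz <- function(lines) {
  path <- tempfile(fileext = ".txt.gz")
  con <- gzfile(path, "wt")
  writeLines(lines, con)
  close(con)
  path
}

# Synthetic series matrix with a header but no expression rows.
empty_path <- write_gz(c(
  "!Series_title\t\"demo\"",
  "!series_matrix_table_begin",
  "\"ID_REF\"\t\"GSM1\"\t\"GSM2\"",
  "!series_matrix_table_end"
))

# Same layout with one expression row.
full_path <- write_gz(c(
  "!Series_title\t\"demo\"",
  "!series_matrix_table_begin",
  "\"ID_REF\"\t\"GSM1\"\t\"GSM2\"",
  "\"probe1\"\t1.5\t2.5",
  "!series_matrix_table_end"
))

count_matrix_rows(empty_path)
count_matrix_rows(full_path)
```

Running the helper on the downloaded `*_series_matrix.txt.gz` file would show whether the error coincides with an empty table.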
Hi, I see the same problem with another GSE (GSE92506), in both release (3.6.2) and devel:
library(GEOquery)
getGEO("GSE92506")
Found 1 file(s)
GSE92506_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE92nnn/GSE92506/matrix/GSE92506_series_matrix.txt.gz'
Content type 'application/x-gzip' length 3140 bytes
==================================================
downloaded 3140 bytes
Parsed with column specification:
cols()
Error: Can't subset columns that don't exist.
✖ The location 1 doesn't exist.
ℹ There are only 0 columns.
Run `rlang::last_error()` to see where the error occurred.
sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] GEOquery_2.54.1 Biobase_2.46.0 BiocGenerics_0.32.0
[4] colorout_1.2-2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4 xml2_1.2.5 magrittr_1.5 hms_0.5.3
[5] tidyselect_1.0.0 R6_2.4.1 rlang_0.4.5 fansi_0.4.1
[9] dplyr_0.8.5 tools_3.6.2 cli_2.0.2 ellipsis_0.3.0
[13] assertthat_0.2.1 tibble_3.0.0 lifecycle_0.2.0 crayon_1.3.4
[17] purrr_0.3.3 readr_1.3.1 tidyr_1.0.2 vctrs_0.2.4
[21] curl_4.3 glue_1.3.2 limma_3.42.2 stringi_1.4.6
[25] compiler_3.6.2 pillar_1.4.3 pkgconfig_2.0.3
I am experiencing the exact same issue (i.e. Warning: 64 parsing failures).
`> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=Korean_Korea.949 LC_CTYPE=Korean_Korea.949 LC_MONETARY=Korean_Korea.949
[4] LC_NUMERIC=C LC_TIME=Korean_Korea.949
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] limma_3.46.0 GEOquery_2.58.0 Biobase_2.50.0 BiocGenerics_0.36.0
loaded via a namespace (and not attached):
[1] rstudioapi_0.13 xml2_1.3.2 magrittr_2.0.1 hms_0.5.3 tidyselect_1.1.0 R6_2.5.0
[7] rlang_0.4.9 fansi_0.4.1 dplyr_1.0.2 tools_4.0.3 xfun_0.19 tinytex_0.28
[13] cli_2.2.0 ellipsis_0.3.1 assertthat_0.2.1 yaml_2.2.1 tibble_3.0.4 lifecycle_0.2.0
[19] crayon_1.3.4 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2 vctrs_0.3.6 curl_4.3
[25] glue_1.4.2 compiler_4.0.3 pillar_1.4.7 generics_0.1.0 pkgconfig_2.0.3
`
These are (annoying) warnings that you can ignore, @ibrahimishag.
Code
getGEO("GSE99709")
Output
getGEO("GSE99709")
Found 1 file(s)
GSE99709_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE99nnn/GSE99709/matrix/GSE99709_series_matrix.txt.gz'
Content type 'application/x-gzip' length 3389 bytes
downloaded 3389 bytes
Parsed with column specification:
cols()
Error in .subset2(x, i) : subscript out of bounds