seandavi / GEOquery

The bridge between the NCBI Gene Expression Omnibus and Bioconductor
http://seandavi.github.io/GEOquery/

parsing failed--expected only one '!series_data_table_begin' #110

Closed zh-zhang1984 closed 3 years ago

zh-zhang1984 commented 3 years ago

The following code used to work, but recently it has been failing consistently with the error below, and I can no longer download these files:

> GSE131761 <- getGEO(
+   "GSE131761",
+   destdir = '/Users/zhang/Documents/2021/singleCellRNA/Data',
+   AnnotGPL = T,GSEMatrix = T)
Found 1 file(s)
GSE131761_series_matrix.txt.gz
Using locally cached version: /Users/zhang/Documents/2021/singleCellRNA/Data/GSE131761_series_matrix.txt.gz
Error in parseGSEMatrix(destfile, destdir = destdir, AnnotGPL = AnnotGPL,  : 
  parsing failed--expected only one '!series_data_table_begin'

> GSE139913 <- getGEO("GSE139913",
+                   destdir = '/Users/zhang/Documents/2020/GEOsepsis/Data',
+                   AnnotGPL = T,GSEMatrix = T)
Found 1 file(s)
GSE139913_series_matrix.txt.gz
Using locally cached version: /Users/zhang/Documents/2020/GEOsepsis/Data/GSE139913_series_matrix.txt.gz
Error in parseGSEMatrix(destfile, destdir = destdir, AnnotGPL = AnnotGPL,  : 
  parsing failed--expected only one '!series_data_table_begin'

seandavi commented 3 years ago

Hi, @zh-zhang1984. I'm not able to reproduce the error. Can you double-check that you are using the latest version of R and GEOquery? If so, can you provide the output of sessionInfo() after loading GEOquery?

quanquan92 commented 3 years ago

Hi, I have the same problem.

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale: [1] LC_COLLATE=Chinese (Simplified)_China.936 [2] LC_CTYPE=Chinese (Simplified)_China.936
[3] LC_MONETARY=Chinese (Simplified)_China.936 [4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.936

attached base packages: [1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages: [1] GEOquery_2.60.0 Biobase_2.52.0 BiocGenerics_0.38.0

loaded via a namespace (and not attached): [1] xml2_1.3.2 magrittr_2.0.1 hms_1.1.0
[4] bit_4.0.4 tidyselect_1.1.1 R6_2.5.0
[7] rlang_0.4.11 fansi_0.5.0 dplyr_1.0.7
[10] tools_4.1.0 vroom_1.5.4 utf8_1.2.2
[13] ellipsis_0.3.2 bit64_4.0.5 tibble_3.1.3
[16] lifecycle_1.0.0 crayon_1.4.1 BiocManager_1.30.16
[19] purrr_0.3.4 readr_2.0.0 tzdb_0.1.2
[22] tidyr_1.1.3 vctrs_0.3.8 curl_4.3.2
[25] glue_1.4.2 limma_3.48.1 compiler_4.1.0
[28] pillar_1.6.2 generics_0.1.0 pkgconfig_2.0.3

Do you know why? Thanks.

seandavi commented 3 years ago

My suspicion is that the files that you have on your computer (notice that GEOquery is using a cached version of the file) are corrupted. Can you remove the files:

and try one more time? Sorry for the inconvenience.
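
A minimal sketch of clearing a cached series matrix before retrying, assuming the cache file follows the `<GSE>_series_matrix.txt.gz` naming seen in the log output earlier in this thread. The helper name `clear_cached_matrix` and the optional `-GPLnnn` suffix handling are illustrative assumptions, not part of GEOquery.

```r
# Hypothetical helper (not part of GEOquery): delete cached series matrix
# files for one accession so the next getGEO() call downloads a fresh copy.
clear_cached_matrix <- function(gse, destdir = tempdir()) {
  # Assumed naming, based on the log output: GSE131761_series_matrix.txt.gz,
  # optionally with a platform suffix such as GSE1234-GPL96_series_matrix.txt.gz.
  pattern <- paste0("^", gse, "(-GPL[0-9]+)?_series_matrix\\.txt\\.gz$")
  cached <- list.files(destdir, pattern = pattern, full.names = TRUE)
  removed <- file.remove(cached)
  # Return the paths that were actually deleted, without printing them.
  invisible(cached[removed])
}

# Usage (needs GEOquery and network access, so it is left commented out):
# data_dir <- "/Users/zhang/Documents/2021/singleCellRNA/Data"
# clear_cached_matrix("GSE131761", destdir = data_dir)
# gse <- GEOquery::getGEO("GSE131761", destdir = data_dir, GSEMatrix = TRUE)
```

The pattern is anchored at both ends, so clearing one accession should not touch the cached files of another accession that merely shares a prefix.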

quanquan92 commented 3 years ago

I googled this problem and found the same suspicion. Someone else removed the files and the download then worked, but it failed for me. I cleared the folder and downloaded the file again, but it still failed:

Found 1 file(s)
GSE131761_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE131nnn/GSE131761/matrix/GSE131761_series_matrix.txt.gz'
Content type 'application/x-gzip' length 31912319 bytes (30.4 MB)
downloaded 30.4 MB

Error in parseGSEMatrix(destfile, destdir = destdir, AnnotGPL = AnnotGPL, : parsing failed--expected only one '!series_data_table_begin'

Do you know why?
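
One way to check whether the downloaded file itself is at fault is to count the table-begin marker lines directly, without going through GEOquery. This is only a sketch: the helper name `count_table_begin` is hypothetical, and the default marker `!series_matrix_table_begin` is an assumption about the GEO series matrix format (the error message names `!series_data_table_begin`, so the marker is a parameter and either spelling can be tried).

```r
# Hypothetical diagnostic (not part of GEOquery): count how many lines of a
# gzipped series matrix file start with the table-begin marker.
count_table_begin <- function(path, marker = "!series_matrix_table_begin") {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  lines <- readLines(con, warn = FALSE)
  # startsWith() is vectorised over lines; sum() counts the TRUE values.
  sum(startsWith(lines, marker))
}

# Usage (the path is the cached file named in the getGEO() log output):
# count_table_begin(file.path(tempdir(), "GSE131761_series_matrix.txt.gz"))
```

A count other than 1 in the raw file would point at the download; a count of exactly 1 would suggest that the problem arises later, while the file is being read and parsed.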

quanquan92 commented 3 years ago

I tried downloading another GSE dataset, but it still failed. Thanks.

smallwerke commented 3 years ago

I am having the same issue. I ran sessionInfo(), loaded the library, and ran the getGEO() call (after previously receiving errors about VROOM_CONNECTION_SIZE):

library(GEOquery)
Sys.setenv(VROOM_CONNECTION_SIZE=100000)
gset <- getGEO('GSE2990', GSEMatrix = TRUE, getGPL = FALSE)

Error message:

Found 1 file(s)
GSE2990_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE2nnn/GSE2990/matrix/GSE2990_series_matrix.txt.gz'
Content type 'application/x-gzip' length 16680570 bytes (15.9 MB)
downloaded 15.9 MB

Error in parseGSEMatrix(destfile, destdir = destdir, AnnotGPL = AnnotGPL, : parsing failed--expected only one '!series_data_table_begin'

Full output:

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages: [1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached): [1] locfit_1.5-9.4 Rcpp_1.0.7 lattice_0.20-44 png_0.1-7
[5] Biostrings_2.60.2 assertthat_0.2.1 utf8_1.2.2 R6_2.5.1
[9] GenomeInfoDb_1.28.1 stats4_4.1.0 RSQLite_2.2.7 httr_1.4.2
[13] ggplot2_3.3.5 pillar_1.6.2 zlibbioc_1.38.0 rlang_0.4.11
[17] rstudioapi_0.13 annotate_1.70.0 blob_1.2.2 S4Vectors_0.30.0
[21] Matrix_1.3-4 splines_4.1.0 BiocParallel_1.26.1 geneplotter_1.70.0
[25] RCurl_1.98-1.4 bit_4.0.4 munsell_0.5.0 DelayedArray_0.18.0
[29] compiler_4.1.0 pkgconfig_2.0.3 BiocGenerics_0.38.0 tidyselect_1.1.1
[33] KEGGREST_1.32.0 SummarizedExperiment_1.22.0 tibble_3.1.3 GenomeInfoDbData_1.2.6
[37] IRanges_2.26.0 matrixStats_0.60.0 XML_3.99-0.7 fansi_0.5.0
[41] crayon_1.4.1 dplyr_1.0.7 bitops_1.0-7 grid_4.1.0
[45] xtable_1.8-4 gtable_0.3.0 lifecycle_1.0.0 DBI_1.1.1
[49] magrittr_2.0.1 scales_1.1.1 cachem_1.0.5 XVector_0.32.0
[53] genefilter_1.74.0 ellipsis_0.3.2 vctrs_0.3.8 generics_0.1.0
[57] RColorBrewer_1.1-2 tools_4.1.0 bit64_4.0.5 Biobase_2.52.0
[61] glue_1.4.2 DESeq2_1.32.0 purrr_0.3.4 MatrixGenerics_1.4.2
[65] parallel_4.1.0 fastmap_1.1.0 survival_3.2-11 AnnotationDbi_1.54.1
[69] colorspace_2.0-2 BiocManager_1.30.16 GenomicRanges_1.44.0 memoise_2.0.0

library(GEOquery)
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply,
parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval,
evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order,
paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which.max, which.min

Welcome to Bioconductor

Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.

Setting options('download.file.method.GEOquery'='auto')
Setting options('GEOquery.inmemory.gpl'=FALSE)

Sys.setenv(VROOM_CONNECTION_SIZE=100000)
gset <- getGEO('GSE2990', GSEMatrix = TRUE, getGPL = FALSE)
Found 1 file(s)
GSE2990_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE2nnn/GSE2990/matrix/GSE2990_series_matrix.txt.gz'
Content type 'application/x-gzip' length 16680570 bytes (15.9 MB)
downloaded 15.9 MB

Error in parseGSEMatrix(destfile, destdir = destdir, AnnotGPL = AnnotGPL, : parsing failed--expected only one '!series_data_table_begin'

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages: [1] parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] GEOquery_2.60.0 Biobase_2.52.0 BiocGenerics_0.38.0

loaded via a namespace (and not attached): [1] locfit_1.5-9.4 Rcpp_1.0.7 lattice_0.20-44 tidyr_1.1.3
[5] png_0.1-7 Biostrings_2.60.2 assertthat_0.2.1 utf8_1.2.2
[9] R6_2.5.1 GenomeInfoDb_1.28.1 stats4_4.1.0 RSQLite_2.2.7
[13] httr_1.4.2 ggplot2_3.3.5 pillar_1.6.2 zlibbioc_1.38.0
[17] rlang_0.4.11 curl_4.3.2 rstudioapi_0.13 annotate_1.70.0
[21] blob_1.2.2 S4Vectors_0.30.0 Matrix_1.3-4 splines_4.1.0
[25] BiocParallel_1.26.1 readr_2.0.1 geneplotter_1.70.0 RCurl_1.98-1.4
[29] bit_4.0.4 munsell_0.5.0 DelayedArray_0.18.0 compiler_4.1.0
[33] pkgconfig_2.0.3 tidyselect_1.1.1 KEGGREST_1.32.0 SummarizedExperiment_1.22.0
[37] tibble_3.1.3 GenomeInfoDbData_1.2.6 IRanges_2.26.0 matrixStats_0.60.0
[41] XML_3.99-0.7 fansi_0.5.0 tzdb_0.1.2 crayon_1.4.1
[45] dplyr_1.0.7 bitops_1.0-7 grid_4.1.0 xtable_1.8-4
[49] gtable_0.3.0 lifecycle_1.0.0 DBI_1.1.1 magrittr_2.0.1
[53] scales_1.1.1 vroom_1.5.4 cachem_1.0.5 XVector_0.32.0
[57] genefilter_1.74.0 limma_3.48.3 xml2_1.3.2 ellipsis_0.3.2
[61] vctrs_0.3.8 generics_0.1.0 RColorBrewer_1.1-2 tools_4.1.0
[65] bit64_4.0.5 glue_1.4.2 DESeq2_1.32.0 purrr_0.3.4
[69] hms_1.1.0 MatrixGenerics_1.4.2 fastmap_1.1.0 survival_3.2-11
[73] AnnotationDbi_1.54.1 colorspace_2.0-2 BiocManager_1.30.16 GenomicRanges_1.44.0
[77] memoise_2.0.0

smallwerke commented 3 years ago

I also got the same error as zh-zhang1984:

getGEO("GSE131761", AnnotGPL = T,GSEMatrix = T)
Found 1 file(s)
GSE131761_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE131nnn/GSE131761/matrix/GSE131761_series_matrix.txt.gz'
Content type 'application/x-gzip' length 31912319 bytes (30.4 MB)
downloaded 30.4 MB

Error in parseGSEMatrix(destfile, destdir = destdir, AnnotGPL = AnnotGPL, : parsing failed--expected only one '!series_data_table_begin'

JoseMariHA commented 3 years ago

I ran into the same error after first getting the "Error: The size of the connection buffer (262144) was not large enough" error. I found that doubling VROOM_CONNECTION_SIZE each time that error appears, instead of setting it to a very large number, avoids the "Error in parseGSEMatrix" error. The exact command is the following:

Sys.setenv("VROOM_CONNECTION_SIZE" = 262144 * 2)

I hope this helps you if it is not too late. Best regards.
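
The doubling approach described above could be wrapped in a small retry loop. This is a sketch under assumptions: `retry_with_doubling` is a hypothetical helper, it takes the download step as a function argument, and it assumes the buffer-size failure can be recognised by the phrase "connection buffer" in the error message quoted above. Any other error is re-raised immediately.

```r
# Hypothetical helper (not part of GEOquery or vroom): call fetch() and, if it
# fails with the connection buffer error, double VROOM_CONNECTION_SIZE and retry.
retry_with_doubling <- function(fetch, start = 262144, max_tries = 5) {
  size <- start
  last_error <- NULL
  for (attempt in seq_len(max_tries)) {
    # format() avoids scientific notation such as "1e+05" in the variable.
    Sys.setenv(VROOM_CONNECTION_SIZE = format(size, scientific = FALSE))
    result <- tryCatch(fetch(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    # Only the buffer-size error triggers a retry; anything else is re-raised.
    if (!grepl("connection buffer", conditionMessage(result), fixed = TRUE)) {
      stop(result)
    }
    last_error <- result
    size <- size * 2
  }
  stop(last_error)
}

# Usage (needs GEOquery and network access, so it is left commented out):
# gset <- retry_with_doubling(function() {
#   GEOquery::getGEO("GSE2990", GSEMatrix = TRUE, getGPL = FALSE)
# })
```

Starting from 262144 and doubling keeps the buffer as small as possible, which matches the observation that a modest increase worked where a very large value did not.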

seandavi commented 3 years ago

Should be fixed in b81fe0aa.