seandavi / GEOquery

The bridge between the NCBI Gene Expression Omnibus and Bioconductor
http://seandavi.github.io/GEOquery/

GSE27957 parsing failure #93

Closed bjmt closed 3 years ago

bjmt commented 5 years ago
GEOquery::getGEO("GSE27957")
#> Setting options('download.file.method.GEOquery'='auto')
#> Setting options('GEOquery.inmemory.gpl'=FALSE)
#> Found 1 file(s)
#> GSE27957_series_matrix.txt.gz
#> Parsed with column specification:
#> cols()
#> Error in .subset2(x, i): subscript out of bounds

Investigating further led me to believe the error occurred at this call within GEOquery:::parseGSEMatrix():

tmpdat <- read.table(fname, sep = "\t", header = FALSE,
                     nrows = samples_header_row_count,
                     skip = sample_header_start - 1)

I downloaded the GSE27957_series_matrix.txt file from GEO and looked for what might be wrong. What I found was that the !Sample_data_processing entries had carriage return characters (\r) within each entry, for example:

!Sample_data_processing "probe group file: HuEx-1_0-st-v2.r2.pgf from Affymetrix
"   "probe group file: HuEx-1_0-st-v2.r2.pgf from Affymetrix
"   "probe group file: HuEx-1_0-st-v2.r2.pgf from Affymetrix
"   "probe group file: HuEx-1_0-st-v2.r2.pgf from Affymetrix
"   "probe group file: HuEx-1_0-st-v2.r2.pgf from Affymetrix
"   "probe group file: HuEx-1_0-st-v2.r2.pgf from Affymetrix
"   "probe group file: HuEx-1_0-st-v2.r2.pgf from Affymetrix
...

After going through and manually deleting these, getGEO() worked fine.
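The manual cleanup described above can also be scripted. Here is a minimal sketch in base R, assuming the series matrix file has already been downloaded and decompressed. `strip_carriage_returns` is a hypothetical helper name (not part of GEOquery), and the sample line is a made-up stand-in for the real file contents:

```r
# Hypothetical helper (not a GEOquery function): copy a file while dropping
# every carriage return byte (0x0D). Note this also turns CRLF into LF.
strip_carriage_returns <- function(infile, outfile) {
  bytes <- readBin(infile, what = "raw", n = file.size(infile))
  writeBin(bytes[bytes != as.raw(0x0d)], outfile)
  invisible(outfile)
}

# Made-up stand-in for a metadata line with \r inside quoted fields
src <- tempfile(fileext = ".txt")
dst <- tempfile(fileext = ".txt")
writeBin(
  charToRaw("!Sample_data_processing\t\"pgf from Affymetrix\r\"\t\"pgf from Affymetrix\r\"\n"),
  src
)

strip_carriage_returns(src, dst)

# Read the cleaned bytes back to confirm no carriage returns remain
cleaned <- readBin(dst, what = "raw", n = file.size(dst))
any(cleaned == as.raw(0x0d))
```

The cleaned file could then presumably be handed to getGEO() through its filename argument instead of letting it download the series matrix again.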

bjmt commented 5 years ago

Oops. Actually, it would be more accurate to say the error occurs at:

datamat <- read_tsv(fname, quote = "\"", na = c("NA", "null", "NULL", "Null"),
                    skip = series_table_begin_line,
                    comment = "!series_matrix_table_end",
                    skip_empty_rows = FALSE)

wilburnguo commented 4 years ago

Sorry, can you tell me how to solve this problem?

assaron commented 4 years ago

I have the same problem with GSE53258.

Apparently, the problem is an incorrect value passed as skip = series_table_begin_line.

@seandavi this can be fixed by explicitly splitting the file contents on \n to build dat:

    text <- readr::read_file(fname)
    dat <- strsplit(text, "\n", fixed = TRUE)[[1]]

However, this breaks parsing of GSE781. It seems the behavior differs between read.table and read_tsv. Maybe switch to the latter completely?
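To illustrate why the skip value can end up wrong, here is a small base R sketch. It assumes the table start is located by counting lines with readLines(), which accepts a lone \r as an end-of-line marker, whereas splitting on \n alone does not. The file contents are a made-up miniature of a series matrix, and readChar() stands in for readr::read_file() to avoid the dependency:

```r
# Made-up miniature series matrix: one quoted field contains a lone \r
f <- tempfile(fileext = ".txt")
writeBin(
  charToRaw(paste0(
    "!Sample_data_processing\t\"a\rb\"\n",
    "!series_matrix_table_begin\n",
    "ID_REF\tGSM1\n"
  )),
  f
)

# readLines() treats LF, CRLF and a lone CR all as line terminators
by_readlines <- readLines(f)

# Splitting on "\n" only keeps the embedded \r inside its line
text <- readChar(f, nchars = file.size(f), useBytes = TRUE)
by_newline <- strsplit(text, "\n", fixed = TRUE)[[1]]

# Position of the table-begin marker under each way of counting lines
begin_readlines <- grep("^!series_matrix_table_begin", by_readlines)
begin_newline <- grep("^!series_matrix_table_begin", by_newline)

c(readLines = begin_readlines, newline_split = begin_newline)
```

If the two positions differ, a skip value computed from one way of counting will point at the wrong row for a reader that counts the other way, which matches the symptom reported here.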