seandavi / GEOquery

The bridge between the NCBI Gene Expression Omnibus and Bioconductor
http://seandavi.github.io/GEOquery/

Annotating `!Sample` headers fails #62

Closed kalugny closed 6 years ago

kalugny commented 6 years ago

```
GEOquery::getGEO('GSE30134')

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 21 did not have 39 elements
Traceback:

1. GEOquery::getGEO("GSE30134")
2. getAndParseGSEMatrices(GEO, destdir, AnnotGPL = AnnotGPL, getGPL = getGPL, 
 .     parseCharacteristics = parseCharacteristics)
3. parseGSEMatrix(destfile, destdir = destdir, AnnotGPL = AnnotGPL, 
 .     getGPL = getGPL, parseCharacteristics = parseCharacteristics)
4. read.table(textConnection(grep("^!Sample_", dat, value = TRUE)), 
 .     sep = "\t", header = FALSE)
5. scan(file = file, what = what, sep = sep, quote = quote, dec = dec, 
 .     nmax = nrows, skip = 0, na.strings = na.strings, quiet = TRUE, 
 .     fill = fill, strip.white = strip.white, blank.lines.skip = blank.lines.skip, 
 .     multi.line = FALSE, comment.char = comment.char, allowEscapes = allowEscapes, 
 .     flush = flush, encoding = encoding, skipNul = skipNul)
```

kalugny commented 6 years ago

In this specific file, some long lines are split across several `!Sample` header lines.
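
A rough way to locate such lines is to count the tab-separated fields per header line and flag the ones that differ from the widest line. This is only a diagnostic sketch: the `header_lines` vector is made-up toy input standing in for the `!Sample_` lines of a series matrix file, and it is not taken from GSE30134.

```r
# Toy input (hypothetical): the second header was broken across two physical lines.
header_lines <- c(
  "!Sample_title\t\"a\"\t\"b\"",
  "!Sample_description\t\"x\"",
  "\"y\"\t\"z\""
)

# count.fields() reports the number of fields on each line.
con <- textConnection(header_lines)
n_fields <- count.fields(con, sep = "\t", quote = "\"")
close(con)

# Lines whose field count differs from the widest line are the suspects.
suspect <- which(n_fields != max(n_fields))
print(data.frame(line = seq_along(n_fields), fields = n_fields))
print(suspect)
```

Any line reported in `suspect` would trip `read.table()` with its default `fill = FALSE`, which matches the kind of "did not have N elements" error in the traceback.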

kalugny commented 6 years ago

I think that, in general, the functions that process the files shouldn't fail completely when there is a problem parsing the metadata, as it is sometimes messy in GEO and you still want the data itself. This is relevant for #59 as well.
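
As an illustration of that suggestion, the sketch below wraps the sample-header parsing in `tryCatch()` so that malformed metadata yields a warning and `NULL` rather than an error. The helper name `parse_sample_headers` and the toy inputs are hypothetical; this is not how GEOquery itself is implemented.

```r
# Hypothetical helper: parse the "!Sample_" lines, but degrade gracefully.
parse_sample_headers <- function(lines) {
  sample_lines <- grep("^!Sample_", lines, value = TRUE)
  tryCatch(
    read.table(text = sample_lines, sep = "\t", header = FALSE,
               quote = "\"", stringsAsFactors = FALSE),
    error = function(e) {
      # Report the problem but let the caller carry on with the data itself.
      warning("Could not parse sample headers: ", conditionMessage(e))
      NULL
    }
  )
}

# Toy inputs (made up): one well-formed header block, one with a short line.
good <- c("!Sample_title\t\"a\"\t\"b\"",
          "!Sample_geo_accession\t\"GSM1\"\t\"GSM2\"")
bad  <- c("!Sample_title\t\"a\"\t\"b\"",
          "!Sample_geo_accession\t\"GSM1\"")

res_good <- parse_sample_headers(good)
res_bad  <- suppressWarnings(parse_sample_headers(bad))
```

A caller could then check `is.null()` on the result and continue with the expression table when the metadata could not be read.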

seandavi commented 6 years ago

Just a followup note:

The parsing problems arose because `read.table` (and `read_tsv`) treat line endings differently when "skipping" lines than when actually reading them. All of the work was in just a few lines of the header and sample-header parsing in `parseGSEMatrix`. I think we should be back to normal on that front.
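
One general way to avoid depending on skip counts at all is to read the file's lines once and then select each section by its prefix or marker. The sketch below shows that idea on toy input; the function name `split_series_matrix` is hypothetical and this is not claimed to be the change that was made in `parseGSEMatrix`.

```r
# Hypothetical helper: partition series matrix lines without using `skip`.
split_series_matrix <- function(lines) {
  begin <- grep("^!series_matrix_table_begin", lines)
  end   <- grep("^!series_matrix_table_end", lines)
  stopifnot(length(begin) == 1, length(end) == 1, begin < end)
  list(
    series = grep("^!Series_", lines, value = TRUE),
    sample = grep("^!Sample_", lines, value = TRUE),
    # Everything strictly between the two table markers.
    table  = lines[seq(begin + 1, length.out = end - begin - 1)]
  )
}

# Toy input (made up) in the shape of a series matrix file.
lines <- c(
  "!Series_title\t\"toy series\"",
  "!Sample_title\t\"a\"\t\"b\"",
  "!series_matrix_table_begin",
  "\"ID_REF\"\t\"GSM1\"\t\"GSM2\"",
  "\"probe1\"\t1.5\t2.5",
  "!series_matrix_table_end"
)

parts <- split_series_matrix(lines)
expr  <- read.table(text = parts$table, sep = "\t", header = TRUE, quote = "\"")
```

Because every section is selected from the same vector of lines, the header and the data table cannot disagree about where one ends and the other begins.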