seandavi / GEOquery

The bridge between the NCBI Gene Expression Omnibus and Bioconductor
http://seandavi.github.io/GEOquery/

Large number of parsing failures #69

Closed FarzanT closed 3 years ago

FarzanT commented 6 years ago

Hello, I've been having issues accessing CNV data from a particular dataset:

prostate_cnv <-
    GEOquery::getGEO(
        GEO = "GSE73012",
        destdir = "GEO_Files/",
        GSEMatrix = T,
        parseCharacteristics = T
    )

The following is the output I get when I run the above command:

> prostate_cnv <-
+     GEOquery::getGEO(
+         GEO = "GSE73012",
+         destdir = "GEO_Files/",
+         GSEMatrix = T,
+         parseCharacteristics = T
+     )
Found 1 file(s)
GSE73012_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE73nnn/GSE73012/matrix/GSE73012_series_matrix.txt.gz'
Content type 'application/x-gzip' length 3742 bytes
==================================================
downloaded 3742 bytes

Parsed with column specification:
cols(
  ID_REF = col_character(),
  GSM1880761 = col_character(),
  GSM1880762 = col_character(),
  GSM1880763 = col_character(),
  GSM1880764 = col_character(),
  GSM1880765 = col_character(),
  GSM1880766 = col_character(),
  GSM1880767 = col_character()
)
File stored at:
GEO_Files//GPL16104.soft
|==================================================| 100%   78 MB
Warning: 58443 parsing failures.
    row col   expected   actual file
2321413 Chr   an integer MT     literal data
2321414 Chr   an integer MT     literal data
2321415 Chr   an integer MT     literal data
2321416 Chr   an integer MT     literal data
2321417 Chr   an integer MT     literal data
[... truncated]
Warning message:
In rbind(names(probs), probs_f) :
  number of columns of result is not a multiple of vector length (arg 1)

What could be the issue?

Thank you for your time

seandavi commented 6 years ago

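The issue is that some chromosome labels are not numeric: X, Y, and XY, plus the MT entries that appear in your output. I am using readr to read the data, and readr guesses each column's type from the first rows of the file (1000 by default). In this dataset, the Chr column holds integers everywhere except in the last rows, so readr guesses integer and then reports a parsing failure for every non-numeric label it meets later. This is just a warning, so you'll notice that the GEO record is still returned.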
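To make the type-guessing point concrete, here is a minimal, standalone sketch with readr. It is not GEOquery's internal code: the tiny table and the `ID`, `Chr`, and `Position` column names are made up for illustration. It shows that forcing the chromosome column to integer turns a label such as MT into a parsing failure (stored as NA), while declaring the column as character keeps it intact.

```r
library(readr)

# Hypothetical three-row annotation table (tab-separated).
# A string containing newlines is treated by readr as literal data.
tsv <- "ID\tChr\tPosition\nrs1\t1\t1000\nrs2\t22\t2000\nrs3\tMT\t300\n"

# Forcing Chr to integer reproduces the failure: "MT" cannot be parsed,
# so readr warns and stores NA for that cell.
forced <- suppressWarnings(
  read_tsv(tsv, col_types = cols(Chr = col_integer(), .default = col_guess()))
)

# Declaring Chr as character avoids the guess entirely and keeps "MT".
annot <- read_tsv(
  tsv,
  col_types = cols(Chr = col_character(), .default = col_guess())
)

print(forced$Chr)
print(annot$Chr)
```

When a file is read with guessed types, `readr::problems()` can be called on the result to list the rows that failed to parse.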

That said, there are no data in this particular GSEMatrix, as the submitter did not include processed data. You'll need to download and then process the raw data files to get results. The phenotype data that you got in the GEO record from GEOquery is still useful for attaching to the data after you process them.
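A rough sketch of that workflow, assuming the series has supplementary raw files on GEO (not verified here) and that GEOquery and Biobase are installed. The helper name `fetch_raw_and_pheno` and its arguments are hypothetical. Processing the raw files depends on the platform and is not shown.

```r
# Sketch only: calling this function needs network access.
fetch_raw_and_pheno <- function(accession = "GSE73012", dest = "GEO_Files") {
  dir.create(dest, showWarnings = FALSE, recursive = TRUE)

  # With GSEMatrix = TRUE, getGEO returns a list of ExpressionSet objects,
  # one per platform in the series.
  gse <- GEOquery::getGEO(accession, destdir = dest, GSEMatrix = TRUE)

  # Phenotype (sample annotation) table from the first ExpressionSet.
  pheno <- Biobase::pData(gse[[1]])

  # Download the supplementary files attached to the series, if any.
  # The result is NULL when the series has none.
  supp <- GEOquery::getGEOSuppFiles(accession, baseDir = dest)

  list(pheno = pheno, supp_files = rownames(supp))
}

# Example call (commented out because it downloads data):
# res <- fetch_raw_and_pheno()
# head(res$pheno)
# res$supp_files
```

The sample names in the processed data can then be matched against the row names of `res$pheno` to attach the annotations.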

seandavi commented 3 years ago

GEOquery no longer gives a warning here.