seandavi / GEOquery

The bridge between the NCBI Gene Expression Omnibus and Bioconductor
http://seandavi.github.io/GEOquery/

Add check for incomplete GPL files #112

Closed khughitt closed 3 years ago

khughitt commented 3 years ago

Recently, I noticed some significant changes in the output from a pipeline. I traced them back to the GPL files retrieved by getGEO(). At some point, I ended up with a bunch of partially downloaded GPL files, which getGEO() was reusing.

It should be easy to check for these and re-download them / refuse to work with invalid files (a check for !platform_table_end should be sufficient for most cases).
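As a rough illustration of the marker check suggested above, here is a minimal sketch in Python (not R, and not GEOquery code). The function name `gpl_file_is_complete` is hypothetical, and the sketch assumes the cached GPL file is a SOFT text file, optionally gzip-compressed, whose data table is closed by a `!platform_table_end` line.

```python
import gzip
import zlib
from pathlib import Path

# Marker that closes the platform data table in a GPL SOFT file.
END_MARKER = "!platform_table_end"


def gpl_file_is_complete(path):
    """Return True if the file can be read to the end and contains END_MARKER.

    Hypothetical helper: a file that is truncated before the marker, or whose
    gzip stream cannot be read to the end, is reported as incomplete.
    """
    path = Path(path)
    # Pick a reader based on the file extension (assumption: ".gz" means gzip).
    opener = gzip.open if path.suffix == ".gz" else open
    found = False
    try:
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            # Read every line so that gzip errors near the end also surface.
            for line in handle:
                if line.strip() == END_MARKER:
                    found = True
    except (OSError, EOFError, zlib.error):
        # Unreadable or truncated file.
        return False
    return found
```

A caller could delete and re-download any cached file for which this returns `False`. As noted above, the marker check only covers most cases; a platform record without a data table would need separate handling.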

I don't have the time right now to tackle this, but I'll try and come back to it in the future when I have some free time. Wanted to report the issue now so that others are aware in the meantime.

khughitt commented 3 years ago

Hmm. Just realized that, in a couple of cases, this is occurring for the data as well (i.e. corrupt/incomplete files that were previously downloaded are being reused).

The gzip file is clearly truncated ("gzip: unexpected end of file"), and getGEO() emits a warning when it encounters the last line with fewer columns than expected, e.g.:

Warning: 1 parsing failure.
 row col     expected      actual                                                                  file
6104  -- 1039 columns 123 columns

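However, a truncation could fall exactly on a line boundary, in which case every remaining row would still have the expected number of columns and no warning would be raised. It is therefore probably better to test the file with gzip than to rely on column differences.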
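To make the gzip-based check concrete, here is a minimal Python sketch (again, not GEOquery code; `gzip_is_intact` is a hypothetical name). It decompresses the whole stream and discards the output, so a truncated archive is caught wherever the cut falls, including on a line boundary.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses cleanly to the end.

    Hypothetical helper: the decompressed bytes are thrown away, only the
    success or failure of reading the full stream matters.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Read in chunks until the end of the stream is reached.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # Not a gzip file, corrupted data, or a stream that ends too early.
        return False
    return True
```

This only shows that the archive is internally consistent. It cannot detect a file that is a valid gzip stream but differs from what the server holds, which is where checksums would help.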

I haven't noticed checksum files anywhere on the GEO FTP site, but if they exist, they would be another way to verify file integrity.

khughitt commented 3 years ago

I realized a simple check for exceptions/non-zero status codes should do the trick. Submitted a PR.
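For readers who want the general idea, the sketch below shows one way to keep a failed download out of a cache. It is written in Python, is not taken from the PR, and does not describe how GEOquery does it. The names `fetch_into_cache` and `fetch` are hypothetical; `fetch(url, path)` stands in for any downloader that writes to `path`, returns 0 on success, and either raises or returns a non-zero status on failure.

```python
import os
import tempfile


def fetch_into_cache(fetch, url, dest):
    """Download `url` to `dest`, leaving nothing behind if the download fails.

    `fetch` is a hypothetical callable: fetch(url, path) -> int status,
    where 0 means success. It may also raise on failure.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest))
    # Download to a temporary file in the same directory as the destination.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    os.close(fd)
    try:
        status = fetch(url, tmp_path)
        if status != 0:
            raise RuntimeError(f"download of {url} failed with status {status}")
        # Only a fully successful download is moved into place.
        os.replace(tmp_path, dest)
    except BaseException:
        # Remove the partial file so it can never be reused later.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest
```

The design choice here is that the destination path only ever holds the result of a download that reported success, so a later cache lookup cannot pick up a partial file.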