guma44 / GEOparse

Python library to access Gene Expression Omnibus Database (GEO)
BSD 3-Clause "New" or "Revised" License
download fails because of content-length mismatch when content-encoding=gzip #80

Open vttrifonov opened 2 years ago

vttrifonov commented 2 years ago

get_GEO('GPL16417') fails with:

OSError: Download failed due to 'Downloaded size do not match the expected size for http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL16417&form=text&view=full'. ID could be incorrect or the data might not be public yet.

The issue is that Downloader._download_http assumes that Content-Length equals the size of the decoded body. That is not the case when Content-Encoding is gzip: Content-Length is then the compressed size (i.e. the size after encoding/compression), so comparing it against the number of decoded bytes fails.
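To illustrate the mismatch without touching the network, here is a minimal sketch using only the standard library. The payload and the variable names are made up for the example; it only shows that the size on the wire differs from the size after decoding, not how GEOparse computes either number.

```python
import gzip

# Stand-in for a GEO SOFT text response (hypothetical content).
# It is highly repetitive, so it compresses well.
body = b"!Platform_title = example\n" * 1000

# What a server would put on the wire when Content-Encoding is gzip.
wire_bytes = gzip.compress(body)

# Content-Length describes the bytes on the wire, i.e. the compressed body.
content_length = len(wire_bytes)

# A client that transparently decodes gzip hands back the original body.
decoded_size = len(gzip.decompress(wire_bytes))

# A check of this shape fails for a gzip-encoded response.
size_matches = decoded_size == content_length
print(content_length, decoded_size, size_matches)
```

The decoded size is much larger than Content-Length here, so a strict equality check between the two rejects a download that actually completed.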

It is not clear how to get the size of a chunk before decoding/decompression unless you deal with the raw stream directly, where every chunk is chunk_size bytes except the last one. It might be best to drop the Content-Length enforcement.
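For reference, a rough sketch of what "dealing with the raw stream directly" could look like: count the undecoded bytes against Content-Length and decompress them yourself. The function download_checked, its arguments, and the plain-dict headers are hypothetical and not part of GEOparse. With requests, the raw chunks could presumably come from response.raw.stream(chunk_size, decode_content=False), but that part is an assumption and is not shown here.

```python
import gzip
import zlib


def download_checked(raw_chunks, headers):
    """Hypothetical helper: verify Content-Length against wire bytes.

    raw_chunks: iterable of undecoded byte chunks as received.
    headers: plain dict with exact-case header names (an assumption).
    """
    expected = headers.get("Content-Length")
    is_gzip = headers.get("Content-Encoding", "").lower() == "gzip"
    # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS) if is_gzip else None

    received = 0
    out = bytearray()
    for chunk in raw_chunks:
        received += len(chunk)  # size before decoding
        out += decoder.decompress(chunk) if decoder else chunk
    if decoder:
        out += decoder.flush()

    if expected is not None and received != int(expected):
        raise OSError("Downloaded size does not match the expected size")
    return bytes(out)


# Simulated gzip response split into 64-byte chunks.
body = b"!Platform_title = example\n" * 500
wire = gzip.compress(body)
chunks = [wire[i:i + 64] for i in range(0, len(wire), 64)]
headers = {"Content-Length": str(len(wire)), "Content-Encoding": "gzip"}

result = download_checked(chunks, headers)
```

This keeps the truncation check meaningful, at the cost of handling decompression by hand. Dropping the check is simpler but gives up that protection.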

bionewplayer commented 1 year ago

You can try setting os.environ['GEOPARSE_USE_HTTP_FOR_FTP'] = 'yes' before calling get_GEO('GPL16417').