pepkit / geofetch

Builds a PEP from SRA or GEO accessions
https://pep.databio.org/geofetch/
BSD 2-Clause "Simplified" License

Disable downloading huge soft files #102

Closed khoroshevskyi closed 1 year ago

khoroshevskyi commented 1 year ago

Some of the SOFT files are larger than 10 MB. I think we should skip downloading them unless a particular argument is set.

Information about the SOFT files can be found here, e.g. https://ftp.ncbi.nlm.nih.gov/geo/series/GSE199nnn/GSE199233/soft/

Use 'requests.head' to get the size of the file, and fail if necessary.
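A minimal sketch of the size gate proposed above, assuming the size in bytes has already been obtained somehow. The names `should_download`, `DEFAULT_MAX_SOFT_SIZE`, and `allow_large` are hypothetical and are not part of geofetch; letting a file through when its size is unknown is also an assumption, not a decision made in this thread.

```python
from typing import Optional

# Hypothetical default: the 10 MB threshold mentioned above, taken here as 10 MiB.
DEFAULT_MAX_SOFT_SIZE = 10 * 1024 * 1024


def should_download(
    size_bytes: Optional[int],
    max_size: int = DEFAULT_MAX_SOFT_SIZE,
    allow_large: bool = False,
) -> bool:
    """Decide whether a SOFT file should be downloaded.

    size_bytes  -- size reported by the server, or None if it is unknown
    max_size    -- largest size accepted without an explicit override
    allow_large -- stands in for the "particular argument" that lifts the limit
    """
    if allow_large or size_bytes is None:
        # Assumption: an unknown size does not block the download.
        return True
    return size_bytes <= max_size
```

With this shape, the override could be wired to a command-line flag, and the caller decides whether a `False` result means skipping the file or failing the run.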

khoroshevskyi commented 1 year ago

@nsheff We can't get file size information from the API (HEAD request). What we could do is parse the web page using e.g. beautifulsoup. But we talked about that a few months ago, and the decision was not to do it.

nsheff commented 1 year ago

No, don't scrape the website. Just construct the http url to the file, and then HEAD it.

e.g.:

curl -I https://ftp.ncbi.nlm.nih.gov/geo/series/GSE107nnn/GSE107227/soft/GSE107227_family.soft.gz
HTTP/1.1 200 OK
Date: Wed, 30 Nov 2022 18:53:19 GMT
Server: Apache
Last-Modified: Fri, 04 Nov 2022 05:17:32 GMT
ETag: "75b-5ec9e30d9cf04"
Accept-Ranges: bytes
Content-Length: 1883
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET,POST,PUT,OPTIONS
Access-Control-Allow-Headers: RANGE, Cache-control, If-None-Match, Content-Type
Access-Control-Expose-Headers: Content-Length, Content-Range, Content-Type
Content-Type: application/x-gzip

khoroshevskyi commented 1 year ago

Yes, but there is no information about the size of the file

nsheff commented 1 year ago

Yes there is, it's under Content-Length: 1883

That is the file size in bytes.
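The same check as the curl call could be sketched in Python roughly as follows. It uses the standard library's `urllib` so it has no dependencies; `requests.head(url).headers` exposes the same `Content-Length` header. The function names `soft_url` and `remote_size` are hypothetical, and the `GSE…nnn` directory pattern is inferred only from the two URLs quoted in this thread.

```python
import re
import urllib.request
from typing import Optional

GEO_SERIES_ROOT = "https://ftp.ncbi.nlm.nih.gov/geo/series"


def soft_url(gse: str) -> str:
    """Build the URL of the family SOFT file for a GSE accession.

    Assumption: the parent directory is the accession with its last
    three digits replaced by "nnn" (e.g. GSE107227 -> GSE107nnn).
    """
    match = re.fullmatch(r"GSE(\d+)", gse)
    if match is None:
        raise ValueError(f"not a GSE accession: {gse!r}")
    digits = match.group(1)
    stub = f"GSE{digits[:-3]}nnn"
    return f"{GEO_SERIES_ROOT}/{stub}/{gse}/soft/{gse}_family.soft.gz"


def remote_size(url: str, timeout: float = 10.0) -> Optional[int]:
    """Send a HEAD request and return Content-Length in bytes, if present."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        length = response.headers.get("Content-Length")
    # The server may omit the header, so the size can be unknown.
    return int(length) if length is not None else None
```

Calling `remote_size(soft_url("GSE107227"))` sends the same HEAD request as the curl example; the value it returns depends on what the server reports at that moment.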

khoroshevskyi commented 1 year ago

ohh, I see, my bad