pepkit / geofetch

Builds a PEP from SRA or GEO accessions
https://pep.databio.org/geofetch/
BSD 2-Clause "Simplified" License

Add file size filter for processed files #44

Closed khoroshevskyi closed 2 years ago

khoroshevskyi commented 2 years ago

The ability to filter processed files by size before downloading would be a very useful option in geofetch.

nsheff commented 2 years ago

do you have access to the file size of the file so you can filter before downloading?

khoroshevskyi commented 2 years ago

do you have access to the file size of the file so you can filter before downloading?

Yes. For each ".tar" file, the same directory contains an additional file, "filelist.txt", which holds information about the files inside the archive. For example, for the Series GSE152804, all the files and "filelist.txt" can be found here:
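To illustrate the idea of reading sizes from "filelist.txt" before downloading, here is a minimal Python sketch. It assumes the file is tab-separated with a header row that includes "Name" and "Size" columns, with sizes given in bytes; that layout, the sample rows, and the helper names `read_filelist` and `keep_small_files` are all assumptions made for illustration, not geofetch's actual code.

```python
import csv
import io


def read_filelist(text):
    """Return (name, size_in_bytes) pairs from filelist.txt content.

    Assumes a tab-separated table with "Name" and "Size" columns.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    records = []
    for row in reader:
        name, size = row.get("Name"), row.get("Size")
        # Skip rows that are incomplete or whose size is not a plain integer.
        if name and size and size.isdigit():
            records.append((name, int(size)))
    return records


def keep_small_files(records, max_bytes):
    """Return the names of files whose size does not exceed max_bytes."""
    return [name for name, size in records if size <= max_bytes]


# Invented sample content, used only to exercise the functions.
sample = (
    "#Archive/File\tName\tTime\tSize\tType\n"
    "Archive\texample_RAW.tar\t01/01/2021 00:00:00\t5000000\tTAR\n"
    "File\tsample_a.bed.gz\t01/01/2021 00:00:00\t1200\tBED\n"
    "File\tsample_b.bw\t01/01/2021 00:00:00\t4998800\tBW\n"
)
records = read_filelist(sample)
print(keep_small_files(records, 10_000))
```

With this approach only the small "filelist.txt" has to be fetched to decide which files are worth downloading.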

nsheff commented 2 years ago

Excellent. You could make it accept a parameter in gigabytes; that would be useful. Or you could parse "GB" and "MB" so that users can pass values such as "5 GB" or "500 MB".
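A rough Python sketch of how such size strings might be parsed follows. The function name `parse_size`, the decimal unit factors (1 MB = 10^6 bytes), and the choice to treat a bare number as gigabytes are assumptions for illustration; they are not taken from geofetch itself.

```python
import re

# Assumed decimal units; switch to powers of 1024 if binary units are intended.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}
_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([KMG]?B)?\s*$", re.IGNORECASE)


def parse_size(text, default_unit="GB"):
    """Convert a string such as "5 GB" or "500MB" to a number of bytes.

    A bare number is interpreted in default_unit (gigabytes by default).
    """
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"Unrecognized size: {text!r}")
    number, unit = match.groups()
    factor = _UNITS[(unit or default_unit).upper()]
    return int(float(number) * factor)


print(parse_size("5 GB"))    # 5000000000
print(parse_size("500 MB"))  # 500000000
```

Rejecting unrecognized strings with a `ValueError` keeps a typo in the threshold from silently disabling the filter.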

khoroshevskyi commented 2 years ago

Supporting MB and GB is a great idea! However, there is one other problem: some files may not be inside a .tar archive, and therefore they are not listed in "filelist.txt". I see only one solution in this case: 1) download the index.html file (which holds information about the repository), and 2) retrieve the file sizes from that HTML document.

nsheff commented 2 years ago

I would say don't bother parsing HTML. If the data is not readily available, let's not do this; it's a small benefit.

khoroshevskyi commented 2 years ago

Added --filter-size argument: Fixed #50

nsheff commented 2 years ago

Added --filter-size argument: Fixed #50

Are you referring to the right issue here?

khoroshevskyi commented 2 years ago

This was fixed in v0.8.0.