snakemake / snakemake-storage-plugin-http

Snakemake storage plugin for downloading input files from HTTP(s).
MIT License

MissingInputException for Valid Downloadable URL #21

Closed FabianHofmann closed 1 month ago

FabianHofmann commented 5 months ago

I am encountering an unexpected error when using the storage plugin. The following link downloads an xlsx file from the Destatis database (https://www.destatis.de/DE/Home/_inhalt.html):

"https://www.destatis.de/EN/Themes/Economy/Prices/Publications/Downloads-Energy-Price-Trends/energy-price-trends-xlsx-5619002.xlsx?__blob=publicationFile"

The link has no redirects and works properly when opened in the browser or fetched with requests.get. However, when I use it within the storage function, as in

rule retrieve_irena:
    input:
        storage(
            "https://www.destatis.de/EN/Themes/Economy/Prices/Publications/Downloads-Energy-Price-Trends/energy-price-trends-xlsx-5619002.xlsx?__blob=publicationFile",
        ),

the workflow throws the following error:

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
MissingInputException in rule retrieve_irena in file /home/fabian/playground/snakemake-storage/Snakefile, line 1:
Missing input files for rule retrieve_irena:
    affected files:
        https://www.destatis.de/EN/Themes/Economy/Prices/Publications/Downloads-Energy-Price-Trends/energy-price-trends-xlsx-5619002.xlsx (storage)

I tried to understand what is going on, but could not resolve it. It seems to me like a bug, but perhaps I am missing a required setting.

Hugovdberg commented 5 months ago

It appears that several different causes can all result in the same MissingInputException. It could be an authentication issue (that happened to me today), but I suspect that in this case the ?__blob=publicationFile at the end is what causes the problem. This URL, for example, seems to work just fine: http://wettelijkerente.net/wettelijkerente2.csv

Hugovdberg commented 5 months ago

Ah no, I found the issue for your URL. Snakemake uses requests.head to get some initial metadata about the file without downloading it in its entirety. On this server that request returns an HTTP 303 status, which tells the client to redirect elsewhere, and following that redirect returns an HTTP 400 'Bad Request'. So Snakemake assumes that every HTTP server supports both the HEAD and GET HTTP verbs, but that is not the case on this server.
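To check whether a given server treats HEAD and GET differently, one can issue both requests and compare the final status codes. The following is only a rough sketch: `compare_head_and_get` is a hypothetical helper, not part of Snakemake or the plugin, and it assumes a session object that behaves like `requests.Session` (anything exposing `head` and `get` with the same keyword arguments).

```python
def compare_head_and_get(session, url, timeout=30):
    """Return the final status codes of a HEAD and a GET request to `url`.

    `session` is assumed to behave like `requests.Session`; this helper is
    illustrative and is not how the plugin itself is implemented.
    """
    # Follow redirects explicitly, since HEAD requests do not follow them
    # by default in requests.
    head = session.head(url, allow_redirects=True, timeout=timeout)

    # stream=True defers downloading the body, so only the status line
    # and headers are read before the connection is released.
    get = session.get(url, allow_redirects=True, stream=True, timeout=timeout)
    get.close()

    return {"HEAD": head.status_code, "GET": get.status_code}
```

With the requests library installed, this could be called as `compare_head_and_get(requests.Session(), url)`. If the server behaves as described above, the two status codes will differ.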

I think the best way to fix this would be to add a configuration flag on the storage provider, supports_http_head, which defaults to True but can be set to False to use GET for querying the metadata as well. Alternatively, an allow_http_get_fallback flag could be created instead, which defaults to False but, when set to True, would fall back to GET on certain HTTP status codes. However, it might be quite tricky to get the correct set of status codes, because I think the error 400 would actually be a code on which you would not normally retry with GET. Therefore the supports_http_head flag seems to me to be the better approach. I will create a pull request to implement this shortly.
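To make the first proposal concrete, here is a minimal sketch of how a supports_http_head switch might select the request method. It is not the plugin's actual implementation: `fetch_metadata` and its signature are hypothetical, and `session` is again assumed to behave like `requests.Session`.

```python
def fetch_metadata(session, url, supports_http_head=True, timeout=30):
    """Return (status_code, headers) for `url` without downloading the body.

    Hypothetical illustration of the proposed `supports_http_head` flag:
    when it is False, a streamed GET is used in place of HEAD.
    """
    if supports_http_head:
        response = session.head(url, allow_redirects=True, timeout=timeout)
    else:
        # Streamed GET: headers are available immediately and the body is
        # never read because the response is closed right away.
        response = session.get(
            url, allow_redirects=True, stream=True, timeout=timeout
        )
        response.close()
    return response.status_code, dict(response.headers)
```

An explicit flag like this avoids having to decide which status codes should trigger a retry with GET, which is the difficulty noted for the fallback variant.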

@johanneskoester is there a way to make the MissingInputException give more feedback for remote files? Once a network is involved, there are many reasons for a file to be (temporarily) not found, even for a valid resource.