moby / buildkit

concurrent, cache-efficient, and Dockerfile-agnostic builder toolkit
https://github.com/moby/moby/issues/34227
Apache License 2.0
8.15k stars 1.15k forks source link

Additional User-Agent HTTPOption for downloads #4513

Open gliwka opened 10 months ago

gliwka commented 10 months ago

I'm currently using BuildKit to build various binary packages. This involves downloading sources from all over the web using HTTP. Unfortunately, some of those locations handle requests based on User-Agent and serve common command-line user agents 302 redirects to the actual files, while serving HTML pages to the rest.

A prominent example is Sourceforge. While an automated download is possible using wget, other user agents (including BuildKit) receive an HTML-based redirect instead of an 302 to the actual file.

Would you be willing to accept a pull request that implements an additional HTTPOption that enables users to specify the User-Agent as part of the download llb (similar to the checksum one)? Or alternatively, a global option to override the User-Agent being used?

jedevc commented 9 months ago

Heya :wave:

A prominent example is Sourceforge. While an automated download is possible using wget, other user agents (including BuildKit) receive an HTML-based redirect instead of an 302 to the actual file.

:cry: this is bizarre behavior for a site, the user agent shouldn't completely change the type of response imho. But, oh well, people do it anyways :smile:

Would you be willing to accept a pull request that implements an additional HTTPOption that enables users to specify the User-Agent as part of the download llb (similar to the checksum one)? Or alternatively, a global option to override the User-Agent being used?

I think this makes sense, I'd be happy to take this I think! cc @tonistiigi @thaJeztah (since this would probably need to be exposed as --user-agent or similar in the dockerfile.

I think we'd need this as part of the LLB. As a global option would be a bit strange since 1. this makes builds unportable (e.g. if my dockerfile downloads from sourceforge, then it's not a good user experience if I need to configure buildkit to do that), and 2. it would apply globally, with no real way for any build to opt out.

gliwka commented 9 months ago

Sounds good :-) I‘ll implement it as additional llb HttpOption and will send a PR your way.

gliwka commented 9 months ago

I've drafted https://github.com/moby/buildkit/pull/4518. Will something like this work for you?

thaJeztah commented 9 months ago

Sorry for being late catching up. I'm a bit on the fence on adding this as an option in the Dockerfile syntax; more so as the mentioned use-case seems like a very, very corner-case (and I would consider that to be a bug on Sourceforge's side - which already has a pretty bad reputation nowadays).

I'm afraid adding this option will be a sliding slope / open a can of worms, and before we know it, we're re-inventing cURL (what other headers or options are needed for a full-fledged HTTP client?).

While not ideal, I think there's workarounds for this case (intermediate stage with cURL or wget installed, or mounting a static busybox binary if you want to be creative).

gliwka commented 9 months ago

The suggested workaround works, though as you mentioned, is not ideal. I interact with buildkit through llb directly, so I workaround this with an additional run with a wget image and then copy the files over.

Unfortunately as much as SourceForge has a bad reputation, many very popular open source packages still have their sources there. They don't consider this a bug either, but a feature to have most browsers hitting their HTML pages.

I understand If you decide against including this in buildkit. In that case, could we change the scope of the change to include the buildkit default user agent when fetching files (same as it's currently implemented for images)? Currently the default golang user agent is being used for this use case, while in other cases the one provided by version.userAgent() is being used for http requests.

Then I can try to persuade SourceForge to include detection for buildkit.

jedevc commented 9 months ago

I'm afraid adding this option will be a sliding slope / open a can of worms, and before we know it, we're re-inventing cURL (what other headers or options are needed for a full-fledged HTTP client?).

I don't think the can of worms is too large here - we should draw the line firmly at what the go http client offers - no extra features on top of that. In reality, from the fields available in Request that I think it makes any sense to allow users to potentially control are Method and Header. We could allow arbitrary headers instead of specifically just user-agents, but I do wonder whether this expands the scope too much. (I'm not suggesting we should add these for this PR, but just that I think that that's where the sliding slope ends).

I interact with buildkit through llb directly, so I workaround this with an additional run with a wget image and then copy the files over.

I have a similar use case in https://github.com/dagger/dagger - users can download files with client.HTTP(url), but this becomes a bit more limited if you can't control much of the request parameters. The workaround with dagger is to use the native functionality in your calling language to download the file, but then you lose out on all buildkit caching functionality that uses modification time and etags to avoid downloading large files unnecessarily when they haven't changed (which helps us be good citizens to the projects that users are downloading data from, by reducing our bandwidth on them).

Perhaps such a change just makes sense in LLB, and not in the dockerfile syntax? That sounds like it would be good enough for the use cases described so far.

gliwka commented 9 months ago

LLB only works for me, I've just added the Dockerfile syntax for completeness. If that's the point of contention, I'm happy to resubmit without the changes to the Dockerfile.

thaJeztah commented 9 months ago

Yeah, perhaps LLB makes more sense (which could be a more generic "headers" option)

crazy-max commented 9 months ago

Yeah I think the change in LLB-only makes sense for now and also this new field should set http headers, not just the user-agent one and therefore be a map.

gliwka commented 9 months ago

Sounds good 👍

I will need to think a bit about how to deal with the headers already set by BuildKit itself and how to include them in the cache key, before I'll open a new PR for allowing the headers to be specified in general as HTTPOption.

thaJeztah commented 9 months ago

and therefore be a map.

note that headers may be specified multiple times, so it would have to be a map[string][]string, but in that case, perhaps http.Header would be best, which does all that 😅

gliwka commented 9 months ago

I will need to extend the SourceOp type to implement this feature.

https://github.com/moby/buildkit/blob/8849789cf8abdc7d63ace61f8dc548582d22f3b5/client/llb/source.go#L21-L28

Currently all HTTPOptions are carried as part of the generic attributes after serialisation: https://github.com/moby/buildkit/blob/8849789cf8abdc7d63ace61f8dc548582d22f3b5/client/llb/source.go#L569

Since HTTP-Headers are not a flat map, but can have the same header multiple times, the flat attrbute map will not be enough to carry headers without applying some encoding to force it into the map shape, which doesn't seem elegant to me.

Do you have any guidance or ideas for me on how to evolve the SourceOp type, that work well in protobuf form as well? SourceOp is currently quite generic and currently doesn't carry any source-specific information in its type definition.

tonistiigi commented 9 months ago

While generic headers in LLB is more flexible I'm not sure if we really want to allow control internals of client-side behavior. Eg. things like keep-alive, authorization etc. Cache-control headers are already provided by the HTTP source. It might be better to define a safe list and additionally allow X-*.

tonistiigi commented 9 months ago

SourceOp is currently quite generic and currently doesn't carry any source-specific information in its type definition.

I think go with the attrs map and apply some encoding.

thaJeztah commented 9 months ago

Since HTTP-Headers are not a flat map, but can have the same header multiple times, the flat attrbute map will not be enough to carry headers without applying some encoding to force it into the map shape

Yeah, the http.Header type (https://pkg.go.dev/net/http#Header) would provide the right structure (but I'm not familiar with how these types are used in BuildKit to say if it's possible to use that type directly).

I'm not sure if we really want to allow control internals of client-side behavior. Eg. things like keep-alive, authorization etc. Cache-control headers are already provided by the HTTP source. It might be better to define a safe list and additionally allow X-*.

Yes, an allow-list (specific headers that can be overridden) and X-** headers makes sense; ISTR we do something similar in docker/cli which also allows for custom headers to be set, but also has similar constraints.

panekj commented 5 months ago

Seems like there is a duplicate issue: