openllb / hlb

A developer-first language to build and test any software efficiently
https://openllb.github.io/hlb/
Apache License 2.0
108 stars 12 forks source link

[WIP] Fix bogus session ID specified for "local" op #335

Closed aaronlehmann closed 1 year ago

aaronlehmann commented 1 year ago

The session ID specified was a randomly generated ID rather than the actual session ID. This casued a 5 second delay for every "local" op because of this code that tries to wait for the specified session: https://github.com/moby/buildkit/blob/abde08a5531d809a395cf648a31bca932b009af0/source/local/local.go#L95-L101

aaronlehmann commented 1 year ago

Not ready for review yet - this breaks at least one test case (but gives great performance improvements for others!)

aaronlehmann commented 1 year ago

This is becoming more complicated than I thought. It's more than I can figure out tonight, unfortunately.

The change here isn't working as intended, because the session gets created after (Local).Call (of course it is... Call populates SessionOpts, etc). So trying to get a session ID from the context either gives us a bogus session ID (like before), or no session ID at all.

Ideally, we would link the local op to the specific session it will be used with, since otherwise BuildKit iterates over running sessions until it finds one that works, which is somewhat nondeterministic. I'm not sure how to do this - it seems like it would require a big refactor to create the session early and then attach a FSSyncProvider later after processing the local ops. However, simply removing the bogus session ID to avoid the 5 second delay should be no worse than what we have today.

Yet, both this PR and outright removal of llb.SessionID break this test case:

fs default() {
        frontend "docker/dockerfile" with option {
                input "context" local("/path/to/context")
                input "dockerfile" local("/path/to/Dockerfile")
        }
}

Specifically, it results in:

 => ERROR local:///path/to/context
failed to solve: no local sources enabled

This was extremely confusing... BuildKit iterates through sessions until it finds one that works with either a bogus session ID or a missing session ID, so why does omitting the session ID make a difference?

I believe the answer comes down to timing. With the bogus session ID, BuildKit inserts a 5 second delay, and I think this gives time for some unrelated session to start up.

The interesting thing is that the solve inside (Frontend).Call seems to go fine, and it's only later in the top-level solve invoked by command.Run where we get the error. I don't understand why that solve would even see the local with the context path - that's a frontend option, so why should it be used outside the frontend solve? Anyway, this seems like the crux of the issue - we must be somehow duplicating this local outside the frontend solve it's intended to be used with, and it was only working by chance before because of the 5 second delay.

aaronlehmann commented 1 year ago

The problem was that (Frontend).Call was not preserving the session options, which are needed to support the context source after the Dockerfile has been transformed to LLB.

Closing in favor of #336.