openllb / hlb

A developer-first language to build and test any software efficiently
https://openllb.github.io/hlb/
Apache License 2.0
108 stars 12 forks source link

Revert lazy-eval #323

Closed coryb closed 2 years ago

coryb commented 2 years ago

The async usage is causing too many issues for us, we have not been able to use master for a while. It works great for small-ish graphs but where there are hundreds of vertices in the graph that are lazy evaluated we were seeing thousands of goroutines get launched which seemed to overwhelm buildkit, so we started seeing timeouts, disconnects and other issues that appear to be related to load/usage.

The concurrency limiting work helped with the load, but we ended up with deadlocks where all the running work was blocked on lazy-evals that were unable to run due to limits.

I tried several approaches to try to reproduce this in an isolated test case, but so far have been unable.

Lets revert the lazy-eval stuff for now until, hopefully we can come up with some test-cases to reproduce the issues and see if we can fix the laxy-eval stuff in the future.

coryb commented 2 years ago

I am going to hold off on this, we still see some issues even after this revert. The latest error we are seeing is:

Retrying build due to error: failed to execute hlb: failed to solve: ResourceExhausted: trying to send message larger than max (65666405 vs. 16777216)

which is unrelated to the async. I am trying to work through a git bisect now to try to pinpoint exactly where things starting having problems.

aaronlehmann commented 2 years ago

I think a key issue with the async ref resolution is that it leaks Build calls. NewRemoteDirectory's Build goroutine blocks inside the Build until the returned object's Close method is called, but this never happens.

Before this refactor, refs were always resolved and used right away, then closed: https://github.com/openllb/hlb/blob/5ab9a768e2b611e539b28c903a80cbffcf088b11/module/resolve.go#L402-L406

After the refactor, the refs are long-lived and are stored in the ast.Module structure for future use. The call to Close on the resolved directory was dropped in the commit that did this refactor, and doesn't appear to be invoked anywhere (removing Close from the ast.Directory interface and removing the implementation from remoteDirectory doesn't break compilation). This may have been intentional based an an assumption that there will be a reasonable number of imports; however some programatically generated use cases involve very large numbers of imports, leading to many parallel sessions.

This also explains why the concurrency limiter is causing problems. Because NewRemoteDirectory builds are never allowed to complete, we hit the concurrency limit and get stuck at that point.

I am not sure what the right fix is. One option is to go ahead with this revert, and that's probably the easiest path. I'm not sure if there's a place where resolved directories can safely be closed under the new design. If not, one approach would be to always close refs immediately after resolving them, but save the arguments we used for the resolution, so we can re-resolve later if needed. Again, I'm not sure if it's worth doing at this moment.

@hinshun Do you have any thoughts on this?

hinshun commented 2 years ago

I think it makes sense to revert the changes until the lifecycle of these refs can be better constrained in the next iteration of async.

coryb commented 2 years ago

I am going to close this one, after digging in a bit more we also need to revert several more PRs all the way back to #290 which is pretty painful with merge conflicts that are pretty difficult to resolve. The new idea is to reimplement NewRemoteDirectory so we can manage the lifecycle of the solve (current plan is to try to move solve into the Open/Stat calls so it is still lazy evaluated).

hinshun commented 2 years ago

@coryb Would it make sense to branch off from before #290 and work from there instead of reverting everything? Cherry-pick what you need, etc (or even name it as master, while keeping current master as another branch).

coryb commented 2 years ago

yeah, I started there, reset to 290 and started cherry picking, but it also just became a ton of merge conflicts very quickly.

I think the only real problem we are having is the remote directory lingering solve that never closes, so I think we can probably "fix forward" by simplifying that for now, it will be suboptimal but maybe viable. Will do some testing...