nixbuild.net User Feedback

Weird issues with remote store builds #14

Open michaelpj opened 1 year ago

michaelpj commented 1 year ago

This is all trying to set up some CI for a repository. I happen to be expecting the build to fail on a dependency (I tried it locally), but I'm getting weird errors from Nix and not actually a failed derivation build with logs.

  1. "don't know how to build these paths"

Unsure if this is a problem, haven't managed to get to the end so far due to the other issues.

Appears in e.g. https://github.com/input-output-hk/cardano-haskell-packages/actions/runs/4418243762/jobs/7745186405

  2. "dependency failed"

Maybe a dependency failed to build? But no logs?

https://github.com/input-output-hk/cardano-haskell-packages/actions/runs/4417986503/jobs/7744547717

...
building '/nix/store/62yv7ln16y8z7aslfmmv28iv79m7639x-cardano-crypto-class-lib-cardano-crypto-class-2.0.0.1.drv'...
building '/nix/store/dr7z7k8pbsl2nbw8z6iavi6h9h6v6ng2-composition-prelude-lib-composition-prelude-3.0.0.2.drv'...
building '/nix/store/0640v1g1ndpzrzgqn33jx14g0zncfaqs-algebraic-graphs-lib-algebraic-graphs-0.7.drv'...
building '/nix/store/mw1r5c762sx26d7gmnkzmqy6zpi9jxbn-plutus-core-lib-index-envs-1.1.1.0.drv'...
building '/nix/store/zajkhml7r5cfldwajffxhmgcbj73n4gz-bimap-lib-bimap-0.5.0.drv'...
building '/nix/store/qbxrn1gikafjhpfpy7a476w8kfbwwf0m-happy-1.19.11-setup.drv'...
building '/nix/store/bdxvq2kb7vy836557qp4xahkpcyrs1bm-th-lift-instances-lib-th-lift-instances-0.1.20-ghc-8.10.7-env.drv'...
error: [nixbuild.net] '/nix/store/i22fvb148z1z4kp421lslna1nsaxb36h-plutus-core-1.1.1.0.drv': dependency failed
       build cancelled

  3. "non-zero padding"

https://github.com/input-output-hk/cardano-haskell-packages/actions/runs/4418243762/jobs/7745186405

don't know how to build these paths:
  /nix/store/i22fvb148z1z4kp421lslna1nsaxb36h-plutus-core-1.1.1.0.drv
copying 0 paths...
copying 0 paths...
building '/nix/store/62yv7ln16y8z7aslfmmv28iv79m7639x-cardano-crypto-class-lib-cardano-crypto-class-2.0.0.1.drv'...
building '/nix/store/clvhbygc0bnl1phlf7b7yx770n1v0n7g-multiset-lib-multiset-0.3.4.3.drv'...
building '/nix/store/3y5a2c6qab2ji8328f93z4i3hs1vyg24-quickcheck-transformer-lib-quickcheck-transformer-0.3.1.2.drv'...
building '/nix/store/6knjyvlqg8w3hg4ib1xrx2248gamkibn-prettyprinter-configurable-1.1.0.0-setup.drv'...
building '/nix/store/i593ml0h5d570wbhksv1a0gqf9ambz42-dictionary-sharing-lib-dictionary-sharing-0.1.0.0.drv'...
building '/nix/store/0kbfq5251vxb1sw2m9416l9l7bp491r0-testing-type-modifiers-lib-testing-type-modifiers-0.1.0.1.drv'...
building '/nix/store/0640v1g1ndpzrzgqn33jx14g0zncfaqs-algebraic-graphs-lib-algebraic-graphs-0.7.drv'...
building '/nix/store/aqm71ywlkdhrz36i2yxn3azs666z3kqm-base64-bytestring-lib-base64-bytestring-1.2.1.0.drv'...
building '/nix/store/8m1s5p7v9i34y0dcy2jy3v2wqwq9brnl-dom-lt-lib-dom-lt-0.2.3.drv'...
building '/nix/store/rb7bzf0w3spxmz23zlq4gjgw7ldzgqcv-deriving-compat-lib-deriving-compat-0.6.3.drv'...
building '/nix/store/xq8kg6w1rsn3k5wzysfgrwlj2nvl3qvc-tasty-hunit-lib-tasty-hunit-0.10.0.3-ghc-8.10.7-env.drv'...
building '/nix/store/dr7z7k8pbsl2nbw8z6iavi6h9h6v6ng2-composition-prelude-lib-composition-prelude-3.0.0.2.drv'...
building '/nix/store/5mx9ciyqlgmls94fy0c9q7rmy0yyvvcs-lazysmallcheck-lib-lazysmallcheck-0.6.drv'...
building '/nix/store/ysg4a3vccabdgdi5zd7jr0pb3fg9jrb4-parser-combinators-lib-parser-combinators-1.3.0-ghc-8.10.7-env.drv'...
error: non-zero padding

I seem to have progressed to just getting the last issue and haven't got past it.

michaelpj commented 1 year ago

Worth mentioning that these builds do IFD, no idea how that plays out with remote store builds.

rickynils commented 1 year ago

Thanks for reporting this!

  1. "don't know how to build these paths"

This is normal behavior. Nix outputs this locally before kicking off the build remotely. It is a bit weird and should be fixed in Nix.

  2. "dependency failed"

Some input of the target derivation failed to build. Unfortunately, the build that actually failed is sometimes not shown. This is most probably a bug in nixbuild.net, although I'm unsure how vanilla Nix behaves; I think you could get missing logs there too, since it only shows the last lines of failing builds. Issue #6 is related to this. Some research and testing is needed to make nixbuild.net match (or improve on) the behavior of plain Nix. A workaround you can use is to add --print-build-logs. This will make all logs visible in the output, which can help you pinpoint the failure (after some scrolling...).

error: non-zero padding

This is something I haven't seen before. I assume it is your local Nix that prints that message. Maybe nixbuild.net sends something incorrect. What version of Nix are you using?

michaelpj commented 1 year ago

What version of Nix are you using?

Not sure, I was using the cachix install-nix action. I bumped the version of that and that seems to have helped, thanks.

Unfortunately, the build that actually failed is sometimes not shown. I'm unsure how vanilla Nix behaves; I think you could get missing logs there too, since it only shows the last lines of failing builds.

After updating my install-nix action version, which presumably got me a newer Nix, I got a much better error:

error: [nixbuild.net] '/nix/store/i22fvb148z1z4kp421lslna1nsaxb36h-plutus-core-1.1.1.0.drv': dependency failed
       '/nix/store/w8rfhrj8iwwhs3j739iqvbc5drvvmjr0-plutus-core-exe-cost-model-budgeting-bench-1.1.1.0.drv': dependency failed
       '/nix/store/62yv7ln16y8z7aslfmmv28iv79m7639x-cardano-crypto-class-lib-cardano-crypto-class-2.0.0.1.drv': build failed: Cached build failure: builder for '/nix/store/62yv7ln16y8z7aslfmmv28iv79m7639x-cardano-crypto-class-lib-cardano-crypto-class-2.0.0.1.drv' failed with exit code 1

That does actually tell me what failed! I still would really like to see the logs from the failing derivation, I guess I'll use -L for now.
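
Until the failing logs are surfaced directly, the root cause can be picked out of a "dependency failed" chain like the one above by looking for the entry marked "build failed". A minimal sketch, assuming the error output has been saved to a file; the helper name `root_failures` is made up for illustration, and the sample input is the error quoted above:

```shell
# Hypothetical helper: print the derivations reported as "build failed"
# (as opposed to "dependency failed") in saved nix error output.
root_failures() {
  # $1: file containing the saved stderr of the nix build
  grep -o "'/nix/store/[^']*\.drv': build failed" "$1" \
    | sed "s/^'\(.*\)': build failed\$/\1/" \
    | sort -u
}

# Sample input: the error message quoted in this thread.
log=$(mktemp)
cat > "$log" <<'EOF'
error: [nixbuild.net] '/nix/store/i22fvb148z1z4kp421lslna1nsaxb36h-plutus-core-1.1.1.0.drv': dependency failed
       '/nix/store/w8rfhrj8iwwhs3j739iqvbc5drvvmjr0-plutus-core-exe-cost-model-budgeting-bench-1.1.1.0.drv': dependency failed
       '/nix/store/62yv7ln16y8z7aslfmmv28iv79m7639x-cardano-crypto-class-lib-cardano-crypto-class-2.0.0.1.drv': build failed: Cached build failure: builder for '/nix/store/62yv7ln16y8z7aslfmmv28iv79m7639x-cardano-crypto-class-lib-cardano-crypto-class-2.0.0.1.drv' failed with exit code 1
EOF

# Prints only the cardano-crypto-class derivation path.
root_failures "$log"
```

This only helps when the failing build is actually listed, which, per the discussion here, is not guaranteed.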

rickynils commented 1 year ago

@michaelpj I think, even with the newer Nix version, you could get the "dependency failed" issue, because it is nixbuild.net itself that outputs those logs, and there is some bug that causes it to sometimes not show the actual build that failed.

What Nix does in this situation is to print the last 10 lines of the build that failed. However, sometimes that is not enough to find the error. And in the case of remote-store building it is actually not possible to show more lines unless you change the Nix config on the remote machine itself.

What we plan on doing in nixbuild.net is to first of all fix the bugs so the output is in line with standard Nix. Then we could perhaps add a nixbuild.net setting that controls the number of log lines shown. Even better would be to print a URL with a link to the failing build log.

As you say, I think using -L is safest for now so that you see all logs of all builds.

michaelpj commented 1 year ago

Okay, new issues!

https://github.com/input-output-hk/cardano-haskell-packages/actions/runs/4429490683/jobs/7770142361

building '/nix/store/l4348b1a1lb3qck2py53hw8fpij3pl3v-dummy-ghc.drv' on 'ssh://eu.nixbuild.net'...
building '/nix/store/6v3vk8rcv0cxps3mn79qsmjnw0malrgk-dummy-pkg-ghc-8.10.7.drv' on 'ssh://eu.nixbuild.net'...
copying 0 paths...
error: path '/nix/store/7z3l3r1dqzrlr4qmrfgpc8cxjld9q1db-cabal.project.drv' is not a valid store path
copying 0 paths...
error: path '/nix/store/xbsysxaxsbxyd9rh469bihsch3cvqrbg-dummy-ghc-8.10.7.drv' is not a valid store path
copying 0 paths...
error: path '/nix/store/0k0c51xmbj3lvqcqvgqg2dmlm86b2wgn-dummy-ghc.drv' is not a valid store path
copying 0 paths...
copying path '/nix/store/m557ji8jnlqzv00k6rry6npivr1mj2n2-nix-prefetch-git' from 'https://cache.zw3rk.com/'...
error: builder for '/nix/store/7z3l3r1dqzrlr4qmrfgpc8cxjld9q1db-cabal.project.drv' failed with exit code 1
error: builder for '/nix/store/0k0c51xmbj3lvqcqvgqg2dmlm86b2wgn-dummy-ghc.drv' failed with exit code 1
error: 1 dependencies of derivation '/nix/store/n3bznrdz3lp17axgywsvgpmax2bc1k4g-plutus-core-1.1.1.0-plan-to-nix-pkgs.drv' failed to build

I can run the exact same command line locally (i.e. building on nixbuild.net with remote store) and it works. I've got Nix 2.13.2 locally, and the GHA is using 2.13.3. I guess that final minor version might make a difference but not sure 🤔

rickynils commented 1 year ago

I think this is related to IFD, but also to the Nix configuration on the GHA runner. You can see in the logs that it builds on ssh://eu.nixbuild.net, but the remote store should be ssh-ng://eu.nixbuild.net, right? The GHA runner (using nixbuild-action) sets up ssh://eu.nixbuild.net as a remote builder, and when you build your IFD-derivation it will use that remote builder during evaluation, and then switch to the remote store (ssh-ng://) for realisation. I don't know exactly why it doesn't work, but I think Nix doesn't copy the .drv-files to the remote store in this case.
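
One way to spot this situation in a saved CI log is to look for `building ... on 'ssh://...'` lines, which show that a remote builder (rather than the `ssh-ng://` remote store) picked up the build. A rough sketch; the helper name `builds_on_ssh_builder` is hypothetical, and the sample input is taken from the log above:

```shell
# Hypothetical helper: list builds that were dispatched to an ssh:// remote
# builder, based on Nix's "building '<drv>' on '<uri>'..." log lines.
builds_on_ssh_builder() {
  # $1: file containing the saved build output
  grep "^building '.*' on 'ssh://" "$1"
}

# Sample input: three lines from the log quoted in this thread.
log=$(mktemp)
cat > "$log" <<'EOF'
building '/nix/store/l4348b1a1lb3qck2py53hw8fpij3pl3v-dummy-ghc.drv' on 'ssh://eu.nixbuild.net'...
building '/nix/store/6v3vk8rcv0cxps3mn79qsmjnw0malrgk-dummy-pkg-ghc-8.10.7.drv' on 'ssh://eu.nixbuild.net'...
copying 0 paths...
EOF

# Prints the two "building ... on 'ssh://eu.nixbuild.net'" lines.
builds_on_ssh_builder "$log"
```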

To work around this, add --builders "" --max-jobs 2 to your Nix invocation. We are doing this in the CI workflow, where remote store building is used. I don't know why we haven't documented this properly.
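
Put together with the remote store flags used elsewhere in this thread, the invocation might look like the sketch below. The wrapper name `remote_store_build`, the `DRY_RUN` switch and the `.#default` installable are invented for illustration; the flags themselves are the ones mentioned in this thread.

```shell
# Hypothetical wrapper around `nix build` for remote store building on
# nixbuild.net. With DRY_RUN=1 it only prints the arguments, one quoted word
# each, instead of running the command.
remote_store_build() {
  # $1: installable to build, e.g. a flake reference
  set -- nix build "$1" \
    --print-build-logs \
    --builders "" \
    --max-jobs 2 \
    --eval-store auto \
    --store ssh-ng://eu.nixbuild.net
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf "'%s' " "$@"
    printf '\n'
  else
    "$@"
  fi
}

# Placeholder installable; prints the command rather than executing it.
DRY_RUN=1 remote_store_build '.#default'
```

The `--builders ""` and `--max-jobs` overrides should only matter on machines where nixbuild.net is also configured as a remote builder, as nixbuild-action does on the GHA runner.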

rickynils commented 1 year ago

@michaelpj https://github.com/nixbuild/nixbuild-action/commit/711a1d1c3b1b506a0f1c2aacc5ca6a76724b21a5

michaelpj commented 1 year ago

Does that mean we won't get more than 2x build parallelism on nixbuild.net? That would be a shame!

michaelpj commented 1 year ago

It's also weird that it worked for me locally 🤔 I guess it's non-deterministic?

michaelpj commented 1 year ago

Ah, on my machine I don't have nixbuild.net set up as a remote builder globally, I use it ad-hoc via --builders. So I guess when I use the remote store I don't hit the weird case where it's also set up as a builder.

We are doing this in the CI workflow

Looks like that includes all the options I ended up having to set :D Probably worth documenting all of them, including --print-build-logs? It would have saved me some time to just add those to the list of flags in the docs!

rickynils commented 1 year ago

Does that mean we won't get more than 2x build parallelism on nixbuild.net? That would be a shame!

No, this is only for the builds that Nix must run during evaluation, because of IFD. I actually think there is no parallelism at all during evaluation, so --max-jobs 1 would do. But --max-jobs 0 would not work, and that is what the nixbuild-action normally configures for the GHA runner.

So I guess when I use the remote store I don't hit the weird case where it's also set up as a builder.

Yes, this is correct. You would have hit the issue locally too if you had the remote builder set up there.

Looks like that includes all the options I ended up having to set :D Probably worth documenting all of them, including --print-build-logs? It would have saved me some time to just add those to the list of flags in the docs!

Yes, I will do that! Thank you for your patience :)

michaelpj commented 1 year ago

Okay, I got some successful builds, hooray!

I also think it would be worth documenting the "don't know how to build these paths" thing. It's unusual so it looked very suspicious to me, it would be helpful to have something saying it was normal.

rickynils commented 1 year ago

@michaelpj https://github.com/nixbuild/nixbuild-action/#remote-store-building

michaelpj commented 1 year ago

A new one, this one locally:

error: unimplemented worker op: WopQueryRealisation

I guess this is due to your implementation still being partial!

rickynils commented 1 year ago

Yeah, do you know what you did to trigger this?

michaelpj commented 1 year ago

Running a nix build command locally with remote store building. It builds all the derivations successfully, and then that's the final output.

rickynils commented 1 year ago

@michaelpj Did you by any chance forget to provide the --eval-store auto option?

michaelpj commented 1 year ago

Nope, that's definitely set.

rickynils commented 1 year ago

Hmm, strange, this is not something I've seen before. What version of Nix are you using?

michaelpj commented 1 year ago

2.14.1. This doesn't seem to cause any problems, but does seem to happen every time.

rickynils commented 1 year ago

Hmm, OK I'll try again to see if I can reproduce this. If you have some example build that this happens for it would be very welcome.

michaelpj commented 1 year ago

I'll get you one.

michaelpj commented 1 year ago

Try this:

nix build 'github:input-output-hk/cardano-haskell-packages#"ghc8107/word-array/0.1.0.0"' --eval-store auto --store ssh-ng://eu.nixbuild.net

rickynils commented 1 year ago

@michaelpj It seems to start building ghc-8.6.5 locally. Are you using IFD that somehow involves ghc?

michaelpj commented 1 year ago

Yes for sure. Sorry, you probably need --accept-flake-config true also, so that you get the caches!

michaelpj commented 1 year ago

(For future context: we're building a bunch of stuff with haskell.nix, which does indeed run Haskell code at evaluation time in order to compute build plans. It works surprisingly well :D )

rickynils commented 1 year ago

I still haven't been able to reproduce the error: unimplemented worker op: WopQueryRealisation issue, even with your build. Have you seen it again?

michaelpj commented 1 year ago

Happens every time for me.

Thinking: I think "realisations" are to do with CA-derivations, and I do have experimental-features = ca-derivations locally. Does setting that make any difference to whether you see it?

michaelpj commented 1 year ago

Yep, if I turn that off I don't get the error. I guess that points the finger fairly clearly, but probably that's not a high priority right now.
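
For anyone hitting the same message, a quick way to check whether `ca-derivations` is switched on in a given nix.conf is sketched below. The helper name `has_ca_derivations` is made up, and it only inspects the one file it is given; settings pulled in through `include` lines, `NIX_CONFIG` or command-line flags are not seen.

```shell
# Hypothetical helper: succeed if the given nix.conf enables ca-derivations
# via an `experimental-features` or `extra-experimental-features` line.
has_ca_derivations() {
  # $1: path to a nix.conf file
  grep -E '^[[:space:]]*(extra-)?experimental-features[[:space:]]*=' "$1" \
    | grep -Eq '(^|[[:space:]=])ca-derivations([[:space:]]|$)'
}

# Sample config, written to a temporary file for illustration.
conf=$(mktemp)
printf 'experimental-features = nix-command flakes ca-derivations\n' > "$conf"

if has_ca_derivations "$conf"; then
  echo "ca-derivations is enabled in $conf"
else
  echo "ca-derivations is not enabled in $conf"
fi
```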