hercules-ci / hercules-ci-agent

https://hercules-ci.com build and deployment agent
Apache License 2.0

Derivation fails to build with "Could not retrieve derivation" #314

Open Mic92 opened 3 years ago

Mic92 commented 3 years ago

Description

The agent cannot build a derivation. However, building the same derivation with nix-build works:

FatalError {fatalErrorMessage = "Could not retrieve derivation 4hifafcpl36hynsggim166ngbyhwzvmn-nixos-system-bernie-21.11.20210813.f95a858.drv from local store or binary caches."} (worker: ExitFailure 1)
$ nix-build /nix/store/bswnna1nnkqmxwarl979jpnjk6hda5nn-nixos-system-bernie-21.11.20210813.f95a858

To Reproduce

I use hercules-ci-agent on master: https://github.com/Mic92/dotfiles/blob/f52d35be19c16cc166040ab5c7cdbb4686f8434f/flake.nix#L73

The configuration is pretty standard: https://github.com/Mic92/dotfiles/blob/master/nixos/eve/modules/hercules-ci.nix

I use nixUnstable as shown here: https://github.com/Mic92/dotfiles/blob/f52d35be19c16cc166040ab5c7cdbb4686f8434f/nixos/configurations.nix#L43

Expected behavior

The derivation builds

Logs

Link to the failed repository:

https://hercules-ci.com/accounts/github/Mic92/derivations/%2Fnix%2Fstore%2F4hifafcpl36hynsggim166ngbyhwzvmn-nixos-system-bernie-21.11.20210813.f95a858.drv/log?via-job=c25712e1-7ee0-4ebe-a86d-98034ca84661

https://github.com/Mic92/dotfiles/blob/master/ci.nix

Platform / Version

Best to go to https://hercules-ci.com/dashboard and click on the agents' tab for the account you're interested in. hercules-ci-agent --help version

0.8.2, happens both on aarch64-linux and x86_64-linux.

Mic92 commented 3 years ago

Looks like trying to rebuild works around the issue: https://hercules-ci.com/github/Mic92/dotfiles/jobs/270

roberth commented 3 years ago

I haven't seen this occur in evaluation before. Any chance garbage collection was running at the time of the error? The first error occurred at 2021-08-16 15:13:53+00 (UTC).

Mic92 commented 2 years ago

I have not seen this issue in a while. I will re-open if it comes back.

dhess commented 2 years ago

Can we re-open this? I've just started re-evaluating Hercules CI and I'm getting this quite a bit, on multiple derivations, on both x86_64-darwin and x86_64-linux machines:

FatalError {fatalErrorMessage = "Could not retrieve derivation 6a0syh00vimpdrccnq9n210c5bbinqs1-source.drv from local store or binary caches."} (worker: ExitFailure 1)

Restarting the job on x86_64-linux often seems to fix the issue. On x86_64-darwin it's reproducible.

My config is as stated here: https://github.com/hercules-ci/hercules-ci-agent/issues/231#issuecomment-999969758

dhess commented 2 years ago

We use a private binary cache on S3. All of our builders have access to it, but only via the Nix daemon, which has the proper credentials in its ~root/.aws/credentials file.

Does the Hercules CI agent attempt to read derivations from any binary caches directly? That might explain what's going on in our case, since the agent doesn't have the credentials.

(Note that the Hercules CI agent is not configured to write to this S3 private binary cache. For now, we're writing Hercules CI build products to a new Cachix cache, so that's the only cache configured in binary-caches.json.)

edit: Unfortunately not. I've now granted read-only access to the S3 private binary cache to our Hercules CI agents via ~hercules-ci-agent/.aws/credentials and this error is still happening.

roberth commented 2 years ago

> (Note that the Hercules CI agent is not configured to write this S3 private binary cache. For now, we're writing Hercules CI build products to a new Cachix cache, so that's the only cache configured in binary-caches.json.)

If you've configured that cache on all agents, it should be usable for the purpose of distributing the drvs. The evaluating agent will push it to Cachix and the building agent will pull from it.

> Does the Hercules CI agent attempt to read derivations from any binary caches directly?

It does so because it can be significantly faster than downloading the whole drv closure. You could try with services.hercules-ci-agent.settings.nixUserIsTrusted = lib.mkForce false; to disable this behavior and let the daemon handle the substitution. It should just fetch it from your Cachix cache in the first place though.
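
For reference, a minimal sketch of how that setting might look in a NixOS configuration. It assumes the hercules-ci-agent NixOS module is already imported and that the rest of the agent configuration lives elsewhere; only the `nixUserIsTrusted` line comes from the suggestion above, and the surrounding module skeleton is illustrative.

```nix
{ lib, ... }:
{
  # Assumption: the agent is enabled through the hercules-ci-agent NixOS module.
  services.hercules-ci-agent.enable = true;

  # Stop the agent from reading binary caches directly; substitution is then
  # left to the Nix daemon. mkForce takes priority over the value the module
  # would otherwise set for this option.
  services.hercules-ci-agent.settings.nixUserIsTrusted = lib.mkForce false;
}
```

With this in place, the daemon's own credentials and substituter configuration are what matter for fetching the drv closure, at the cost of the faster direct download described above.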

dhess commented 2 years ago

> Does the Hercules CI agent attempt to read derivations from any binary caches directly?

> It does so because it can be significantly faster than downloading the whole drv closure. You could try with services.hercules-ci-agent.settings.nixUserIsTrusted = lib.mkForce false; to disable this behavior and let the daemon handle the substitution. It should just fetch it from your Cachix cache in the first place though.

Right, that makes sense. In any case, see my edit to the comment above: I've added the necessary read-only access for the Hercules CI agent to our private S3 binary cache, but this problem persists.

roberth commented 2 years ago

It turns out that the problem was not with the cache configuration. I've opened #355 to solve this particular problem ("Could not retrieve derivation"). The real root cause is a crash in the evaluator, which I will also debug; see #356.

zowoq commented 1 year ago

Had this occur twice in the last two weeks in nix-community; agents are at 0.9.10.

https://github.com/nix-community/infra/pull/428 https://github.com/nix-community/infra/pull/441

Another:

https://github.com/nix-community/infra/pull/454

These PRs all include a nixpkgs staging-next merge.

roberth commented 1 year ago

Observed this after restarting a job. Might be related to having a concurrent job or concurrent restart job.

zowoq commented 1 year ago

Concurrent jobs with a large number of "new" derivations (a nixpkgs update that includes a staging-next merge) seem to have been what was causing this in nix-community/infra.