nrwl / nx

Smart Monorepos · Fast CI
https://nx.dev
MIT License

Upgrading to NX 19 causes CI to fail with `The Nx Cloud heartbeat process failed to report its status in time` #25300

Open eamon0989 opened 1 month ago

eamon0989 commented 1 month ago

Current Behavior

I opened a PR on our repo to upgrade Nx from 18 to 19, and our CI fails constantly with the error message: The Nx Cloud heartbeat process failed to report its status in time.

We have other PRs open on Nx 18, and Nx Cloud works as expected there. There are no other changes in the PR except upgrading nx. We upgraded using nx migrate latest. This has happened with 19.0.4, 19.0.5, and 19.0.6. I haven't tried 19.0.1 to 19.0.3.

When I open Nx Cloud and look at the pipeline execution, it runs and finishes as expected, so it seems there has been some change in the way the heartbeat reports its status?

The exact error we get is as follows:

 NX   Unable to complete a run.

CI execution '9192644671-1' failed. Reason: The Nx Cloud heartbeat process failed to report its status in time. Visit https://nx.dev/ci/recipes/troubleshooting/ci-execution-failed#the-nx-cloud-heartbeat-process-failed-to-report-its-status-in-time for more information.

Error: CI execution '9192644671-1' failed. Reason: The Nx Cloud heartbeat process failed to report its status in time. Visit https://nx.dev/ci/recipes/troubleshooting/ci-execution-failed#the-nx-cloud-heartbeat-process-failed-to-report-its-status-in-time for more information.
    at Tr.startV2 (/home/runner/work/platform/platform/.nx/cache/cloud/2405.02.15.hotfix5/index.js:38:5691)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async wB (/home/runner/work/platform/platform/.nx/cache/cloud/2405.02.15.hotfix5/index.js:38:10372)
    at async Object.EB (/home/runner/work/platform/platform/.nx/cache/cloud/2405.02.15.hotfix5/index.js:38:11066)
    at async anyFailuresInPromise (/home/runner/work/platform/platform/node_modules/nx/src/tasks-runner/run-command.js:227:26)
    at async invokeTasksRunner (/home/runner/work/platform/platform/node_modules/nx/src/tasks-runner/run-command.js:192:23)
    at async /home/runner/work/platform/platform/node_modules/nx/src/tasks-runner/run-command.js:89:24
    at async handleErrors (/home/runner/work/platform/platform/node_modules/nx/src/utils/params.js:10:16)
    at async runCommand (/home/runner/work/platform/platform/node_modules/nx/src/tasks-runner/run-command.js:83:20)
    at async Object.runMany (/home/runner/work/platform/platform/node_modules/nx/src/command-line/run-many/run-many.js:45:24)

 NX   Completing with an error

Expected Behavior

I expect that Nx Cloud will continue to communicate with my CI runner and report the success or failure of the jobs.

GitHub Repo

No response

Steps to Reproduce

The main job (the one that fails):

  main:
    runs-on:
      labels: [ubuntu-latest-4-cores]
    needs: [init, api-specs, regenerate]
    if: |
      !failure() && !cancelled() &&
      (github.event_name == 'release' ||
      (needs.init.outputs.has-projects-affected == 'true' &&
      (github.event_name == 'push' || !github.event.pull_request.draft)))
    timeout-minutes: 30
    env:
      NX_BASE: ${{ needs.init.outputs.base }}
      NX_HEAD: ${{ needs.init.outputs.head }}
      NX_CLOUD_DISTRIBUTED_EXECUTION_AGENT_COUNT: ${{ needs.init.outputs.agents-count }}

    services:
      rabbitmq:
        image: ghcr.io/s1seven/rabbitmq:latest
        credentials:
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
        env:
          RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS:
            '-rabbitmq_auth_backend_http topic_path "http://172.17.0.1:3000/topic"'
        ports:
          - 5672:5672

      postgres:
        image: postgres
        env:
          POSTGRES_DB: postgres
        ports:
          - 5432:5432

    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ needs.init.outputs.fetch-ref }}

      - name: Compute Git fetch depth
        id: fetch-depth
        run: echo "value=${{ needs.regenerate.outputs.base-depth || needs.init.outputs.base-depth }}" >> $GITHUB_OUTPUT

      - name: ${{ env.STEP_SETUP_PROJECT }}
        id: setup
        uses: ./.github/actions/checkout-and-yarn
        with:
          fetch-depth: ${{ steps.fetch-depth.outputs.value }}
          fetch-ref: ${{ needs.init.outputs.fetch-ref }}
          node-version: ${{ env.NODE_VERSION }}
          token: ${{ secrets.GITHUB_TOKEN }}

      - name: Coordinates Nx agents
        run: npx nx-cloud start-ci-run

      - name: Get Nx apps to build
        id: build-apps
        run: |
          if [[ "${{ github.event_name }}" == "release" ]]; then
            echo "list=$(yarn get:apps)" >> $GITHUB_OUTPUT
          else
            echo "list=$(yarn affected:apps | tr -d ' ')" >> $GITHUB_OUTPUT
          fi

      - name: log nx apps to build
        run: echo "${{ steps.build-apps.outputs.list }}"

      - name: Run verifications for affected apps
        uses: jameshenry/parallel-bash-commands@v1
        with:
          cmd1: NX_CLOUD_DISTRIBUTED_EXECUTION=false yarn scan:openapi
          cmd2: npx nx affected --target=lint --parallel=4 --verbose
          cmd3: npx nx affected --target=stylelint --parallel=4 --verbose
          cmd4: npx nx affected --target=test --parallel=4 --exclude=platform,tools --ci --verbose
          cmd5: npx nx run-many --target=build --parallel=4 --projects=${{ steps.build-apps.outputs.list }} --verbose
          # running e2e tests in parallel create conflicts in DB
          cmd6: NX_CLOUD_DISTRIBUTED_EXECUTION=false yarn affected:e2e:backend --ci --verbose

      - name: Stop Nx agents
        if: always()
        run: npx nx-cloud stop-all-agents

Nx agents workflow:


  # NX AGENTS
  agents:
    runs-on:
      labels: [ubuntu-latest-4-cores]
    needs: [init, api-specs, regenerate]
    if: |
      !failure() && !cancelled() &&
      (github.event_name == 'release' ||
      (needs.init.outputs.has-projects-affected == 'true' &&
      (github.event_name == 'push' || !github.event.pull_request.draft)))
    name: Agent
    timeout-minutes: 25
    env:
      NX_CLOUD_DISTRIBUTED_EXECUTION_AGENT_COUNT: ${{ needs.init.outputs.agents-count }}

    services:
      postgres:
        image: postgres
        env:
          POSTGRES_DB: postgres
        ports:
          - 5432:5432

      redis:
        image: redis
        ports:
          - 6379:6379

    strategy:
      matrix:
        # number of agents proportional to number of affected projects
        agent: ${{ fromJSON(needs.init.outputs.agents-matrix) }}

    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ needs.init.outputs.fetch-ref }}

      - name: ${{ env.STEP_SETUP_PROJECT }}
        id: setup
        uses: ./.github/actions/checkout-and-yarn
        with:
          fetch-ref: ${{ needs.init.outputs.fetch-ref }}
          node-version: ${{ env.NODE_VERSION }}
          token: ${{ secrets.GITHUB_TOKEN }}

      - name: Start Nx Agent ${{ matrix.agent }}
        run: npx nx-cloud start-agent

Nx Report

npx nx report       
 NX  Falling back to ts-node for local typescript execution. This may be a little slower.
  - To fix this, ensure @swc-node/register and @swc/core have been installed

 NX   Report complete - copy this into the issue template

Node   : 20.12.1
OS     : darwin-arm64
yarn   : 3.6.3

nx                 : 19.0.5
@nx/js             : 19.0.5
@nrwl/js           : 18.0.8
@nx/jest           : 19.0.5
@nrwl/jest         : 18.0.8
@nx/linter         : 18.0.8
@nx/eslint         : 19.0.5
@nx/workspace      : 19.0.5
@nrwl/workspace    : 18.0.8
@nx/angular        : 19.0.5
@nx/cypress        : 19.0.5
@nx/devkit         : 19.0.5
@nrwl/devkit       : 18.0.8
@nx/eslint-plugin  : 19.0.5
@nx/nest           : 19.0.5
@nx/node           : 19.0.5
@nrwl/node         : 17.2.8
@nx/plugin         : 19.0.5
@nrwl/nx-plugin    : 18.0.8
@nrwl/tao          : 18.0.8
@nx/web            : 19.0.5
@nx/webpack        : 19.0.5
typescript         : 5.4.4
---------------------------------------
Community plugins:
@auth0/auth0-angular   : 2.2.3
@getlarge/nx-heroku    : 0.4.2
@jscutlery/semver      : 4.2.0
@maskito/angular       : 1.9.0
@ngneat/tailwind       : 7.0.3
@ngneat/transloco      : 4.3.0
@nx-tools/nx-container : 5.1.0
@nx/aws-lambda         : 17.2.3
@rx-angular/cdk        : 16.0.0
@rx-angular/state      : 16.0.0
@rx-angular/template   : 16.0.1
@taiga-ui/cdk          : 3.62.0
@taiga-ui/core         : 3.62.0
nx-stylelint           : 17.1.1
---------------------------------------
The following packages should match the installed version of nx
  - @nrwl/js@18.0.8
  - @nrwl/jest@18.0.8
  - @nx/linter@18.0.8
  - @nrwl/workspace@18.0.8
  - @nrwl/devkit@18.0.8
  - @nrwl/node@17.2.8
  - @nrwl/nx-plugin@18.0.8
  - @nrwl/tao@18.0.8

To fix this, run `nx migrate nx@19.0.5`

Failure Logs

NX   No explicit --base argument provided, but found environment variable NX_BASE so using its value as the affected base: 8ea68662f42017e67eeb5e6c57ff11a90a6536df

 NX   No explicit --head argument provided, but found environment variable NX_HEAD so using its value as the affected head: 06349c93d7ab444677ca8d3b209472925d14f14c

 NX   No explicit --base argument provided, but found environment variable NX_BASE so using its value as the affected base: 8ea68662f42017e67eeb5e6c57ff11a90a6536df

 NX   No explicit --head argument provided, but found environment variable NX_HEAD so using its value as the affected head: 06349c93d7ab444677ca8d3b209472925d14f14c

 NX   No explicit --base argument provided, but found environment variable NX_BASE so using its value as the affected base: 8ea68662f42017e67eeb5e6c57ff11a90a6536df

 NX   No explicit --head argument provided, but found environment variable NX_HEAD so using its value as the affected head: 06349c93d7ab444677ca8d3b209472925d14f14c

[NX CLOUD] Verifying current cloud bundle
[NX CLOUD] A local bundle currently exists:  {
  version: '2405.02.15.hotfix5',
  fullPath: '/home/runner/work/platform/platform/.nx/cache/cloud/2405.02.15.hotfix5'
}
[NX CLOUD] Last verification was within the past 30 minutes, will not verify this time
[NX CLOUD] Done:  /home/runner/work/platform/platform/.nx/cache/cloud/2405.02.15.hotfix5

 NX   Trying to create heartbeat background process for run group: 9192644671-1

[Nx Cloud Debug] Attempting to acquire filesystem lock with path:  /tmp/run-group-9192644671-1-marker.lock
[Nx Cloud Debug] Successfully created folder lock at path: /tmp/run-group-9192644671-1-marker.lock
[Nx Cloud Debug] Attempting to write current PID to owner file: 4610
[Nx Cloud Debug] Successfully acquired lock

 NX   Heartbeat process started successfully with PID 4722

 NX   Starting distributed command execution (v2)

 NX   Starting a distributed execution

{
  "command": "nx run-many --target=build --parallel=4 --projects=api-gateway,frontend,logger-service,user-service --verbose",
  "branch": "1496",
  "runGroup": "9192644671-1",
  "ciExecutionId": "9192644671-1",
  "ciExecutionEnv": "",
  "stopAgentsOnFailure": false,
  "retryFlakyTasks": true,
  "maxParallel": 4,
  "command": "nx affected --target=stylelint --parallel=4 --verbose",
  "branch": "1496",
  "runGroup": "9192644671-1",
  "ciExecutionId": "9192644671-1",
  "ciExecutionEnv": "",
  "stopAgentsOnFailure": false,
  "retryFlakyTasks": true,
  "maxParallel": 4,
  "agentCount": 8
}

 NX   Unable to complete a run.

CI execution '9192644671-1' failed. Reason: The Nx Cloud heartbeat process failed to report its status in time. Visit https://nx.dev/ci/recipes/troubleshooting/ci-execution-failed#the-nx-cloud-heartbeat-process-failed-to-report-its-status-in-time for more information.

Error: CI execution '9192644671-1' failed. Reason: The Nx Cloud heartbeat process failed to report its status in time. Visit https://nx.dev/ci/recipes/troubleshooting/ci-execution-failed#the-nx-cloud-heartbeat-process-failed-to-report-its-status-in-time for more information.
    at Tr.startV2 (/home/runner/work/platform/platform/.nx/cache/cloud/2405.02.15.hotfix5/index.js:38:5691)

    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
 NX   Completing with an error
    at async wB (/home/runner/work/platform/platform/.nx/cache/cloud/2405.02.15.hotfix5/index.js:38:10372)

    at async Object.EB (/home/runner/work/platform/platform/.nx/cache/cloud/2405.02.15.hotfix5/index.js:38:11066)
ciExecutionId: 9192644671-1
    at async anyFailuresInPromise (/home/runner/work/platform/platform/node_modules/nx/src/tasks-runner/run-command.js:227:26)
ciExecutionEnv: 
    at async invokeTasksRunner (/home/runner/work/platform/platform/node_modules/nx/src/tasks-runner/run-command.js:192:23)
runGroup: 9192644671-1
    at async /home/runner/work/platform/platform/node_modules/nx/src/tasks-runner/run-command.js:89:24
error: Main job terminated with an error: "CI execution '9192644671-1' failed. Reason: The Nx Cloud heartbeat process failed to report its status in time. Visit https://nx.dev/ci/recipes/troubleshooting/ci-execution-failed#the-nx-cloud-heartbeat-process-failed-to-report-its-status-in-time for more information."
    at async handleErrors (/home/runner/work/platform/platform/node_modules/nx/src/utils/params.js:10:16)

[Nx Cloud] Detected Env: GitHub Actions
    at async runCommand (/home/runner/work/platform/platform/node_modules/nx/src/tasks-runner/run-command.js:83:20)
    at async Object.affected (/home/runner/work/platform/platform/node_modules/nx/src/command-line/affected/affected.js:52:36)
Error: CI execution '9192644671-1' failed. Reason: The Nx Cloud heartbeat process failed to report its status in time. Visit https://nx.dev/ci/recipes/troubleshooting/ci-execution-failed#the-nx-cloud-heartbeat-process-failed-to-report-its-status-in-time for more information.
    at Tr.startV2 (/home/runner/work/platform/platform/.nx/cache/cloud/2405.02.15.hotfix5/index.js:38:5691)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async wB (/home/runner/work/platform/platform/.nx/cache/cloud/2405.02.15.hotfix5/index.js:38:10372)
    at async Object.EB (/home/runner/work/platform/platform/.nx/cache/cloud/2405.02.15.hotfix5/index.js:38:11066)
    at async anyFailuresInPromise (/home/runner/work/platform/platform/node_modules/nx/src/tasks-runner/run-command.js:227:26)
    at async invokeTasksRunner (/home/runner/work/platform/platform/node_modules/nx/src/tasks-runner/run-command.js:192:23)
    at async /home/runner/work/platform/platform/node_modules/nx/src/tasks-runner/run-command.js:89:24
    at async handleErrors (/home/runner/work/platform/platform/node_modules/nx/src/utils/params.js:10:16)
    at async runCommand (/home/runner/work/platform/platform/node_modules/nx/src/tasks-runner/run-command.js:83:20)
    at async Object.affected (/home/runner/work/platform/platform/node_modules/nx/src/command-line/affected/affected.js:52:36)

 NX   Unable to complete a run.

CI execution '9192644671-1' failed. Reason: The Nx Cloud heartbeat process failed to report its status in time. Visit https://nx.dev/ci/recipes/troubleshooting/ci-execution-failed#the-nx-cloud-heartbeat-process-failed-to-report-its-status-in-time for more information.

 NX   Completing with an error

ciExecutionId: 9192644671-1
ciExecutionEnv: 
runGroup: 9192644671-1
error: Main job terminated with an error: "CI execution '9192644671-1' failed. Reason: The Nx Cloud heartbeat process failed to report its status in time. Visit https://nx.dev/ci/recipes/troubleshooting/ci-execution-failed#the-nx-cloud-heartbeat-process-failed-to-report-its-status-in-time for more information."

Package Manager Version

yarn, 3.6.3

Operating System

Additional Information

I have verified that the issue is not present with Nx 18, and that it only appears when we upgrade to Nx 19.

meeroslav commented 1 month ago

Hi @eamon0989,

While this might not be directly related to your issue, your nx report shows duplicate packages - you have both @nx/* in version 19 and @nrwl/* in version 18 installed. This might be a sign that packages were updated manually at some point in the past.

Can you remove all the @nrwl/* packages and leave just nx and @nx/* packages? You should use yarn nx migrate latest whenever you need to migrate to the latest version. This not only bumps the package versions but also automatically runs any necessary migration scripts.

eamon0989 commented 3 weeks ago

An update for anyone that comes across this issue in the future.

First, the duplicate nx packages: we have always updated nx using npx nx migrate latest, never manually.

I've looked into the duplicate nx/nrwl dependencies, and they seem to come from the devDependencies of some of our dependencies. Our package.json contains only the latest versions of the Nx packages; the older, duplicate versions come from dependencies. I was not aware of this, but apparently it is the default behaviour of yarn, which we use as our package manager - see https://stackoverflow.com/questions/49530678/why-does-yarn-install-dev-dependencies-but-i-just-need-the-builds.
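
One way to check which dependency pulls in the older @nrwl/* packages is yarn why, which is available in Yarn 3. The sketch below wraps it in a CI step purely for illustration: the step name is made up, and the two package names are simply taken from the nx report above. Running the same commands locally works just as well.

```yaml
# Illustrative step only; it is not part of the workflow shown in this issue.
- name: Show which dependencies pull in legacy @nrwl packages
  run: |
    # Prints the packages that depend on the given package.
    yarn why @nrwl/devkit
    yarn why @nrwl/tao
```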

As for the heartbeat issue, I contacted Nx support and they recommended adding the flag --require-explicit-completion to nx-cloud start-ci-run. The CI still failed, and I was given this suggestion:

Could you please verify that in the full pipeline, you don't run any Nx targets before the npx nx-cloud start-ci-run --require-explicit-completion?

For example, I see that the main job needs [init, api-specs, regenerate]. If any of those other jobs are running Nx commands first, then they will create the nx-cloud run automatically first, without the --require-explicit-completion option. The fix for this is to make sure the npx nx-cloud start-ci-run --require-explicit-completion command happens before all Nx targets are run in the pipeline.

I commented out the jobs that ran before our main job and the CI passed. It turns out that one of our earlier jobs was using an Nx target, which caused the Nx Cloud run to be created. I'm guessing it was nx format:write.

TL;DR: the issue was caused by calling an Nx target in an earlier job, which started the Nx Cloud run before we expected it to be started.
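
A minimal sketch of that ordering, assuming a single main job: the job layout, the install step, and the choice of the lint target are illustrative, while the nx-cloud commands and the --require-explicit-completion flag are the ones mentioned in this thread. Any job listed under needs would also have to avoid running Nx targets. Whether stop-all-agents alone completes the run when --require-explicit-completion is set is not confirmed in this thread, so check the Nx Cloud documentation for that.

```yaml
# Sketch only: job name, runner, and install step are assumptions,
# not copied from the real workflow.
jobs:
  main:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install dependencies
        run: yarn install --immutable

      # Start the Nx Cloud CI run before any Nx target executes anywhere
      # in the pipeline, including in jobs that run before this one.
      - name: Start Nx Cloud CI run
        run: npx nx-cloud start-ci-run --require-explicit-completion

      # Nx targets are invoked only after the run has been started.
      - name: Lint affected projects
        run: npx nx affected --target=lint --parallel=4

      # Runs even if an earlier step failed, so agents are not left waiting.
      - name: Stop Nx agents
        if: always()
        run: npx nx-cloud stop-all-agents
```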

Thanks to nx support for helping me debug the issue!