lemurheavy / coveralls-public

The public issue tracker for coveralls.io
http://coveralls.io

Parallel builds from CircleCI don't aggregate correctly #1341

Closed danadaldos closed 4 years ago

danadaldos commented 5 years ago

Context:

As I understand it, from looking at documentation and replies to other issues, Coveralls.io has three requirements in order to correctly record and aggregate parallel builds (please correct me if I'm wrong):

1) JSON data for submitted jobs needs to have parallel: true set, either via the ENV var COVERALLS_PARALLEL=true or via mix coveralls.circle --parallel (they accomplish the same thing). This suspends the final analysis until the webhook arrives.

2) Incoming jobs must share the same service_number. As long as they come in marked "parallel", Coveralls labels them with the shared build ID and a decimal suffix showing each job's place (e.g. 21989.3, 21989.4).

3) A POST to the webhook, with the Repo Token, to signal that the build is finished. The docs recommend https://coveralls.io/webhook repo_token=$COVERALLS_REPO_TOKEN; other sources recommend https://coveralls.io/webhook?repo_token=$COVERALLS_REPO_TOKEN -d "payload[build_num]=$BUILD_NUMBER&payload[status]=done" to explicitly send the status and the build number.
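
As a rough illustration of requirement 3, here is a minimal Python sketch that builds (but does not send) the "done" webhook request in the explicit form quoted above. The function name build_done_webhook and the placeholder token and build number are hypothetical; the URL and the payload[build_num] / payload[status] fields are taken from this comment, not from verified Coveralls documentation.

```python
from urllib.parse import urlencode
from urllib.request import Request

WEBHOOK_URL = "https://coveralls.io/webhook"


def build_done_webhook(repo_token: str, build_num: str) -> Request:
    """Build (but do not send) the POST that signals a parallel build is done."""
    # The repo token travels in the query string, as in the URLs quoted above.
    url = f"{WEBHOOK_URL}?{urlencode({'repo_token': repo_token})}"
    # Form-encoded body, the counterpart of curl's
    #   -d "payload[build_num]=...&payload[status]=done"
    # urlencode percent-encodes the square brackets; a form parser decodes
    # them back to the same field names.
    body = urlencode({"payload[build_num]": build_num, "payload[status]": "done"})
    return Request(url, data=body.encode("ascii"), method="POST")


if __name__ == "__main__":
    req = build_done_webhook("example-token", "21989")  # placeholder values
    print(req.full_url)
    print(req.data.decode("ascii"))
    # To actually send it, pass req to urllib.request.urlopen.
```

Building the request separately from sending it makes it easy to inspect exactly which build number and status would be posted before involving the CI system.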

Description:

I have parallelization on CircleCI building correctly with 4 containers and reporting to Coveralls.io via the excoveralls library.

I have set the --parallel flag when excoveralls runs which correctly adds the parallel: true param to the JSON (see below). I also have the COVERALLS_PARALLEL=true set in various places just to be sure.

As the build runs, I see jobs reporting with the expected 22043.1, 22043.2 designations, but these are then replaced by the later job 22043.3, and finally by 22043.4, which is the final job and the only one that remains on the build. The results do not aggregate correctly and we see a massive drop in coverage compared to master. Each container on CircleCI ends with Successfully uploaded the report to 'https://coveralls.io'.

A sanity check with 1 container (parallelization: 1) on CircleCI showed that the splitting and building were working correctly at ~93% coverage: https://coveralls.io/builds/25348343

I have tried a number of different webhook calls, including the documented:

notify:
  webhooks:
    - url: https://coveralls.io/webhook?repo_token=$COVERALLS_REPO_TOKEN

I have also tried one that explicitly includes the done status. Note that [build_num] has been replaced with [service_number] here; I have tried both:

notify:
  webhooks:
    - url: https://coveralls.io/webhook?repo_token=$COVERALLS_REPO_TOKEN -d "payload[service_number]=$CIRCLE_BUILD_NUMBER&payload[status]=done"

Neither of these, nor manual calls to curl -k https://coveralls.io/webhook... from the terminal, has made the resulting build aggregate correctly. Posting to either URL manually from the terminal returned {"done":true}%, which tells me that the webhook call itself is working (other variations resulted in errors).

Note: I am not using workflows, the "service_job_id" and the "service_number" in the JSON payload are the same number, namely the $CIRCLE_BUILD_NUMBER (see JSON below).

JSON:

{"git":{"branch":"dd-parallelize-ci-test","head":{"committer_name":"danadaldos","id":"f34c07b9920025405bb4ec0ed48f50a00e4c3158","message":"Try relying on manual webhook call"}},"parallel":true,"repo_token":"<CORRECT REPO TOKEN REDACTED>","service_job_id":"22043","service_name":"circle-ci","service_number":"22043","service_pull_request":null,"source_files":[{"coverage": ...

Screenshots:

Build 22043 showing first two jobs:

[Screenshot: Coveralls build page for joydrive/joydrive, build 22043]

Same build showing only the final job and skewed results:

[Screenshot: Coveralls build page for joydrive/joydrive, build 22043]

Related Issues:

https://github.com/lemurheavy/coveralls-public/issues/1191
https://github.com/lemurheavy/coveralls-public/issues/1178
https://github.com/lemurheavy/coveralls-public/issues/1093

kelvintyb commented 4 years ago

Will this be looked at? I believe it's blocking most people who use parallel runs on CircleCI. @afinetooth

afinetooth commented 4 years ago

@kelvintyb this is being looked at. Team is aware, and I will try to reproduce to gain further insight. No ETA yet, but will feed back asap.

nickmerwin commented 4 years ago

Hi @kelvintyb and @danadaldos, could you please post your .circleci/config.yml so we can better understand your setup?

If you'd prefer not to post here publicly, you could email it to us at support@coveralls.io

danadaldos commented 4 years ago

@nickmerwin @afinetooth

These are my current setup files. They are currently NOT CONFIGURED TO USE COVERALLS. See the steps listed below for the changes that I made in order to configure Coveralls within this setup.

.circleci/config.yml: https://gist.github.com/danadaldos/41bf98fe8bafac177fbfe7243bcc2545

test script: https://gist.github.com/danadaldos/ee84345672d07caed447fdad66da61e1

They have changed somewhat since I posted this issue 8 months ago. Namely, now we are using CircleCI workflows, and when we made that change and instituted parallelization on CircleCI, I tried reinstating Coveralls.io (this was about a month ago). I made the following changes:

Configuration Steps

1) I changed our test script to run: mix coveralls.circle --parallel ${TESTFILES}
Note: When you pass the --parallel flag, the ExCoveralls library adds "parallel: true" to the JSON that each container sends.

2) I added COVERALLS_PARALLEL: true to the docker environment in our .circleci/config.yml:

     docker:
       - image: circleci/elixir:1.8.2-browsers
         environment:
           COVERALLS_PARALLEL: true

I have confirmed via SSH into Circle builds that this is set correctly in the environment.

3) I added a final step in .circleci/config.yml:

  webhooks:
    - url: https://coveralls.io/webhook?repo_token=$COVERALLS_REPO_TOKEN -d "payload[build_num]=$CIRCLE_WORKFLOW_ID&payload[status]=done"

I have tried calling this in various ways and from various places, including from a separate job in CircleCI that posts the correct payload[build_num]=$CIRCLE_WORKFLOW_ID, as well as calling it manually from my terminal once all test containers have finished. I have also tried other IDs; CIRCLE_WORKFLOW_ID is the one shared by all containers.


I am reluctant to post any of the specific .circleci/config.yml files that I used, because I have tried tweaking literally every variable I could think of to get Coveralls to recognize all of the containers in a build. If any of the steps I listed above are wrong, please let me know.

nickmerwin commented 4 years ago

@danadaldos I believe this may be the issue:

> Note: I am not using workflows, the "service_job_id" and the "service_number" in the JSON payload are the same number, namely the $CIRCLE_BUILD_NUMBER (see JSON below).

The excoveralls lib uses the CIRCLE_BUILD_NUM environment variable for the Job ID:

  defp get_job_id do
    # When using workflows, each job has a separate `CIRCLE_BUILD_NUM`, so this needs to be used as the Job ID and not
    # the Job Number.
    System.get_env("CIRCLE_BUILD_NUM")
  end

https://github.com/parroty/excoveralls/blob/master/lib/excoveralls/circle.ex#L65

It's expecting to be run within a workflow so that this is unique per parallel job.

Here's how our Ruby library handles it:

config[:service_job_number]   = ENV['CIRCLE_NODE_INDEX']

https://github.com/lemurheavy/coveralls-ruby/blob/master/lib/coveralls/configuration.rb#L66

Otherwise, Coveralls thinks the jobs are duplicates, since the Build number and Job ID match, and removes them. That is why you see multiple jobs in our UI at first, and then they're removed shortly after.

The Elixir lib may need to be updated to also use CIRCLE_NODE_INDEX, so that it supports this non-workflow Circle parallel setup.

E.g.:

  defp get_job_id do
    "#{System.get_env("CIRCLE_BUILD_NUM")}-#{System.get_env("CIRCLE_NODE_INDEX")}"
  end
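
To make the duplicate problem concrete, here is a minimal Python sketch of the idea described above. It is a toy model, not Coveralls' actual server logic: the surviving_jobs function and its "one job per (service_number, service_job_id) pair, later job wins" rule are assumptions based only on the behaviour reported in this thread, and job_id is a hypothetical Python counterpart of the suggested get_job_id change (the fallback for a missing CIRCLE_NODE_INDEX is an addition for illustration).

```python
def job_id(env):
    """Hypothetical job ID: build number plus the parallel container index."""
    build_num = env["CIRCLE_BUILD_NUM"]
    node_index = env.get("CIRCLE_NODE_INDEX")
    # Assumption: with no node index set, fall back to the build number alone.
    return build_num if node_index is None else f"{build_num}-{node_index}"


def surviving_jobs(jobs):
    """Toy model of culling: keep one job per (service_number, service_job_id)."""
    kept = {}
    for job in jobs:
        # A later job with the same key replaces the earlier one.
        kept[(job["service_number"], job["service_job_id"])] = job
    return list(kept.values())


# Four parallel containers of one non-workflow build share CIRCLE_BUILD_NUM.
containers = [
    {"CIRCLE_BUILD_NUM": "22043", "CIRCLE_NODE_INDEX": str(i)} for i in range(4)
]

# Job ID taken from the build number only: every job looks identical.
same_id = [
    {"service_number": "22043", "service_job_id": c["CIRCLE_BUILD_NUM"]}
    for c in containers
]
# Job ID that also includes the node index: every job is distinct.
unique_id = [
    {"service_number": "22043", "service_job_id": job_id(c)} for c in containers
]

print(len(surviving_jobs(same_id)))    # 1
print(len(surviving_jobs(unique_id)))  # 4
```

With identical IDs only the last-reported job remains, which matches the symptom of earlier jobs disappearing from the build page.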

danadaldos commented 4 years ago

@nickmerwin Yes, please see my most recent comment. I updated the information. We are using workflows now and I am sending "payload[build_num]=$CIRCLE_WORKFLOW_ID&payload[status]=done" to Coveralls.

nickmerwin commented 4 years ago

Thanks @danadaldos, can you link me to your most recent test build using the new workflows setup?

danadaldos commented 4 years ago

@nickmerwin Here is our most recent build on Coveralls.io: https://coveralls.io/builds/29361557

When we run Coveralls locally, we're at ~94% coverage.

To be clear, this build was set up exactly as I mentioned: COVERALLS_PARALLEL: true, mix coveralls.circle --parallel, `webhooks:

nickmerwin commented 4 years ago

Thanks @danadaldos, could you SSH into a build and confirm that CIRCLE_WORKFLOW_WORKSPACE_ID is being set?

It appears that CIRCLE_BUILD_NUM is ...5DF4D28A52B7 and is coming over to Coveralls as both the service_number and service_job_id, which is why 6 out of the 7 jobs are considered duplicates and are being culled.

danadaldos commented 4 years ago

@nickmerwin Yeah, it will take me a minute to get things reconfigured for Coveralls.

danadaldos commented 4 years ago

@nickmerwin I'm sorry, I was giving you wrong information. I actually did get this to build correctly. The issue I'm having now is that it takes 30 minutes to do so, whereas our test suite on CircleCI normally finishes in 5 minutes. Here is a link (https://coveralls.io/builds/29358320) to a build from around the same time that built correctly with all containers. You can see on that build that the Job ID is 4c383bb7-6b4b-4369-a459-0ed7e4d9bfe2.40, with the .40 indicating the number of containers.

Any insight as to why it takes so long? I have a build currently running on CircleCI that I will link to once it reports to Coveralls.

danadaldos commented 4 years ago

And just to be thorough, my working setup is:

1) Use mix coveralls.circle --parallel ${TESTFILES} (same as above)

2) Set COVERALLS_PARALLEL: true in the docker env in CircleCI (same as above)

3) Report finished workflow via a job in CircleCI, which is the same approach they use in the orb: https://circleci.com/orbs/registry/orb/coveralls/coveralls

...
  notify_coveralls:
    docker:
      - image: circleci/elixir:1.8.2-browsers
        environment:
          COVERALLS_PARALLEL: true
    steps:
      - run: |
          curl "https://coveralls.io/webhook?repo_token=$COVERALLS_REPO_TOKEN" \
            -d "payload[build_num]=$CIRCLE_WORKFLOW_ID&payload[status]=done"
          exit 0

workflows:
  build_and_test:
    jobs:
      - run_credo
      - run_tests
      - notify_coveralls:
            requires:
              - run_tests

These are the only changes between a 5 minute build and a 30 minute build with Coveralls.

danadaldos commented 4 years ago

@nickmerwin

Most recent job finished after 40 minutes: https://coveralls.io/builds/29978345

Here's a screenshot of our workflows for comparison's sake:

[Screenshot: CircleCI Pipelines view for joydrive/joydrive]

nickmerwin commented 4 years ago

@danadaldos I checked the calculation time for that build on our side and it was only 2.2 seconds after the webhook came in. I suspect Circle may have queued up the webhook for those 40 minutes. Perhaps you could add another webhook receiver like https://requestbin.com to confirm the delay.

We keep metrics on how quickly our background processors dequeue jobs here:

https://status.coveralls.io


Since the original issue of parallel coverage run merging is resolved, I'm closing the issue for now, but will monitor the thread for any other questions that arise.

Thank you!

danadaldos commented 4 years ago

@nickmerwin Thank you for following up on this. It's extremely helpful to know how long it took on your end, and now I can take that information to CircleCI to see what's up. Thank you again!

Update - I did try sending the webhook to Requestbin.com and there was no delay. The CI job finished in ~5 minutes and Requestbin received the message right when it finished.

Finally, if I still have your ear @nickmerwin, please, please address the documentation found here (https://docs.coveralls.io/parallel-build-webhook). I am fairly certain that the webhook listed for CircleCI is flat-out wrong. CircleCI must have changed its implementation since that was written, because if you don't explicitly include the payload with the build number/workflow ID, Coveralls doesn't report anything.