openssl / project

[Epic] Improve CI for openssl repo #853

Open quarckster opened 1 month ago

quarckster commented 1 month ago

The OpenSSL repository has a lot of GitHub Actions workflows that run on various events. Each PR starts 80 jobs, and it takes around 30 minutes for all of them to finish. OpenSSL CI must be divided into two stages. The first stage is smoke testing: we check that the code compiles, and that's it. This will significantly improve the developer experience and prevent the situation where a developer has to wait too long for the next iteration. When the PR is ready for the second stage, the rest of the OpenSSL CI checks will be started on demand. The second stage can be triggered by submitting a specific comment or by using an environment to trigger the jobs manually.
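
For illustration, a minimal sketch of what a comment-triggered second stage could look like (the /run-full-ci phrase and the job body are purely illustrative, not an existing OpenSSL workflow):

name: Full CI (second stage)

on:
  issue_comment:
    types: [created]

permissions:
  contents: read

jobs:
  full_tests:
    # react only to comments on pull requests that contain the trigger phrase
    if: github.event.issue.pull_request && contains(github.event.comment.body, '/run-full-ci')
    runs-on: ubuntu-latest
    steps:
      # note: on issue_comment events, checkout defaults to the default branch;
      # resolving the PR head ref needs an extra lookup step, omitted here
      - uses: actions/checkout@v4
      - name: config
        run: ./config --banner=Configured --strict-warnings && perl configdata.pm --dump
      - name: make
        run: make -s -j4
      - name: make test
        run: make test HARNESS_JOBS=4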

quarckster commented 1 month ago

cc @nhorman @t8m @mattcaswell @levitte

nhorman commented 1 month ago

Generally speaking I like the idea. The big concern I have is the trade-off between running multiple staged jobs and the artifact creation/restoration that it entails. Currently the general workflow we have for a given job is 2(-ish) steps:

1) Compile with a given configuration
2) Run tests

Which is nice and simple. If we were to stage these tasks into separate jobs (so that we could see compilation results prior to test results):

  1. Compilation
     a. Compile with a given configuration
     b. Archive the configuration as an artifact
  2. Testing (conditional on (1) completing successfully)
     a. Restore the archived configuration from artifacts
     b. Test the built artifacts

Which is fine, but it creates extra overhead in the frequent archive/restore operations. I have no idea if that's a real problem (from either a performance or a financial point of view, assuming that archiving artifacts adds to our CI bill), but it's something I noticed.

If the answer to the above question is "no, it's not a problem", then this seems like a very reasonable modification.

Just to offer an alternative approach, we could tier our jobs - i.e. PRs in draft state only do smoke testing, while non-draft PRs get the full build/test treatment. If the above is a potential issue, then this might be a reasonable alternative to get faster development cycles for code, while still preserving our full PR runs.
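
A minimal sketch of that draft-based tiering, with illustrative job names (not the actual workflow layout):

on:
  pull_request:
    # ready_for_review is needed so the full CI starts when a draft is promoted
    types: [opened, synchronize, reopened, ready_for_review]

jobs:
  smoke_test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: config and build only
        run: ./config --banner=Configured --strict-warnings && make -s -j4

  full_tests:
    # skipped while the PR is still in draft state
    if: ${{ !github.event.pull_request.draft }}
    needs: smoke_test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: config, build and test
        run: ./config --banner=Configured --strict-warnings && make -s -j4 && make test HARNESS_JOBS=4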

beldmit commented 1 month ago

Adding an extra manual step doesn't look like a good idea to me. If the tests could be run after the smoke tests, it would be better.

t8m commented 1 month ago

Adding an extra manual step doesn't look like a good idea to me. If the tests could be run after the smoke tests, it would be better.

Yeah, I completely agree with you. I would not be against somehow reorganizing the CI, and staging the full tests to run only after some smoke tests pass would be nice. However, this needs to be an automated process not requiring manual intervention.

nhorman commented 1 month ago

I'm not sure what about the above proposals is manual?

In the stage jobs case, it's all automated, just split into two serialized jobs that pass artifacts between them.

In the latter case, I'm just talking about limiting job execution based on whether a PR is in draft state or not.

beldmit commented 1 month ago

This wording

The second stage can be triggered by submitting a specific comment

looks like it assumes manual intervention to me. Sorry if I'm wrong.

t8m commented 1 month ago

We were talking about the initial proposal by @quarckster.

Your proposal @nhorman is fine in this regard.

quarckster commented 1 month ago

Changing the PR state is also a manual action, and it will require a change to the current workflow. I'm not against it, but I guess it will require a consensus. Running the second stage automatically in some cases won't give a speed benefit. Maybe we could limit the concurrency within the PR rather than wait until the CI finishes.
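
One common form of this is a per-PR concurrency group (a sketch; the group name is illustrative) that cancels a still-running CI run when a new iteration is pushed instead of queueing behind it:

concurrency:
  group: ci-${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true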

baentsch commented 1 month ago

From the perspective of an external contributor, the full test set/wait time is indeed very long. Having (the possibility to trigger) quick CI smoke tests ("what's causing trouble in most PRs") would be great. Manual intervention as to the level of CI performed could be expected of the PR authors, I'd argue: beyond the difference between requesting a full test for "Ready for Review" PRs and (a smaller test suite for) draft PRs, what about the idea of deriving the amount of testing from the latest commit message? E.g. "[skip ci]" for documentation-only changes or PRs not intended for immediate integration ("long-term draft PRs" or "discussion PRs"), or "[smoke test]" for people aware of the (also environmental) costs that unnecessary complete CI runs cause? Full testing should of course be done if a PR passes the initial tests and the author wants it merged to "master" -- but then they can also "pay the waiting price".
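
As a sketch of the commit-message idea (the "[smoke test]" marker is hypothetical; "[skip ci]" in a pushed commit message is already honoured by GitHub Actions itself):

jobs:
  full_tests:
    # on push events, skip the full suite when the last commit message opts into a smoke-test-only run;
    # github.event.head_commit is only populated for push events
    if: github.event_name != 'push' || !contains(github.event.head_commit.message, '[smoke test]')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: config, build and test
        run: ./config --banner=Configured --strict-warnings && make -s -j4 && make test HARNESS_JOBS=4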

What I have done so far before opening a PR is to run the make tests locally to cut down the CI wait time (that only works because I have a nice number of otherwise idling CPUs on my laptop, though :). If that's not suitable or enough, what about a make target "pr-test" (the equivalent of the smoke tests proposed, surely also triggering the code formatting tests)? That may also cut down on unnecessary CI (and wait time). Running it (and passing, prior to the PR) should surely be documented in https://github.com/openssl/openssl/blob/master/CONTRIBUTING.md

nhorman commented 1 month ago

We could do something like this (a simplified version of the basic-gcc test in our CI):

# Copyright 2021-2024 The OpenSSL Project Authors. All Rights Reserved.
#
# Licensed under the Apache License 2.0 (the "License").  You may not use
# this file except in compliance with the License.  You can obtain a copy
# in the file LICENSE in the source distribution or at
# https://www.openssl.org/source/license.html

name: GitHub CI

on: [push]

# for some reason, this does not work:
# variables:
#   BUILDOPTS: "-j4"
#   HARNESS_JOBS: "${HARNESS_JOBS:-4}"

# for some reason, this does not work:
# before_script:
#     - make="make -s"

permissions:
  contents: read

env:
  OSSL_RUN_CI_TESTS: 1

jobs:
  basic_gcc:
    runs-on: ${{ github.server_url == 'https://github.com' && 'ubuntu-latest' || 'ubuntu-22.04-self-hosted' }}
    outputs:
      global_run_id: ${{ steps.capture-job-id.outputs.GLOBAL_RUN_ID }}
    steps:
    - uses: actions/checkout@v4
    - name: checkout fuzz/corpora submodule
      run: git submodule update --init --depth 1 fuzz/corpora
    - name: localegen
      run: sudo locale-gen tr_TR.UTF-8
    - name: fipsvendor
      # Make one fips build use a customized FIPS vendor
      run: echo "FIPS_VENDOR=CI" >> VERSION.dat
    - name: capture job id
      id: capture-job-id
      run: |
        echo "GLOBAL_RUN_ID=${{ github.run_id }}" >> $GITHUB_OUTPUT
    - name: config
      # enable-quic is on by default, but we leave it here to check we're testing the explicit enable somewhere
      run: CC=gcc ./config --banner=Configured enable-demos enable-h3demo enable-fips enable-quic --strict-warnings && perl configdata.pm --dump
    - name: make
      run: make -s -j4
    - name: tar build tree
      run: |
        tar -cvf basic-gcc-build.tgz .
    - name: upload build tree
      uses: actions/upload-artifact@v4
      with:
        name: basic-gcc-build
        path: ./basic-gcc-build.tgz
        retention-days: 1

  basic_gcc_tests:
    needs: basic_gcc
    runs-on: ${{ github.server_url == 'https://github.com' && 'ubuntu-latest' || 'ubuntu-22.04-self-hosted' }}
    env:
      GLOBAL_RUN_ID: ${{ needs.basic_gcc.outputs.global_run_id }}
    steps:
    - name: download build tree
      uses: actions/download-artifact@v4
      with:
        name: basic-gcc-build
        path: .
        run-id: ${{ env.GLOBAL_RUN_ID }}
    - name: extract artifacts
      run: |
        tar xvf basic-gcc-build.tgz
    - name: get cpu info
      run: |
        ls -l ./util/opensslwrap.sh
        chmod 755 util/opensslwrap.sh
        cat /proc/cpuinfo
        ./util/opensslwrap.sh version -c
    - name: make test
      run: .github/workflows/make-test
    - name: check fipsvendor
      run: |
        util/wrap.pl -fips apps/openssl list -providers | grep 'name: CI FIPS Provider for OpenSSL$'
    - name: save artifacts
      uses: actions/upload-artifact@v3
      with:
        name: "ci@basic-gcc"
        path: artifacts.tar.gz
    - name: delete build artifacts
      if: always()
      uses: geekyeggo/delete-artifact@v5
      with:
        name: basic-gcc-build
        failOnError: false

It splits the build and test into two jobs: a build job and a test job, with the latter serialized behind the former.

Advantages:

  1. Accelerates time to get 'smoke-test' results from just the compile. Whereas the combined build/test job completes in about 5m42s, the split build job finishes the compile in 4m1s.

Disadvantages:

  1. Requires extra work to coordinate jobs. The build job needs to tar/archive the build tree, and then the test job needs to download, untar, and delete the artifact.

Honestly, I'm kind of surprised; I would have thought the build step would have run much more quickly and saved us more time. I'm going to look at it further to see if there's anything we can do to save some cycles during the build.

That said, thoughts? Is this a worthwhile effort to undertake?

nhorman commented 1 month ago

If I add a ccache action to the build job, I can reduce the build time from 4m to 1m30s. That's a good improvement, but I'm not sure we want to use ccache in all cases. Perhaps only if the source branch isn't master or a release branch?
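
For reference, a sketch of what that could look like in the build job, assuming the third-party hendrikmuhs/ccache-action (the cache key is illustrative):

    - name: setup ccache
      uses: hendrikmuhs/ccache-action@v1.2
      with:
        key: basic-gcc
    - name: config
      # route compilation through ccache so repeated builds hit the cache
      run: CC="ccache gcc" ./config --banner=Configured --strict-warnings && perl configdata.pm --dump
    - name: make
      run: make -s -j4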

t8m commented 1 month ago

If I add a ccache action to the build job, I can reduce the build time from 4m to 1m30s. That's a good improvement, but I'm not sure we want to use ccache in all cases. Perhaps only if the source branch isn't master or a release branch?

What would be the reason to not use ccache always?

nhorman commented 1 month ago

My only reasoning would be one of safety, i.e. using cached objects is clearly fine for general build testing, but we might want a full, non-cached build prior to merge. That might just be paranoia on my part, but I figured it was worth asking the question. If there is general consensus on using ccache always, I'm OK with that.

t8m commented 1 month ago

My only reasoning would be one of safety, i.e. using cached objects is clearly fine for general build testing, but we might want a full, non-cached build prior to merge. That might just be paranoia on my part, but I figured it was worth asking the question. If there is general consensus on using ccache always, I'm OK with that.

We could keep not using ccache in the on-push CI jobs.
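
That could be expressed directly in the workflow, e.g. (a sketch building on the hypothetical ccache step above):

    - name: setup ccache
      # only install and wire up ccache for pull request runs
      if: github.event_name == 'pull_request'
      uses: hendrikmuhs/ccache-action@v1.2
      with:
        key: basic-gcc
    - name: config
      # fall back to a plain, uncached compiler on push events
      run: CC="${{ github.event_name == 'pull_request' && 'ccache gcc' || 'gcc' }}" ./config --banner=Configured --strict-warnings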

nhorman commented 1 month ago

That would be sufficient for my concerns, I think.

paulidale commented 1 month ago

I don't remember ever having a problem with ccache.

We shouldn't have a manual step to finish the CI process; it needs to be automatic. I don't know how easily this can be done across .yml files; it's not difficult within one.
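
One way to chain across files automatically is the workflow_run trigger; a sketch (the second-stage file and job names are illustrative):

# full-ci.yml
name: Full CI (second stage)

on:
  workflow_run:
    workflows: ["GitHub CI"]
    types: [completed]

permissions:
  contents: read

jobs:
  full_tests:
    # run only when the first-stage workflow succeeded
    if: github.event.workflow_run.conclusion == 'success'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # check out the commit the first stage actually tested
          ref: ${{ github.event.workflow_run.head_sha }}
      - name: config, build and test
        run: ./config --banner=Configured --strict-warnings && make -s -j4 && make test HARNESS_JOBS=4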

I'd suggest we identify a number of representative compiles that commonly find problems and run them rather than just one. E.g. some of the cross compiles pick up coding issues that the other jobs miss (m68k is a good one for this), NO_EC, NO_CMS likewise. The cross compiles don't run the tests on PRs (they run on push) and I'd retain that.
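
A sketch of how a small set of representative configurations could be run as the smoke stage (the selection and names below are only examples; a cross compile such as m68k would additionally need the cross toolchain set up):

jobs:
  smoke_test:
    strategy:
      fail-fast: false
      matrix:
        config:
          - { name: basic-gcc, opts: "--strict-warnings" }
          - { name: no-ec,     opts: "no-ec --strict-warnings" }
          - { name: no-cms,    opts: "no-cms --strict-warnings" }
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: config (${{ matrix.config.name }})
        run: ./config --banner=Configured ${{ matrix.config.opts }} && perl configdata.pm --dump
      - name: make
        run: make -s -j4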

t8m commented 1 month ago

I'd suggest we identify a number of representative compiles that commonly find problems and run them rather than just one. E.g. some of the cross compiles pick up coding issues that the other jobs miss (m68k is a good one for this), NO_EC, NO_CMS likewise. The cross compiles don't run the tests on PRs (they run on push) and I'd retain that.

The cross compiles run a few tests on PRs (namely the evp tests), and these test runs were pretty useful a few times.

paulidale commented 4 weeks ago

Does openssl/openssl#25638 fit in here?

It attempts to run some of the FIPS cross-version compatibility checks at PR time rather than merge time.