qiboteam / qibotn

The tensor-network translation module for Qibo.
https://qibo.science
Apache License 2.0

Improve workflows for CUDA availability #28

Open alecandido opened 4 months ago

alecandido commented 4 months ago
          @liweintu the fix I proposed would work, but effectively it is the same as deleting the workflow file.

Here I have two alternatives:

  1. since not all tests require CUDA, just mark those for which it is necessary with @pytest.mark.cuda, and filter them in conftest.py, within pytest_runtest_setup, according to the value of the environment variable
  2. if you want to run this very same workflow file on a self-hosted runner, then you would have to do something more complex: I don't believe you can access the underlying env vars by default, since for GitHub runners there are Variables for that, and it is written explicitly in the docs for the env context

    It does not contain variables inherited by the runner process.

    However, you could run a first preliminary job, read the environment variable within bash code, and set it in the $GITHUB_ENV file; this should make it available in the env context (I'm not completely sure myself, I should test it)

Option 1. is much more reasonable to me, because it is more flexible (you could still run some tests everywhere), it is defined in Python, so you're already familiar with it (instead of looking around in the GitHub docs), and you can test it offline on various systems, without having to fiddle with GitHub Actions.
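
A minimal sketch of option 1., assuming the workflow exports a variable such as CUDA_AVAILABLE (the variable name is only illustrative, nothing in this thread fixes it), could look like this in conftest.py:

```python
# conftest.py -- sketch only; CUDA_AVAILABLE is an illustrative variable name
import os

import pytest


def pytest_configure(config):
    # register the custom marker so pytest does not warn about unknown marks
    config.addinivalue_line("markers", "cuda: test requires a CUDA-capable GPU")


def pytest_runtest_setup(item):
    # skip CUDA-marked tests when the environment reports no GPU
    if "cuda" in item.keywords and os.environ.get("CUDA_AVAILABLE", "0") != "1":
        pytest.skip("CUDA not available on this runner")
```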

Originally posted by @alecandido in https://github.com/qiboteam/qibotn/pull/24#pullrequestreview-1869833153

liweintu commented 4 months ago
  1. since not all tests require CUDA, just mark those for which it is necessary with @pytest.mark.cuda, and filter them in conftest.py, within pytest_runtest_setup, according to the value of the environment variable

Yup, this @pytest.mark.cuda marker sounds good to me. We can use it to cover CUDA-related pytest functions.
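
For example, a GPU-only test could be marked like this (the test body below is just a placeholder, not actual qibotn code):

```python
import pytest


@pytest.mark.cuda
def test_expectation_on_gpu():
    # placeholder body: a real test would exercise a qibotn GPU backend here
    cupy = pytest.importorskip("cupy")
    assert cupy.cuda.runtime.getDeviceCount() > 0
```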

  2. if you want to run this very same workflow file on a self-hosted runner, then you would have to do something more complex: I don't believe you can access the underlying env vars by default, since for GitHub runners there are Variables for that, and it is written explicitly in the docs for the env context

    It does not contain variables inherited by the runner process.

    However, you could run a first preliminary job, read the environment variable within bash code, and set it in the $GITHUB_ENV file; this should make it available in the env context (I'm not completely sure myself, I should test it)

Option 1. is much more reasonable to me, because it is more flexible (you could still run some tests everywhere), it is defined in Python, so you're already familiar with it (instead of looking around in the GitHub docs), and you can test it offline on various systems, without having to fiddle with GitHub Actions.

Originally posted by @alecandido in #24 (review)

As for option 2., the self-hosted runner is meant to cover the tests of the GPU backends. We are providing a GPU server (most likely an RTX 3090) as the self-hosted runner. Then we can use the preceding check job to do the preliminary check on CUDA availability. That part is already in the code; now the key is to set up the self-hosted runner properly.

alecandido commented 4 months ago

As for option 2., the self-hosted runner is meant to cover the tests of the GPU backends. We are providing a GPU server (most likely an RTX 3090) as the self-hosted runner. Then we can use the preceding check job to do the preliminary check on CUDA availability. That part is already in the code; now the key is to set up the self-hosted runner properly.

Concerning the check job, I believe it won't be needed any longer.

Every job should have a runs-on: key, specifying where to run. If you wish to self-host, the value should be self-hosted (or some other self-hosted label):
https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/using-self-hosted-runners-in-a-workflow#using-default-labels-to-route-jobs
https://github.com/qiboteam/qibolab/blob/d080141c9a4fd2160e5b159205eded3256333c75/.github/workflows/selfhosted.yml#L46

So, you can now add a matrix and run the exact same job on both runners. Using the check job, you could still determine the environment variable; but you want to run on both anyhow, because the CPU tests will run on GitHub, and CPU + GPU (or GPU only, in case we want to save resources) on the self-hosted runner. Thus, in this setup a check job is not very useful, because there are no jobs to filter at the workflow level if you can filter tests at the Pytest level (and you can, since in Pytest you'll be able to access the environment). Simply put, if you want to share the same job on both runners (and it would make sense), the mark should not be added in the workflow, but resolved in a Pytest hook.
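
As a possible variation (not something decided in this thread), the Pytest hook could even probe for a GPU directly, instead of reading a variable set by the workflow. A sketch, assuming cupy is the relevant GPU dependency:

```python
# conftest.py variant -- probe for CUDA at collection time instead of reading an env var
import pytest


def _cuda_available() -> bool:
    try:
        import cupy

        return cupy.cuda.runtime.getDeviceCount() > 0
    except Exception:
        return False


def pytest_collection_modifyitems(config, items):
    # attach a skip marker to every CUDA-marked test when no GPU is detected
    if _cuda_available():
        return
    skip_cuda = pytest.mark.skip(reason="CUDA not available on this runner")
    for item in items:
        if "cuda" in item.keywords:
            item.add_marker(skip_cuda)
```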

If you're instead worried about the actual build (that job is named like that by accident, propagating the content of workflows@v1, fixed in v2): it has a separate workflow, and it does not require a GPU. So it will keep building on GitHub, as it is doing right now :)

liweintu commented 4 months ago

Thus, in this setup a check job is not very useful, because there are no jobs to filter at the workflow level if you can filter tests at the Pytest level (and you can, since in Pytest you'll be able to access the environment). Simply put, if you want to share the same job on both runners (and it would make sense), the mark should not be added in the workflow, but resolved in a Pytest hook.

I see. So the key is to filter tests at the Pytest level instead of relying on a preceding check job. Fair enough.

If you're instead worried about the actual build (that job is named like that by accident, propagating the content of workflows@v1, fixed in v2): it has a separate workflow, and it does not require a GPU. So it will keep building on GitHub, as it is doing right now :)

Indeed, I was previously a bit worried about the build job involving GPU packages, because that's where it failed in CI. The check job was mostly created to work around this. Thanks for the fix in v2.