[ CI ] Fan Out Strategy

robertgshaw2-neuralmagic commented 4 months ago

SUMMARY:

update nm-build-test workflow to run each test group on separate gpu
update nm-test to receive a specific test directory
convert nm-test-whl action to receive one test directory for test group
remove concept of skip lists
remove gptq models (marlin is too flaky)
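
As a rough sketch of the fan-out described above (not the actual workflow from this PR), the calling workflow could use a matrix so that each test directory becomes its own job. The directory names below are hypothetical placeholders, and the `test_directory` input name is an assumption:

```yaml
# Sketch of the TEST job in a caller workflow such as nm-build-test.
# Directory names are hypothetical; test_directory is an assumed input name.
jobs:
  TEST:
    strategy:
      fail-fast: false              # let the other test groups finish if one fails
      matrix:
        test_directory:
          - tests/group_a           # hypothetical test group
          - tests/group_b           # hypothetical test group
    uses: ./.github/workflows/nm-test.yml
    with:
      test_directory: ${{ matrix.test_directory }}
```

Because each matrix entry is a separate job, a failed group can be re-run on its own from the GitHub UI.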

ADVANTAGES:

much faster wall clock time for the test
ability to re-run just the failed jobs for spurious cases
possible to

DISADVANTAGES:

testmo tracking is a bit more complex --- since each test group is now a separate run
code coverage tracking is now very complex --- since no single run goes over all tests ---> mitigation: we could add a single code coverage run, scheduled ~weekly, that goes over all the tests

FOLLOW UP PR:

enable DISTRIBUTED
randomly assign various python versions
update the names of the TEST workflow somehow so that in the GH UI we can see more easily which test group failed

dbarbuzzi commented 4 months ago

Can we update the test job’s name property to include something dynamic that is relevant to that specific instance so they can be differentiated in the GitHub UI list (e.g., inputs.test_directory)? This has to happen at the job-level in the last workflow that is called (e.g., the TEST job in .github/workflows/nm-test.yml could have something like name: TEST (${{ inputs.test_directory }})).
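
A minimal sketch of what that could look like in the called workflow, assuming it declares a string input named `test_directory`. The runner label and the test step are placeholders, not the real contents of the file:

```yaml
# .github/workflows/nm-test.yml (fragment, illustrative only)
on:
  workflow_call:
    inputs:
      test_directory:               # assumed input name
        description: "test group directory to run"
        type: string
        required: true

jobs:
  TEST:
    # job-level name, so each instance is distinguishable in the GitHub UI
    name: TEST (${{ inputs.test_directory }})
    runs-on: [self-hosted]          # placeholder runner label
    steps:
      - uses: actions/checkout@v4
      - run: pytest ${{ inputs.test_directory }}   # stand-in for the real test step
```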

Also, they all have separate test runs in Testmo; is that the desired result, or would we want to maintain the previous behavior of having them consolidated into a single run? If using a single run, we could still submit results individually since we're already submitting results as threads, which is appropriate for the new approach.

robertgshaw2-neuralmagic commented 4 months ago

@dbarbuzzi

It would be better if these could all be part of a single run (and ideally if we could add the lm-eval tests to that run as well --- which are not currently tracked in testmo at all). Is this something you could take on?

I think we should do this as part of a separate PR though

neuralmagic / nm-vllm

[ CI ] Fan Out Strategy #325