teemtee / tmt

Test Management Tool
MIT License

Hardware requirements for a GPU #2154

Closed thrix closed 1 year ago

thrix commented 1 year ago

We have a request to be able to test against AWS instances with a GPU:

https://discussion.fedoraproject.org/t/setting-up-fedora-ci-for-rocm/84373/11

It seems that, to start, it would be enough to ask for HW with a dedicated GPU, and maybe to say whether it is NVIDIA or Intel.

Any more ideas are welcome.

Mystro256 commented 1 year ago

From the thread, for ROCm testing I would need either a vega or navi HW. Anything before that is really buggy and not great for testing. I doubt you have MI HW in AWS, but those are based on vega/navi and are designed for ROCm, so those would obviously work too.

happz commented 1 year ago

How would one look for "a vega or navi HW"? As a GPU noob, I presume it could boil down to something like a "graphics card" name, model names, or vendors, something similar to the current CPU requirement specs: https://tmt.readthedocs.io/en/stable/spec/hardware.html#cpu.

00:1e.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 12 [Radeon Pro V520/V540] (rev c3)

A bit of brainstorming: a "vendor" seems to be something to recognize, "Radeon Pro V520/V540" feels like cpu.model-name, and "Navi" is supposed to be a "code name". The architecture is called something different, so there is probably no point trying to fit it into the CPU's "family" or "family-name".

gpu:
    # Probably not precise, might end up way too verbose
    model-name: "~ Radeon Pro .+"
    # Sure, this should be cheap to support, even though it's fairly useless on its own
    vendor: AMD
    # "I would need either a vega or navi HW"
    arch: "~ vega|navi"

It needs to be configurable for instance-type-based providers like AWS or OpenStack. We can already extract CPU info from AWS EC2 describe-instance-types and route model-name: Graviton3 to the right set of instance types, and different instance types may easily share a vendor but not a model-name. Beaker also needs to expose GPU info; we could then create a filter to match it, even if it would be merged from these distinct keys (I bet it does expose the info, but I don't recall the right XML element for the filter).
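For the instance-type-based providers, the routing could look roughly like the sketch below. It filters a list shaped like the `InstanceTypes` entries of the EC2 describe-instance-types output, using the `GpuInfo.Gpus[].Manufacturer` and `Name` fields. The sample data is illustrative only (not captured API output), and the function name `instance_types_with_gpu` is hypothetical.

```python
import re

# Illustrative sample shaped like the "InstanceTypes" list of
# describe-instance-types; values are placeholders for this sketch.
SAMPLE_INSTANCE_TYPES = [
    {
        "InstanceType": "g4ad.xlarge",
        "GpuInfo": {"Gpus": [{"Name": "Radeon Pro V520", "Manufacturer": "AMD", "Count": 1}]},
    },
    {
        "InstanceType": "g4dn.xlarge",
        "GpuInfo": {"Gpus": [{"Name": "T4", "Manufacturer": "NVIDIA", "Count": 1}]},
    },
    {"InstanceType": "m5.large"},  # no GpuInfo: instance type without a GPU
]


def instance_types_with_gpu(instance_types, vendor, name_pattern):
    """Return names of instance types with a GPU of the given vendor and model."""
    pattern = re.compile(name_pattern)
    selected = []
    for info in instance_types:
        gpus = info.get("GpuInfo", {}).get("Gpus", [])
        if any(
            gpu.get("Manufacturer") == vendor and pattern.search(gpu.get("Name", ""))
            for gpu in gpus
        ):
            selected.append(info["InstanceType"])
    return selected


print(instance_types_with_gpu(SAMPLE_INSTANCE_TYPES, "AMD", r"Radeon Pro .+"))
# ['g4ad.xlarge']
```

A vendor-only request would pass a catch-all pattern such as `.*`, which shows why vendor alone is cheap to support but rarely narrows things down enough.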

Mystro256 commented 1 year ago

We could just filter for vendor AMD and a model name containing Radeon, as I doubt much of the older HW is floating around these days. If it's not as easy as grepping lspci, then in the worst case I could set up some complex regex for getting the model names that are applicable to the test that I'm doing, i.e. a whitelist of models that would work for the test.

Does that seem feasible?
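As a rough illustration of the "grep lspci plus a whitelist" idea, the sketch below parses plain `lspci` lines, keeps only GPU-like device classes, and applies an allowlist regex to the device description. The allowlist pattern, the sample host bridge line, and the helper name `usable_gpus` are assumptions made for this example, not a vetted list of ROCm-capable models.

```python
import re

# Matches lines such as
# "00:1e.0 Display controller: <vendor and device> (rev c3)".
LSPCI_LINE = re.compile(
    r"^(?P<slot>\S+)\s+(?P<device_class>[^:]+):\s+(?P<description>.*?)"
    r"(?:\s+\(rev (?P<revision>[0-9a-fA-F]+)\))?$"
)

# Device classes under which lspci usually lists GPUs.
GPU_CLASSES = {"VGA compatible controller", "3D controller", "Display controller"}


def usable_gpus(lspci_output: str, allowlist: str) -> list:
    """Return descriptions of GPU devices whose description matches the allowlist."""
    allowed = re.compile(allowlist)
    found = []
    for line in lspci_output.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match is None or match["device_class"] not in GPU_CLASSES:
            continue
        if allowed.search(match["description"]):
            found.append(match["description"])
    return found


SAMPLE = """\
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
00:1e.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 12 [Radeon Pro V520/V540] (rev c3)
"""

# Hypothetical allowlist: any "Radeon Pro" model.
print(usable_gpus(SAMPLE, r"Radeon Pro"))
```

With the sample above, only the Navi 12 display controller is returned; the host bridge is skipped because of its device class.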

thrix commented 1 year ago

@Mystro256 @happz so, looking further into how to map this to something that comes out of lshw and lspci:

So I will go with:

gpu:
    product-name: "~ Radeon Pro .+"
    vendor-name: AMD

To comply with the naming advice in https://tmt.readthedocs.io/en/latest/spec/hardware.html#names-and-ids

happz commented 1 year ago

@thrix wouldn't product-name be the same field as device-name from the device specification PR, https://github.com/teemtee/tmt/pull/1759/files#diff-9dd87f09c4ab902df670b30e00fe89d0966a3499985258b1d1f731e52f9fd322R12?

thrix commented 1 year ago

@happz seems like it, well, naming :) I like that it is mapped to what lshw reports, but I have no strong objections to unifying it.