Hardware requirements for a GPU

thrix commented 1 year ago

We have a request to be able to test against AWS instances with a GPU:

https://discussion.fedoraproject.org/t/setting-up-fedora-ci-for-rocm/84373/11

To start, it seems it would be enough to ask for HW with a dedicated GPU, and maybe to say whether it is NVIDIA or Intel.

Any more ideas are welcome.

Mystro256 commented 1 year ago

From the thread, for ROCm testing I would need either a vega or navi HW. Anything before that is really buggy and not great for testing. I doubt you have MI HW in AWS, but those are based on vega/navi and are designed for ROCm, so those would obviously work too.

happz commented 1 year ago

How would one look for "a vega or navi HW"? As a GPU noob, I presume it could boil down to something like a "graphics card" name, model names, vendors: something similar to the current CPU requirement specs, https://tmt.readthedocs.io/en/stable/spec/hardware.html#cpu.

00:1e.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 12 [Radeon Pro V520/V540] (rev c3)

A bit of brainstorming: a "vendor" seems to be something to recognize, "Radeon Pro V520/V540" feels like cpu.model-name, and "Navi" is supposed to be a "code name". The architecture is called something different, so there is probably no point trying to fit it into the CPU's "family" or "family-name".

gpu:
    # Probably not precise, might end up way too verbose
    model-name: "~ Radeon Pro .+"
    # Sure, this should be cheap to support, even though it's fairly useless on its own
    vendor: AMD
    # "I would need either a vega or navi HW"
    arch: "~ vega|navi"

It needs to be configurable for instance-type-based providers like AWS or OpenStack. We can already extract CPU info from AWS EC2 describe-instance-types and route model-name: Graviton3 to the right set of instance types, and different instance types may easily share a vendor but not a model-name. Beaker also needs to expose GPU info; we could then create a filter to match it, even if it had to be merged from these distinct keys (I bet it does expose the info, but I don't recall the right XML element for the filter).
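A minimal sketch of how a gpu block like the one above might be checked against a single machine's GPU description. It assumes that a leading ~ means a regex search (done case-insensitively here so that "navi" matches "Navi 12") and that anything else is an exact comparison. The helper names and both sample dictionaries are hypothetical and are not tmt code.

```python
import re


def matches(constraint: str, value: str) -> bool:
    """Check one constraint string against one reported value.

    Assumption: a leading "~" means "regex search"; anything else is
    compared for exact equality.
    """
    constraint = constraint.strip()
    if constraint.startswith("~"):
        pattern = constraint[1:].strip()
        return re.search(pattern, value, re.IGNORECASE) is not None
    return constraint == value


def gpu_matches(requirement: dict, gpu: dict) -> bool:
    """Every key in the requirement must be present in the GPU and match."""
    return all(
        key in gpu and matches(expected, gpu[key])
        for key, expected in requirement.items()
    )


# The brainstormed requirement, written as a plain dictionary.
requirement = {
    "model-name": "~ Radeon Pro .+",
    "vendor": "AMD",
    "arch": "~ vega|navi",
}

# Hypothetical, already normalized descriptions of two GPUs.
v520 = {"model-name": "Radeon Pro V520/V540", "vendor": "AMD", "arch": "Navi 12"}
nvs290 = {"model-name": "Quadro NVS 290", "vendor": "NVIDIA", "arch": "G86"}

if __name__ == "__main__":
    print(gpu_matches(requirement, v520))
    print(gpu_matches(requirement, nvs290))
```

For instance-type-based providers, the same check could presumably run once per instance type against whatever GPU description the provider exposes, rather than against a live machine.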

Mystro256 commented 1 year ago

We could just filter for vendor AMD and a model name containing Radeon, as I doubt much of the older HW is floating around these days. If it's not as easy as grepping lspci, then in the worst case I could set up a more complex regex that matches the model names applicable to the test I'm doing, i.e. a whitelist of models that would work for the test.
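As a rough sketch of that filter, assuming the one-line-per-device format that lspci prints by default: the function name, the list of device classes and the second sample line are assumptions made for illustration, and only the first sample line comes from this thread.

```python
import re

# Device classes under which lspci commonly lists GPUs (assumed list).
GPU_CLASSES = ("VGA compatible controller", "Display controller", "3D controller")

# The first line is the lspci line quoted earlier in this thread; the second
# is a hypothetical non-AMD device added for contrast.
SAMPLE_LSPCI = """\
00:1e.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 12 [Radeon Pro V520/V540] (rev c3)
01:00.0 VGA compatible controller: NVIDIA Corporation G86 [Quadro NVS 290] (rev a1)
"""


def find_gpus(lspci_output: str, vendor: str = "AMD", model_pattern: str = r"Radeon"):
    """Return descriptions of GPU devices matching the vendor and model regex."""
    gpus = []
    for line in lspci_output.splitlines():
        # Split "<slot> <class>: <description>" into its three parts.
        _slot, _, rest = line.partition(" ")
        device_class, separator, description = rest.partition(": ")
        if not separator or device_class not in GPU_CLASSES:
            continue
        if vendor in description and re.search(model_pattern, description):
            gpus.append(description)
    return gpus


if __name__ == "__main__":
    for gpu in find_gpus(SAMPLE_LSPCI):
        print(gpu)
```

Passing a stricter model_pattern, for example r"Radeon Pro V5[24]0", would act as the whitelist of known-good models.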

Does that seem feasible?

thrix commented 1 year ago

@Mystro256 @happz so, looking further into mapping this to what comes out of lshw and lspci:

my localhost

    *-display
         description: VGA compatible controller
         product: Alder Lake-P Integrated Graphics Controller
         vendor: Intel Corporation
         physical id: 2
         bus info: pci@0000:00:02.0
         logical name: /dev/fb0
         version: 0c
         width: 64 bits
         clock: 33MHz
         capabilities: vga_controller bus_master cap_list rom fb
         configuration: depth=32 driver=i915 latency=0 resolution=1920,1200
         resources: iomemory:600-5ff iomemory:400-3ff irq:165 memory:603c000000-603cffffff memory:4000000000-400fffffff ioport:2000(size=64) memory:c0000-dffff memory:4010000000-4016ffffff memory:4020000000-40ffffffff

aws nitro instance

[root@ip-172-31-28-199 ~]# lshw  -C display
*-display UNCLAIMED       
   description: VGA compatible controller
   product: Amazon.com, Inc.
   vendor: Amazon.com, Inc.
   physical id: 3
   bus info: pci@0000:00:03.0
   version: 00
   width: 32 bits
   clock: 33MHz
   capabilities: vga_controller
   configuration: latency=0
   resources: memory:fe400000-fe7fffff memory:c0000-dffff

my desktop

$ lshw -C display
*-display                 
   description: VGA compatible controller
   product: G86 [Quadro NVS 290]
   vendor: NVIDIA Corporation
   physical id: 0
   bus info: pci@0000:01:00.0
   version: a1
   width: 64 bits
   clock: 33MHz
   capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
   configuration: driver=nouveau latency=0
   resources: irq:29 memory:f2000000-f2ffffff memory:e0000000-efffffff memory:f0000000-f1ffffff ioport:1100(size=128)

So I will go with:

gpu:
    product-name: "~ Radeon Pro .+"
    vendor-name: AMD

To comply with the naming advice in https://tmt.readthedocs.io/en/latest/spec/hardware.html#names-and-ids
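For illustration, a small sketch of how the lshw -C display text shown above could be turned into these two keys. The function names are hypothetical, the sample is a shortened copy of the desktop output, and the parsing assumes the plain "key: value" layout seen in these three outputs.

```python
# Shortened copy of the "my desktop" output from this comment.
SAMPLE_LSHW = """\
*-display
   description: VGA compatible controller
   product: G86 [Quadro NVS 290]
   vendor: NVIDIA Corporation
   physical id: 0
   bus info: pci@0000:01:00.0
   configuration: driver=nouveau latency=0
"""


def parse_lshw_display(output: str) -> list[dict[str, str]]:
    """Split `lshw -C display` text into one dictionary per *-display node."""
    devices: list[dict[str, str]] = []
    for raw_line in output.splitlines():
        line = raw_line.strip()
        if line.startswith("*-display"):
            # A new device node starts; "UNCLAIMED" suffixes are ignored.
            devices.append({})
        elif devices and ": " in line:
            key, _, value = line.partition(": ")
            devices[-1][key] = value
    return devices


def to_gpu_fields(device: dict[str, str]) -> dict[str, str]:
    """Map lshw field names to the proposed gpu requirement keys."""
    return {
        "product-name": device.get("product", ""),
        "vendor-name": device.get("vendor", ""),
    }


if __name__ == "__main__":
    for device in parse_lshw_display(SAMPLE_LSHW):
        print(to_gpu_fields(device))
```

One thing this makes visible: lshw reports the full vendor string (here "NVIDIA Corporation"), so a short exact value such as vendor-name: AMD would probably need a regex match or some normalization to hit what lshw prints.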

happz commented 1 year ago

@thrix wouldn't product-name be the same field as device-name from the device specification PR, https://github.com/teemtee/tmt/pull/1759/files#diff-9dd87f09c4ab902df670b30e00fe89d0966a3499985258b1d1f731e52f9fd322R12?

thrix commented 1 year ago

@happz seems like it, well, naming :) I like that it is mapped to what lshw reports, but I have no strong objections to unifying it.

teemtee / tmt

Hardware requirements for a GPU #2154