siderolabs / pkgs

Mozilla Public License 2.0

Enable Intel Management Engine Interface (MEI) #700

Closed uhthomas closed 1 month ago

uhthomas commented 1 year ago

It looks like the Intel Management Engine Interface (MEI) is required to use Intel Arc. Without it, the i915 firmware does not work, as the HuC firmware fails to load.

See:

https://github.com/jellyfin/jellyfin/issues/9588

https://github.com/uhthomas/pkgs/commit/6a83361b8e4facfe551657e49bc71fa114d8be0f

https://gitlab.freedesktop.org/drm/intel/-/issues/7732

PrivatePuffin commented 1 month ago

@smira this should be moved to the official extensions repo, as it requires an extension, which I think is a reasonable fit as an official extension.

smira commented 1 month ago

Yes, sure, but the issue itself is not actionable

PrivatePuffin commented 1 month ago

> Yes, sure, but the issue itself is not actionable

In what sense is it not actionable?

smira commented 1 month ago

I, for example, have no idea what needs to be done. If it were a PR or detailed steps, it would be easier.

PrivatePuffin commented 1 month ago

Agreed... They are talking about enabling kernel flags in the linked issues...

However... I've checked the Talos kernel flags and this isn't disabled; afaik it should be enabled when unset.

I'll continue doing some more research and see if I can get a more specific pointer to work from. My arc card just arrived to test!

e3b0c442 commented 1 month ago

> afaik it should be enabled when unset

I do not believe this is the case.

The second link appears to have the exact kernel flags that need to be set; I'm working on a build with those flags this morning (this is my first experience building a custom kernel for Talos so I'm feeling my way through the process here); I will report back.

e3b0c442 commented 1 month ago

Confirmed on the kernel config.

Additionally, I was able to see during kernel configuration that the flag CONFIG_INTEL_MEI can also be set as a module, so unless I'm mistaken, we should indeed be able to achieve this with an extension similar to the NVIDIA GPU extension by packaging the modules.

This is likely beyond my level of expertise, but I'm at least going to try and get this kernel built and boot assets created so I can test this on my own as a working solution.

PhilippWitzmann commented 1 month ago

Thank you for investigating this. I've spent the whole day debugging my jellyfin installation and talos cluster after trying to use my new arc gpu, which ultimately led me here.

Happy to help.

uhthomas commented 1 month ago

Glad my foundational work was helpful, thanks for verifying it.

I wonder what the right thing to do here is? An Intel ME extension? Or should the i915 extension be updated to enable the Intel management engine?

e3b0c442 commented 1 month ago

I think a separate ME extension makes more sense:

Of course, this is just one person's take.

e3b0c442 commented 1 month ago

OK, potential slight snag: CONFIG_DRM_I915_PXP is a yes/no option, so it does need to be enabled in the kernel proper. This may need more changes than just an extension.

e3b0c442 commented 1 month ago

OK. This seems too simple to be true (kudos to the devs here if it really is this simple).

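If I've got this correct, we should just need to:

  • update the kernel config in the pkgs repo to add the ME modules
  • create an extension containing the ME (e.g. intel-me) modules in the extensions repo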

Then load the intel-me and i915-ucode extensions in the machine config, once the above are released.

Does that sound right?

PrivatePuffin commented 1 month ago

> OK. This seems too simple to be true (kudos to the devs here if it really is this simple).
>
> If I've got this correct, we should just need to:
>
>   • update the kernel config in the pkgs repo to add the ME modules
>   • create an extension containing the ME (e.g. intel-me) modules in the extensions repo
>
> Then load the intel-me and i915-ucode extensions in the machine config, once the above are released.
>
> Does that sound right?

Afaik the config in the pkgs repo is always loaded, so if you set CONFIG_DRM_I915_PXP there, for example, it's always loaded. I'm not sure whether that plays nicely with extensions. Does it?

smira commented 1 month ago

The kernel build from pkgs doesn't fully go into Talos: only some kernel modules are shipped by default, and the others are delivered via extensions.

You can use this for inspiration - https://github.com/siderolabs/extensions/tree/main/drivers/usb-modem

e3b0c442 commented 1 month ago

All right, I was able to munge through the docs to get the custom installer built with the custom kernel and modules and the i915 microcode. Unfortunately, I think something is still missing/awry, as Plex looked on the dashboard like it was going to start using the hardware encoder, but then the transcoder segfaulted:

go: kern:    info: [2024-07-18T04:17:00.920239299Z]: Plex Transcoder[15773]: segfault at 0 ip 00007fee03bbac07 sp 00007ffe5ac663b0 error 4 in libigdrcl.so[7fee0392d000+3c4000] likely on CPU 8 (core 16, socket 0)
go: kern:    info: [2024-07-18T04:17:01.101769299Z]: Code: 44 8b ab bc 07 00 00 41 c6 84 24 08 01 00 00 00 49 8d 44 24 08 49 89 04 24 f6 43 2c 01 74 50 48 8b 45 c8 48 8b b8 a0 00 00 00 <48> 8b 07 4c 8b 58 38 e8 dd 31 da ff 84 c0 75 35 83 bb e0 06 00 00

I need to dig a bit and make sure we aren't missing any necessary modules/drivers. I have pushed what I built last night to a public registry and it's at ghcr.io/e3b0c442/talos-installer:v1.7.5-mei. This has the custom kernel/modules and the i915-ucode extension, if anyone else wants to kick the tires/try troubleshooting. The kernel config I used in my machineconfig is:

    kernel:
      modules:
        - name: mei_hdcp
        - name: mei-gsc
        - name: mei-me
        - name: mei-txe
        - name: mei
        - name: mei_pxp
        - name: mei_wdt
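For context, here is a minimal sketch of how the fragment above might sit in a complete machine config patch. It assumes the standard Talos `machine.kernel.modules` and `machine.install.image` fields; the module list and image reference are the ones from this comment, and the overall layout is an illustration, not a tested configuration.

```yaml
# Hypothetical machine config patch; adjust to your own cluster.
machine:
  install:
    # Custom installer from this comment (custom kernel/modules plus i915-ucode).
    # Assumption: an already-installed node would instead be moved to this image
    # with an upgrade, e.g. `talosctl upgrade --image <image>`.
    image: ghcr.io/e3b0c442/talos-installer:v1.7.5-mei
  kernel:
    # Modules to load at boot, in the same order as listed above.
    modules:
      - name: mei_hdcp
      - name: mei-gsc
      - name: mei-me
      - name: mei-txe
      - name: mei
      - name: mei_pxp
      - name: mei_wdt
```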

I'm still waiting for the green light from my employer before I can submit code changes. Hopefully that will come before I figure out what's still failing on the Plex side. Alternatively, if anyone wants to pick up the baton in that regard I would have no complaints. :)

e3b0c442 commented 1 month ago

The segfault is a kernel bug; Plex is working on a workaround: https://github.com/tteck/Proxmox/discussions/3162

After making the suggested change to disable tone mapping, hardware encoding is working as expected. As soon as I get clearance to submit PRs I will do so; otherwise, it does appear that we just need to update the kernel config as in https://github.com/uhthomas/pkgs/commit/6a83361b8e4facfe551657e49bc71fa114d8be0f and then create an MEI extension, if somebody else wants to run with this.

//edit: another reference, the issue is a kernel bug in >=6.6.26 https://github.com/jellyfin/jellyfin/issues/11380

//edit2: looks to be resolved in kernel 6.6.31. I'm not sure if another v1.7 patch release is planned (but also hoping we can get these changes in before v1.8)

//edit3: hmmm... the build I put out has kernel 6.6.33 so it must not actually be resolved. Will need to dig more.

PhilippWitzmann commented 1 month ago

I've updated my machine using ghcr.io/e3b0c442/talos-installer:v1.7.5-mei, patched the Talos config with the above-mentioned modules, and got it running. Jellyfin is also happily decoding using Intel QSV on an Intel Arc A380 now. Will keep testing tonight for stability. Thanks so much!

//edit: stable transcoding for ~10hrs with vpp tone mapping and intel low power enabled

e3b0c442 commented 1 month ago

I have just gotten the necessary approval from my employer to submit contributions. I need to wait for the final paperwork to come through -- hopefully by the end of week -- then I'll be able to open PRs with these changes against the repos and hopefully 🤞 get this in for the 1.8 release, maintainers willing. :)