pop-os / cosmic-comp

Compositor for the COSMIC desktop environment
GNU General Public License v3.0

nvidia 545.29.06 broken #221

Closed skygrango closed 11 months ago

skygrango commented 1 year ago

I wanted to try keyboard IM support, but I can't even launch the desktop properly.

The desktop shows up, but I can't move my cursor; it seems to be frozen.

Did I miss something?

distro: Arch, up to date
kernel: linux-cachyos 6.6.1-1, booted with nvidia_drm.modeset=1
graphics card: GTX 1080
driver: nvidia 545.29.02-4 / nvidia-utils 545.29.02-2
pkg: cosmic-epoch-git r101.a83f8dc-1 / cosmic-comp 9a04fa2abdd53cbe4798dcaaf42bea89d8d073d1
env: EGL_PLATFORM=wayland LIBVA_DRIVER_NAME=nvidia GBM_BACKEND=nvidia-drm __GLX_VENDOR_LIBRARY_NAME=nvidia

dmesg shows some errors:

NVRM: VM: invalid mmap
NVRM: VM: invalid mmap
NVRM: VM: invalid mmap
NVRM: VM: invalid mmap
NVRM: VM: invalid mmap
NVRM: VM: invalid mmap
NVRM: VM: invalid mmap
NVRM: VM: invalid mmap
NVRM: VM: invalid mmap
NVRM: VM: invalid mmap
Drakulix commented 1 year ago
  1. Is that the only GPU in your system?
  2. please don't add EGL, GBM and _GLX environment variables to cosmic-comp. Those are meant for applications and can break stuff in compositors.
  3. Can you post the output of journalctl --user _EXE=/usr/bin/cosmic-comp after such a frozen run please?
skygrango commented 1 year ago
  1. Is that the only GPU in your system?

I have an iGPU too, but I have never used it. I can check again on Monday.

  2. please don't add EGL, GBM and _GLX environment variables to cosmic-comp. Those are meant for applications and can break stuff in compositors.

I followed the Arch wiki guidelines for the setup, and it works for me on KDE.

I'm surprised by your answer. Do you mean cosmic-comp supports EGLStream? I never thought it would work on Wayland, but I can switch to EGLStream and test again.

  3. Can you post the output of journalctl --user _EXE=/usr/bin/cosmic-comp after such a frozen run please?

Sure, I will do it next Monday.

skygrango commented 1 year ago

1.

glxinfo | grep "OpenGL renderer"
OpenGL renderer string: NVIDIA GeForce GTX 1080/PCIe/SSE2

2 and 3. It seems that the new version of the nvidia driver breaks some compatibility,

but even after returning to version r93 of cosmic-epoch, it still cannot start normally...
with env: https://gist.github.com/skygrango/5925679c41db053eebbaddf3ea075dea
without env: https://gist.github.com/skygrango/f6d685bec781edb44937cf59d88513bd

skygrango commented 1 year ago

"Unable to become drm master, assuming unprivileged mode" is interesting..

skygrango commented 1 year ago

[EGL] 0x300c (BAD_PARAMETER) eglQueryDmaBufModifiersEXT: EGL_BAD_PARAMETER error: In eglQueryDmaBufModifiersEXT: Invalid format

I think that is an nvidia driver bug, although KDE still works for me.

Drakulix commented 12 months ago
  1. Is that the only GPU in your system?

I have iGPU too, but I never use it before. I can check it again on Monday.

That doesn't mean some application might not use it.

Can you run ls -l /dev/dri/by-path, figure out which of those is your nvidia GPU (e.g. together with lspci), and then set COSMIC_RENDER_DEVICE=/dev/dri/renderD12X in your environment (with the nvidia GPU as the render device) to make sure cosmic-comp will not use the iGPU?

but I return to version r93 of cosmic-epoch, it still cannot start normally...
with env : https://gist.github.com/skygrango/5925679c41db053eebbaddf3ea075dea
without env : https://gist.github.com/skygrango/f6d685bec781edb44937cf59d88513bd

Older versions have a bug that prevents them from working with the 545 driver; you will need the latest master.

  2. please don't add EGL, GBM and _GLX environment variables to cosmic-comp. Those are meant for applications and can break stuff in compositors.

I follow arch wiki and guideline to setup, it work for me on KDE

As I said, those are settings for Applications. cosmic-comp uses for example the egl-device and egl-gbm platforms (not the wayland platform as it by itself isn't a wayland-client) and thus these settings don't need to be set for compositors (just for the applications running on it).

I'm surprised by your answer, do you mean cosmic-comp supports EGLStream? I never think it will work on wayland, but I can switch to EGLStream to test again

No, we don't use EGLStreams, which is also why you have to run with nvidia-drm.modeset=1 and the egl-gbm library installed.

  3. Can you post the output of journalctl --user _EXE=/usr/bin/cosmic-comp after such a frozen run please?

Unable to become drm master, assuming unprivileged mode is interesting..

[EGL] 0x300c (BAD_PARAMETER) eglQueryDmaBufModifiersEXT: EGL_BAD_PARAMETER error: In eglQueryDmaBufModifiersEXT: Invalid format I think that is nvidia driver bug

Not interesting at all, these are normal on nvidia and don't cause any issues. Can you additionally set RUST_LOG=info please and re-run? The only interesting error is "Error rendering", but it is sadly lacking some info.

skygrango commented 12 months ago

Thank you for your detailed explanation !

1.

ls -l /dev/dri/by-path
lrwxrwxrwx 1 root root  8 Nov 14 11:52 pci-0000:01:00.0-card -> ../card0
lrwxrwxrwx 1 root root 13 Nov 14 11:52 pci-0000:01:00.0-render -> ../renderD128

0000:01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)

I do have only one render, it's good.
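
For reference, the lookup described above (match a PCI address against the by-path entries, then use the render node as COSMIC_RENDER_DEVICE) could be sketched in Rust roughly as follows. This is only an illustration: `render_node_for` and `LISTING` are hypothetical names that do not exist in cosmic-comp, and the listing is pasted in as a string rather than read from the system.

```rust
/// Output of `ls -l /dev/dri/by-path`, pasted in as text for illustration.
const LISTING: &str = "\
lrwxrwxrwx 1 root root  8 Nov 14 11:52 pci-0000:01:00.0-card -> ../card0
lrwxrwxrwx 1 root root 13 Nov 14 11:52 pci-0000:01:00.0-render -> ../renderD128";

/// Hypothetical helper: find the `pci-<address>-render` symlink in the listing
/// and return the render node path it points to under /dev/dri.
fn render_node_for(listing: &str, pci_address: &str) -> Option<String> {
    let link_name = format!("pci-{pci_address}-render");
    listing.lines().find_map(|line| {
        // Each line looks like "<metadata> <link name> -> <target>".
        let (left, target) = line.split_once(" -> ")?;
        if left.split_whitespace().last()? != link_name.as_str() {
            return None;
        }
        // "../renderD128" -> "renderD128"
        let node = target.trim().rsplit('/').next()?;
        Some(format!("/dev/dri/{node}"))
    })
}

fn main() {
    match render_node_for(LISTING, "0000:01:00.0") {
        Some(path) => println!("COSMIC_RENDER_DEVICE={path}"),
        None => println!("no render node found for that PCI address"),
    }
}
```

With a single GPU exposed, as here, there is only one candidate, so the variable is mainly useful on systems where an iGPU also shows up in the listing.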

  2. I updated all submodules again:

cosmic-epoch-git r103.6c000aa-1

log : https://gist.githubusercontent.com/skygrango/0b2bea3b050852bb6e7b56e236a60c28/raw/dc290cb58fa34eca64459e6b47a17ba5884d1042/cosmic-comp-no-env-info-3.log

skygrango commented 12 months ago

It shows "New screen configuration invalid!:". What can I do about this?

Drakulix commented 12 months ago

Nov 14 11:48:16 cosmic-comp[1516]: thread 'main' panicked at 'Malformed config file: SpannedError { code: MissingStructField { field: "data_control_enabled", outer: Some("StaticConfig") }, position: Position { line: 85, col: 1 } }': src/config/mod.rs:184

Your config file is outdated. Please grab the latest one from master: https://github.com/pop-os/cosmic-comp/raw/master_jammy/config.ron

it show New screen configuration invalid!:, what can i do about this ?

Now we are getting somewhere! That this is an atomic configuration error and that we have the configuration cosmic is trying to set is really helpful to narrow this down.

I just need one additional piece of information. Can you run drm_info on any working setup (e.g. KDE) and post the result? It should be on the aur.

skygrango commented 12 months ago

OK !

here is my drm_info : https://gist.github.com/skygrango/2d6ebb4bbcd23fac3600a7eeff9dc094

If you need me to test again, please let me know

Drakulix commented 12 months ago

Ok great, so let me give you a quick rundown of what happens.

We are building an atomic request to set up the screen via the KMS API with a bunch of parameters (so-called properties). Building up this list happens in smithay in this function: https://github.com/Smithay/smithay/blob/master/src/backend/drm/surface/atomic.rs#L683

We take a bunch of properties as a given, because they are mandated by the spec. So we can safely ignore those, and in fact we can see in the log that they are set to sensible values:

AtomicModeReq {
    objects: [
        35,
        41,
        85,
    ],
    count_props_per_object: [
        12,
        2,
        1,
    ],
    props: [
        ...
    ],
    values: [
        0,
        0,
        167772160,
        94371840,
        0,
        0,
        2560,
        1440,
        103,
        79,
        41,
        1,
        1,
        98,
        41,
    ],
}

Object 85 is your display port connector, which gets one property, so it's the last one in the list: "41". That is the ID of the crtc (or the CRTC_ID property), which is the second object we will be looking at.

Object 41 is the CRTC and it gets two properties. The property ACTIVE is set to 1 and the MODE_ID is set to 98. (The latter isn't really important and different from the value KDE is setting in your drm-log, because it is a pointer. They likely point to the same mode - 2560x1440@119.87. We don't see the data in the log, but I am pretty certain, that this is correct.)

Which leaves us with Object 35, which is Plane 0. A bunch of these values are pretty obvious, e.g. the first 8 are SRC_X, SRC_Y, SRC_W, SRC_H, CRTC_X, CRTC_Y, CRTC_W, CRTC_H. The 41 points the plane to our CRTC, so that is again CRTC_ID, leaving us with 103, 79 and 1.

Looking at smithay's code and drm_info, only three possible candidates remain: rotation, FB_ID and IN_FENCE_FD, because the plane has no other properties that smithay is setting.

Rotation is easy, that is the 1, as Plane 0 just accepts a single value here. FB_ID could be either and is again a pointer, IN_FENCE_FD is a file descriptor and could also be either. But the bad thing here is, that IN_FENCE_FD should never be set, because although the driver exposes this property, it doesn't support any other values than -1 (or unset), because it is lacking the capability DRM_CAP_SYNCOBJ (as seen at the top of your drm_info log).

Which is why the atomic request is rejected by the driver and cosmic fails to put anything on the screen.
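
The bookkeeping in this walkthrough can be reproduced mechanically. The following is a minimal sketch, not smithay code: it splits the flat `values` list from the log by `count_props_per_object`, and converts SRC_W and SRC_H, which KMS expresses in 16.16 fixed point, back to pixels. The function names are hypothetical, and the position of SRC_W and SRC_H in the list is taken from the property order given above.

```rust
/// The flat `values` list from the AtomicModeReq in the log.
fn request_values() -> Vec<u64> {
    vec![
        0, 0, 167772160, 94371840, 0, 0, 2560, 1440, 103, 79, 41, 1, // object 35 (plane)
        1, 98, // object 41 (CRTC)
        41, // object 85 (connector)
    ]
}

/// Split the flat value list into one slice per object, using
/// `count_props_per_object` as the slice lengths.
fn group_values<'a>(
    objects: &[u32],
    counts: &[usize],
    values: &'a [u64],
) -> Vec<(u32, &'a [u64])> {
    let mut offset = 0;
    let mut grouped = Vec::new();
    for (&object, &count) in objects.iter().zip(counts) {
        grouped.push((object, &values[offset..offset + count]));
        offset += count;
    }
    grouped
}

/// SRC_* plane coordinates are 16.16 fixed point; drop the fractional bits.
fn fixed_16_16_to_pixels(value: u64) -> u64 {
    value >> 16
}

fn main() {
    let values = request_values();
    for (object, props) in group_values(&[35, 41, 85], &[12, 2, 1], &values) {
        println!("object {object}: {props:?}");
    }
    println!(
        "source size: {}x{}",
        fixed_16_16_to_pixels(values[2]),
        fixed_16_16_to_pixels(values[3])
    );
}
```

Grouping the values this way leaves object 35 with twelve entries, of which only 103 and 79 are not accounted for by the source rectangle, the CRTC rectangle, CRTC_ID and rotation.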

Now onto the weird part: I fixed this issue with the nvidia 545 driver weeks ago in smithay: https://github.com/Smithay/smithay/commit/dfa75eaa3d8e9865f8e5cebd04258b8f51cad1cb

So it should look for the syncobj capability, figure out fencing is not supported and never try to send a value to the driver. And on my systems, that works, somehow on yours we still end up with a value here.
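
The intended behaviour can be illustrated with a small sketch. This is not smithay's actual code; `DriverCaps` and `in_fence_fd_value` are hypothetical names, and the file descriptor 7 is an arbitrary example value. The idea is simply that a fence is only passed along when the driver reports the syncobj capability, and the property is otherwise left out of the request.

```rust
/// Hypothetical stand-in for the capabilities a driver reports.
struct DriverCaps {
    /// Whether the driver advertises DRM_CAP_SYNCOBJ.
    syncobj: bool,
}

/// Returns the file descriptor to set as IN_FENCE_FD, or `None` when the
/// property should be left out of the atomic request altogether.
fn in_fence_fd_value(caps: &DriverCaps, fence_fd: Option<i32>) -> Option<i32> {
    if caps.syncobj {
        fence_fd
    } else {
        // Without syncobj support the driver rejects any real fence,
        // so never send one.
        None
    }
}

fn main() {
    let without_syncobj = DriverCaps { syncobj: false };
    let with_syncobj = DriverCaps { syncobj: true };
    println!("{:?}", in_fence_fd_value(&without_syncobj, Some(7)));
    println!("{:?}", in_fence_fd_value(&with_syncobj, Some(7)));
}
```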

Which leaves two options:

  1. Either you are running an outdated version of cosmic
  2. It somehow still ends up with this value

So first off, are you sure you are running the right version?

cosmic-epoch-git r103.6c000aa-1

This seems suspicious to me, as the AUR package lists r99-4a6621a-1.

Also 6c000aa doesn't even resolve to a known commit of that repository: https://github.com/pop-os/cosmic-epoch/commit/6c000aa

If you did update the submodules locally, note that just changing their commits doesn't check out the new state automatically.

4a6621a does, but is indeed too old. I'll update the cosmic-epoch repository to fix that.

If it turns out to be option 2, how is your rust experience? Could I ask you to debug this with a few more hints? Or would it be better, if I just clutter the log with more details to hopefully figure out remotely, how we end up in this state?

Drakulix commented 12 months ago

cosmic-epoch updated.

skygrango commented 12 months ago
  1. Here is my fork: https://github.com/skygrango/cosmic-epoch I can rebase and update the submodules again. Can you help me check it?

  2. I have some experience with Rust development, but I’m not familiar with DRM. What can I do for you? Change the log level? Add some debug prints? You may have to tell me where I should insert a print.

skygrango commented 12 months ago

I rebase my fork here : https://github.com/skygrango/cosmic-epoch/commits/master

and left the old one cosmic-epoch-git r103.6c000aa-1 here to let you check : https://github.com/skygrango/cosmic-epoch/commits/master_old

Drakulix commented 12 months ago

I rebase my fork here : https://github.com/skygrango/cosmic-epoch/commits/master

and left the old one cosmic-epoch-git r103.6c000aa-1 here to let you check : https://github.com/skygrango/cosmic-epoch/commits/master_old

They both look fine, the question is how are you building that? With the AUR package? Or by manually building? If it's the latter, you need to make sure to not just git pull, but also update your submodules with git submodule update --init --recursive

skygrango commented 12 months ago

I cloned the AUR package, modified the PKGBUILD to point to my fork, and then ran makepkg. Done.

I use git submodule update --remote to update the submodules; this seems to work well :)

I will modify the AUR package tomorrow so that the new submodules can be compiled.

Here is the drm_info from my 7900 XTX, where it works: https://gist.github.com/skygrango/168a042c39b8a1740bf93507290375be

skygrango commented 12 months ago

hey, I found that in https://github.com/pop-os/cosmic-comp/blob/master_jammy/Cargo.toml

[dependencies.smithay]
version = "0.3"
git = "https://github.com/smithay/smithay.git"
rev = "74ef59a3f"

maybe we just need to update since this is older than the fix you mentioned https://github.com/Smithay/smithay/commit/dfa75eaa3d8e9865f8e5cebd04258b8f51cad1cb

skygrango commented 12 months ago

oh sorry, just found that

[patch."https://github.com/Smithay/smithay.git"]
smithay = { git = "https://github.com/smithay//smithay", rev = "d5b352b" }
andyczerwonka commented 12 months ago

I logged https://github.com/alacritty/alacritty/issues/7372 and https://github.com/obsproject/obs-studio/issues/9870 this morning when the new 545 driver came through. I reverted back to 535 and both are now back to a working state.

skygrango commented 12 months ago

I think so too. It's an nvidia problem, even though KDE still works.

@Drakulix maybe instead of wasting your energy, let's close this issue first ?

If you still want to know some error messages, I can still help provide information

skygrango commented 12 months ago

I saw this commit : https://github.com/elFarto/nvidia-vaapi-driver/commit/98887098da50b9acff686a1a0e468df3926b47b2

nvidia made stupid design changes ...

//NVIDIA driver v545.29.02 changed the devInfo struct, and partly broke it in the process
//...who adds a field to the middle of an existing struct....

Drakulix commented 12 months ago

I think so, It's nvidia problem even though KDE still work

It's a problem specific to the nvidia driver, but not a problem of the driver. smithay sends a fence when it shouldn't, but I am not yet convinced that you are indeed using a recent enough build of cosmic.

@Drakulix maybe instead of wasting your energy, let's close this issue first ?

Feel free to close this issue at any time, I am just trying to help you with your problem.

If you still want to know some error messages, I can still help provide information

Sure, let's do that.

Try changing this line please to surface.surface = Some(dbg!(target)); and make a debug build of cosmic-comp (cargo build, not cargo build --release !). Then try that and please post the logs again. :)
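
As background on why this change is harmless: `dbg!` prints the file, line, expression and value to stderr and then returns the value, so the assignment itself is unaffected. A minimal sketch, with `Target`, `Surface` and `assign` as hypothetical stand-ins rather than the real cosmic-comp types:

```rust
/// Hypothetical stand-in for whatever `target` is in cosmic-comp.
#[derive(Debug, Clone, PartialEq)]
struct Target {
    width: u32,
    height: u32,
}

/// Hypothetical stand-in for the struct holding the `surface` field.
#[derive(Debug, Default)]
struct Surface {
    surface: Option<Target>,
}

fn assign(surface: &mut Surface, target: Target) {
    // dbg! logs the value to stderr and hands it back unchanged,
    // so this stores exactly what `Some(target)` would have stored.
    surface.surface = Some(dbg!(target));
}

fn main() {
    let mut surface = Surface::default();
    assign(&mut surface, Target { width: 2560, height: 1440 });
    if let Some(t) = &surface.surface {
        println!("stored target: {}x{}", t.width, t.height);
    }
}
```

The wrapped type has to implement `Debug` for this to compile.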

I saw this commit : elFarto/nvidia-vaapi-driver@9888709

//NVIDIA driver v545.29.02 changed the devInfo struct, and partly broke it in the process
//...who adds a field to the middle of an existing struct....

nvidia-vaapi-driver is directly using the unstable nvapi, so there is no "stupid" decision here, they never committed to a stable api in the first place. So changes like these for the 545 driver are absolutely expected.

skygrango commented 11 months ago

I made a fork of cosmic-comp

log : https://gist.github.com/skygrango/e183a2f1b386a9c7d5a4ac1dd06cb184

skygrango commented 11 months ago

try to run cosmic-comp in tty

log: https://gist.github.com/skygrango/2260f78894ed260bacfb2c2deff92a25

Drakulix commented 11 months ago

try to run cosmic-comp in tty

log: https://gist.github.com/skygrango/2260f78894ed260bacfb2c2deff92a25

Looks completely fine. Seems like you let it run for 5 seconds, before switching tty again.

skygrango commented 11 months ago

my mouse can't move, what could be the reason?

skygrango commented 11 months ago

The situation is not good: the desktop is slow to show up (it can take 30 seconds), and I can't move my mouse. Even switching to a different tty takes more than 10 seconds. What can I do to improve this? Any suggestions for environment variables?

The previous driver version, 535, did not have this slowness, and I could use the mouse normally.

Drakulix commented 11 months ago

the situation is not good, because the desktop is slow to show up, it may cost 30 secs to show desktop, and I can't move my mouse even if I try to switch to a different tty, it takes more than 10 seconds to work, what should I do to improve it ? any suggestions for environment variables?

No environment variables, I honestly have no idea, as you don't have any errors in your log and I don't have a machine that replicates this issue.

previous driver version of 535 did not have such slowness, and I could use the mouse normally

I would suggest downgrading for the time being in that case. Possibly open an issue with nvidia, I would hope future updates will fix this on your system.

skygrango commented 11 months ago

That sounds very reasonable, let's move on. Thank you for your support!

Drakulix commented 11 months ago

That sounds very reasonable, let's move on. Thank you for your support!

Thank you for being so patient with this bug.

There are other reports of problems around the new synchronization mechanism of the 545 driver. I am hopeful that later versions will resolve this, but feel free to re-open once the next driver version lands, if this is still not fixed.

skygrango commented 11 months ago

I updated cosmic-comp and tried the new nvidia 545.29.06 driver.

log : https://gist.github.com/skygrango/a14fb376ca51be273bef8000a481b99a

It shows:

Compositor bug: Server ignored ImportNotifier for ZwpLinuxBufferParamsV
 { id: ObjectId(zwp_linux_buffer_params_v1@51), version: 4, data: Some(Any { .. }),
handle: WeakHandle { handle: WeakInnerHandle[sys] { .. } } }

The 545.29.06 driver also does not work properly. If there is no useful information, we can close the issue again.

Drakulix commented 11 months ago

log : https://gist.github.com/skygrango/a14fb376ca51be273bef8000a481b99a

Not a debug log, but the error is again "Error rendering", which hints at the same drm/fence issue as the previous driver version... :/

skygrango commented 11 months ago

I'm sorry for forgetting to change log level

here is new one : https://gist.github.com/skygrango/2256086a36e3ee6c7e5deb4b206bdd81

started from tty : https://gist.githubusercontent.com/skygrango/b47770839bb1dfc3b187c802679eb9a7/raw/1f73599355d27c2d03f9d1ee1bd532d70188ffcd/cosmic-comp-dbg-tty.log

tty log has DrmCompositor info if you need it

Drakulix commented 11 months ago

I'm sorry for forgetting to change log level

here is new one : https://gist.github.com/skygrango/2256086a36e3ee6c7e5deb4b206bdd81

started from tty : https://gist.githubusercontent.com/skygrango/b47770839bb1dfc3b187c802679eb9a7/raw/1f73599355d27c2d03f9d1ee1bd532d70188ffcd/cosmic-comp-dbg-tty.log

tty log has DrmCompositor info if you need it

Both logs look perfectly fine, not even a rendering error, all good until the tty-switch. What results were you seeing exactly here? Still a rendered, but otherwise unresponsive desktop?

skygrango commented 11 months ago

Both logs look perfectly fine, not even a rendering error, all good until the tty-switch. What results were you seeing exactly here? Still a rendered, but otherwise unresponsive desktop?

Yes, with one slight correction: the mouse is responsive, but it may only move once every 30 seconds. :)

skygrango commented 11 months ago

we should wait for the next nvidia driver update

skygrango commented 8 months ago

nvidia 550.54.14 works!