grate-driver / mesa

Mesa fork for open-source NVIDIA Tegra20/30 GL implementation

ninja install fails with install_megadrivers.py error #10

Open emulti opened 4 years ago

emulti commented 4 years ago

I'm trying to update my Toshiba AC100 with the latest grate-driver (kernel 5.7.8, Arch Linux Arm) using the instructions on https://github.com/grate-driver/grate/wiki/Grate-driver:

Building natively on the AC100.

1.  git clone https://github.com/grate-driver/mesa.git
2.     cd mesa
3.     meson -Dprefix=/usr -Dgallium-drivers=grate -Ddri-drivers=swrast -Dplatforms=x11,drm -Dshared-glapi=true -Dgbm=true -Dglx=dri -Dosmesa=none -Dgles1=false -Dgles2=true -Degl=true -Dgallium-xa=false -Dgallium-vdpau=false -Dgallium-va=false -Dgallium-xvmc=false -Duse-elf-tls=false -Dgallium-nine=false -Db_ndebug=true -Dvulkan-drivers= -Dlibunwind=false -Dllvm=false build
4.     cd build/
5.     ninja && ninja install

Step 3 failed at first with a "no choice for 'grate'" error. I added 'grate' to the array under gallium-drivers in meson_options.txt and was then able to build the Mesa master branch with ninja in the first part of step 5.

However, the 'sudo ninja install' command fails:

[1/15] Generating git_sha1.h with a custom command
[1/2] Installing files.
installing /home/chris/Downloads/build/mesa/build/src/mesa/drivers/dri/libmesa_dri_drivers.so to /usr/lib/dri/swrast_dri.so
Installing src/mapi/shared-glapi/libglapi.so.0.0.0 to /usr/lib
Installing src/mapi/es2api/libGLESv2.so.2.0.0 to /usr/lib
Installing src/mesa/drivers/dri/libmesa_dri_drivers.so to /usr/lib/dri
Installing src/glx/libGL.so.1.2.0 to /usr/lib
Installing src/gbm/libgbm.so.1.0.0 to /usr/lib
Installing src/egl/libEGL.so.1.0.0 to /usr/lib
Installing src/gallium/targets/dri/libgallium_dri.so to /usr/lib/dri
Installing /home/chris/Downloads/build/mesa/include/KHR/khrplatform.h to /usr/include/KHR
Installing /home/chris/Downloads/build/mesa/include/GLES2/gl2.h to /usr/include/GLES2
Installing /home/chris/Downloads/build/mesa/include/GLES2/gl2ext.h to /usr/include/GLES2
Installing /home/chris/Downloads/build/mesa/include/GLES2/gl2platform.h to /usr/include/GLES2
Installing /home/chris/Downloads/build/mesa/include/GLES3/gl3.h to /usr/include/GLES3
Installing /home/chris/Downloads/build/mesa/include/GLES3/gl31.h to /usr/include/GLES3
Installing /home/chris/Downloads/build/mesa/include/GLES3/gl32.h to /usr/include/GLES3
Installing /home/chris/Downloads/build/mesa/include/GLES3/gl3ext.h to /usr/include/GLES3
Installing /home/chris/Downloads/build/mesa/include/GLES3/gl3platform.h to /usr/include/GLES3
Installing /home/chris/Downloads/build/mesa/include/GL/gl.h to /usr/include/GL
Installing /home/chris/Downloads/build/mesa/include/GL/glcorearb.h to /usr/include/GL
Installing /home/chris/Downloads/build/mesa/include/GL/glext.h to /usr/include/GL
Installing /home/chris/Downloads/build/mesa/include/GL/glx.h to /usr/include/GL
Installing /home/chris/Downloads/build/mesa/include/GL/glxext.h to /usr/include/GL
Installing /home/chris/Downloads/build/mesa/include/EGL/egl.h to /usr/include/EGL
Installing /home/chris/Downloads/build/mesa/include/EGL/eglext.h to /usr/include/EGL
Installing /home/chris/Downloads/build/mesa/include/EGL/eglplatform.h to /usr/include/EGL
Installing /home/chris/Downloads/build/mesa/include/EGL/eglmesaext.h to /usr/include/EGL
Installing /home/chris/Downloads/build/mesa/include/EGL/eglextchromium.h to /usr/include/EGL
Installing /home/chris/Downloads/build/mesa/include/GL/internal/dri_interface.h to /usr/include/GL/internal
Installing /home/chris/Downloads/build/mesa/src/gbm/main/gbm.h to /usr/include
Installing /home/chris/Downloads/build/mesa/src/util/00-mesa-defaults.conf to /usr/share/drirc.d
Installing /home/chris/Downloads/build/mesa/build/meson-private/glesv2.pc to /usr/lib/pkgconfig
Installing /home/chris/Downloads/build/mesa/build/meson-private/dri.pc to /usr/lib/pkgconfig
Installing /home/chris/Downloads/build/mesa/build/meson-private/gbm.pc to /usr/lib/pkgconfig
Installing /home/chris/Downloads/build/mesa/build/meson-private/egl.pc to /usr/lib/pkgconfig
Installing /home/chris/Downloads/build/mesa/build/meson-private/gl.pc to /usr/lib/pkgconfig
Running custom install script '/usr/bin/python /home/chris/Downloads/build/mesa/bin/install_megadrivers.py /home/chris/Downloads/build/mesa/build/src/mesa/drivers/dri/libmesa_dri_drivers.so /usr/lib/dri swrast_dri.so'
Running custom install script '/usr/bin/python /home/chris/Downloads/build/mesa/bin/install_megadrivers.py /home/chris/Downloads/build/mesa/build/src/gallium/targets/dri/libgallium_dri.so /usr/lib/dri'
FAILED: meson-install
/usr/bin/meson install --no-rebuild
ninja: build stopped: subcommand failed.

It looks like there is an argument missing to the install_megadrivers.py script for the libgallium_dri.so file. Can you offer advice on how to fix this error please? I am new to the Meson build system, so don't know where the install script is taking values from.

kusma commented 4 years ago

Sounds like you're trying to use the wrong Mesa branch. We don't have a grate driver upstream, only our own fork. And that driver isn't useful for much more than running glxgears.
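A quick way to confirm this before configuring is to check whether the checked-out branch lists grate among its Meson option choices. The sketch below is only a rough check: the helper name has_driver_choice is made up for illustration, and it assumes the choices appear as single-quoted strings in meson_options.txt.

```shell
# Hypothetical helper: succeed if the options file mentions the driver
# as a single-quoted choice, e.g. 'grate'.
# $1 = path to meson_options.txt, $2 = driver name
has_driver_choice() {
    grep -q -- "'$2'" "$1"
}

# Example use from the top of a Mesa checkout:
#   has_driver_choice meson_options.txt grate || echo "grate is not in this branch"
```

If the check fails, switching branches is likely a better fix than editing the options file by hand, since the driver sources themselves would be missing too.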

emulti commented 4 years ago

Thanks, seems it's the Mesa 19.3 branch that contains the grate driver. I will try and build that one. The last build I did was 13 May 2019 when Autotools were still used. I am happy to help with testing if there is interest. The AC100 is a nice little device though constrained with RAM and storage.

digetx commented 4 years ago

The 19.3 branch is the most up-to-date one, and indeed you can only run glxgears using the current Mesa driver. Besides 3D, Mesa is also useful for libvdpau-tegra, because the libvdpau core uses DRI to retrieve the VDPAU driver name; otherwise you'll need to specify the driver name manually in environment variables.

Help with the testing is much appreciated! And you can do quite a lot of things on the AC100 without a 3D driver!

All mobile devices are very resource-constrained, and optimizing everything is a big part of the development fun :)

emulti commented 4 years ago

Thanks for the info. I can confirm the 19.3 branch of Mesa builds fine, along with the other components including libvdpau-tegra. Maybe put a note on the wiki page to use the 19.3 branch rather than master? I also built the libraries in 'Grate' which get installed in /usr/local/lib by default. Is there an associated test application? The AC100 hardware is very nice, well balanced with a good keyboard. It runs nicely with i3 and lightweight apps like Sylpheed, Gnumeric, Abiword and company, and I got rid of a lot of bloat to make space on the eMMC. There are some kernel oopses during boot associated with clk.c, and the nvec keyboard/touchpad doesn't work on every boot. Nothing to do with Xorg/Mesa of course, but do you know where bug reports can be submitted?

digetx commented 4 years ago

I guess it would be better to replace the master branch with 19.3. Actually, I was going to update master some time ago, but haven't got to it yet. It will be fixed soon, thank you for bringing attention to it!

The 'Grate' test applications aren't installed, but you can run them manually by executing tests/grate/* tests/host1x/*. Also, there is no need to install the libraries because they are not used anywhere. There was an intention to use libgrate in the past, but then plans changed.

The vanilla upstream 5.7 kernel is known to produce the clk warnings; they are harmless and will eventually be fixed once the patches are backported from 5.8.

The NVEC driver isn't actively maintained, so it would be more productive if you could submit patches instead of bug reports :)

emulti commented 4 years ago

Thanks again. I built the grate-driver kernel (5.8.0-rc4...) and that doesn't have the problem with clk.c oops or the nvec failing to reset/touchpad freezing. I also rebuilt the libdrm, xf86-opentegra, mesa and libvdpau-grate packages. While I would be happy to submit patches sadly it's beyond my technical capability.

However, I did find an issue with the 5.8.0-rc4 kernel on AC100. It is not present with mainline 5.7.8.

After a cold start from power off, Xorg fails to start with repeated messages:

Jul 18 09:22:23 alarm kernel: [drm] tegra_drm_sched_timedout_job: 3d channel: pipes 0x2 (process:Xorg pid:616)
Jul 18 09:22:23 alarm kernel: [drm:tegra_drm_sched_timedout_job] ERROR 3d channel: pipes 0x2 (process:Xorg pid:616)
Jul 18 09:22:23 alarm kernel: tegra-gr3d 54180000.gr3d: [drm:tegra_drm_sched_timedout_job] resetting hardware
Jul 18 09:22:23 alarm kernel: [drm] tegra_drm_sched_timedout_job: 3d channel: pipes 0x2 (process:Xorg pid:616)
Jul 18 09:22:23 alarm kernel: [drm:tegra_drm_sched_timedout_job] ERROR 3d channel: pipes 0x2 (process:Xorg pid:616)
Jul 18 09:22:23 alarm kernel: tegra-gr3d 54180000.gr3d: [drm:tegra_drm_sched_timedout_job] resetting hardware
Jul 18 09:22:24 alarm kernel: [drm] tegra_drm_sched_timedout_job: 2d channel: pipes 0x1 (process:Xorg pid:616)
Jul 18 09:22:24 alarm kernel: [drm:tegra_drm_sched_timedout_job] ERROR 2d channel: pipes 0x1 (process:Xorg pid:616)
Jul 18 09:22:24 alarm kernel: tegra-gr2d 54140000.gr2d: [drm:tegra_drm_sched_timedout_job] resetting hardware
Jul 18 09:22:24 alarm kernel: [drm] tegra_drm_sched_timedout_job: 2d channel: pipes 0x1 (process:Xorg pid:616)
Jul 18 09:22:24 alarm kernel: [drm:tegra_drm_sched_timedout_job] ERROR 2d channel: pipes 0x1 (process:Xorg pid:616)
Jul 18 09:22:24 alarm kernel: tegra-gr2d 54140000.gr2d: [drm:tegra_drm_sched_timedout_job] resetting hardware
Jul 18 09:22:24 alarm kernel: [drm] tegra_drm_sched_timedout_job: 2d channel: pipes 0x1 (process:Xorg pid:616)

In the dmesg in this state there is this message:

tegra-mc 7000f000.memory-controller: host1xdmar: DMA blocked
tegra-mc 7000f000.memory-controller: host1xdmar: read @0xb565fa30: EMEM address decode error (EMEM decode error)

After a warm reboot, Xorg starts fine. The issue only happens when starting from power off. I am wondering if it is related to how u-boot (2013-07) initialises hardware, and if it is related to the inability to power off completely with poweroff command in recent kernels.

On a separate question, I tried running the tests in grate and host1x. Most (gr3d) seem to fail with:

INFO: x11_overlay_create:39 overlay unsupported
ERROR: grate_overlay_create: host1x_overlay_create() failed: -1

dmesg-cold.txt dmesg-warm.txt

Xorg.0.cold.log Xorg.0.warm.log

digetx commented 4 years ago

Thank you very much for the report!

Could you please try the recent grate-kernel update? I added this change https://github.com/grate-driver/linux/commit/15996b0c71b192e95961a251d987cef9d5867ff0. It should be a kernel driver problem which pops up only if the hardware is in a certain state during boot. This problem was already reported for the AC100 not long ago and I couldn't reproduce it.

The Host1x driver doesn't support recovering from a blocked DMA. So the hardware gets reset, but DMA stays blocked until the warm reboot, which is why it works after a warm reboot.
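To tell quickly whether a given boot hit this state, a saved kernel log can be searched for the memory-controller message from the report. This is a minimal sketch: the function name is invented here, and the pattern is simply the literal string quoted in the report, so other kernel versions may word it differently.

```shell
# Hypothetical helper: succeed if a saved kernel log contains the
# "DMA blocked" message for the host1x DMA read client.
# $1 = path to a saved log, e.g. produced with: dmesg > dmesg-cold.txt
host1x_dma_blocked() {
    grep -q 'host1xdmar: DMA blocked' "$1"
}

# Example:
#   host1x_dma_blocked dmesg-cold.txt && echo "host1x DMA was blocked on this boot"
```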

The overlay unsupported message and the error that follows are expected when running the tests under Xorg; you may ignore them. Could you please clarify what you mean by seem to fail? Do you get any other errors?

emulti commented 4 years ago

I'll update the kernel on AC100 and test again tonight if possible. I should not have said 'seem'. The following are the results, running from the grate folder with tests/xxx/yyy so the files in asm are found:

host1x:
gr2d-blit: INFO: main:175: test passed
gr2d-clear: displays magenta window, and 'overlay support missing'
gr2d-context: see attachment
gr3d-triangle: displays gamut triangle, then magenta window, and 'overlay support missing'

grate:
clear: magenta window, ERROR: grate_overlay_create: host1x_overlay_create() failed: -1
cube, cube-textured: 'CgDrv_Create: BLOB compiler is unavailable' (x2) 'grate_program_new() failed'
cube-textured2: displays jailbars scene keyed over floating cubes behind
cube-textured3: fails after loading to '24%' with ERROR: host1x bo_create_helper:237: host1x_bo_create failed; ERROR:grate_create_texture: failed to allocate texture 2048x2048 bpp:8 pitch:4096; Segmentation fault (AC100 display too small?)
interactive: floating cube responds correctly to keyboard commands, in 'face cull mode' the text on cube is mirrored except in 'none', can be flipped with '2' key, front face direction. '3' depth function blanks cube in some modes, maybe as intended.
quad: 'CgDrv_Create: BLOB compiler is unavailable' (x2) 'grate_program_new() failed'
stencil: "Stencil test works!"
texture-filter: works, stepping pixelation on cube
texture-wrap: runs, not sure what is expected behaviour!
triangle: 'CgDrv_Create: BLOB compiler is unavailable' (x2) 'grate_program_new() failed'
triangle-rotate: 'CgDrv_Create: BLOB compiler is unavailable' (x2) 'grate_program_new() failed'

gr2-context.txt

digetx commented 4 years ago

All results look good! Please ignore the failed tests: they depend on extra bits and are thus expected to fail in a default userspace/kernel configuration.

The cube-textured3 requires a lot of free contiguous memory, you may try to get it working by adding cma=128M (maybe even 256M) into the kernel's cmdline arguments.
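Whether a cma= setting took effect can be checked after boot: kernels built with CMA support report CmaTotal and CmaFree in /proc/meminfo. The sketch below just filters those two fields; the function name cma_info is made up, and the optional file argument exists only so the function can be pointed at a saved copy.

```shell
# Hypothetical helper: print the CmaTotal/CmaFree lines (values in kB)
# from a meminfo-style file. Defaults to /proc/meminfo.
cma_info() {
    awk '/^Cma(Total|Free):/ { print $1, $2, $3 }' "${1:-/proc/meminfo}"
}

# Example, after booting with cma=128M on the kernel command line:
#   cma_info
```

If nothing is printed, the running kernel most likely was not built with CMA enabled.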

emulti commented 4 years ago

Does allocating a larger CMA reduce the amount of memory for non-GPU-related applications? If it does, I was thinking of trying to reduce it from the current 64M, since memory is short on the AC100.

I built the linux-grate 5.8.0-rc5-g81e240239be6 (after yesterday's commits) Unfortunately on cold boot it hangs, before systemd journaling starts, and I don't have a serial console connection to capture the issue. But we probably know what it's from... dmesg-5.8.0-rc5.txt

On warm boot there is an issue with a duplicate regulator name regulator.5:cpu0 (from the device tree, maybe cpufreq/DVFS related?) as in the attached dmesg. Strange, Device tree binary is different size but there were no new dts commits I have found, maybe I am looking in the wrong place.

digetx commented 4 years ago

A larger CMA shouldn't reduce the amount of memory. It's reusable (system) memory that can be swapped out during a contiguous allocation; it's not a carveout.

Hmm.. now I'm also seeing that there is some kernel problem using next-20200717:

 BUG: sleeping function called from invalid context at kernel/locking/mutex.c:281
 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 12, name: kworker/0:1
 CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.8.0-rc5-next-20200717-00162-g4bcedc60754a #2833
 Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
 Workqueue: rcu_gp srcu_invoke_callbacks
 [<c010dad5>] (unwind_backtrace) from [<c0109481>] (show_stack+0x11/0x14)
 [<c0109481>] (show_stack) from [<c0469da9>] (dump_stack+0x8d/0x9c)
 [<c0469da9>] (dump_stack) from [<c013e67d>] (___might_sleep+0xed/0x11c)
 [<c013e67d>] (___might_sleep) from [<c09b8d6d>] (mutex_lock+0x1d/0x54)
 [<c09b8d6d>] (mutex_lock) from [<c055bd03>] (device_del+0x2b/0x2a4)
 [<c055bd03>] (device_del) from [<c055bfd9>] (__device_link_free_srcu+0x41/0x50)
 [<c055bfd9>] (__device_link_free_srcu) from [<c017037f>] (srcu_invoke_callbacks+0x9b/0x100)
 [<c017037f>] (srcu_invoke_callbacks) from [<c0133dbd>] (process_one_work+0x145/0x408)
 [<c0133dbd>] (process_one_work) from [<c0134179>] (worker_thread+0xf9/0x3c4)
 [<c0134179>] (worker_thread) from [<c0138ef3>] (kthread+0x10b/0x13c)
 [<c0138ef3>] (kthread) from [<c010015d>] (ret_from_fork+0x11/0x34)
 Exception stack(0xef159fb0 to 0xef159ff8)

It probably has the same root cause as your regulator issue. I'll check if today's next still has that issue and will ping you once the problem is resolved.

digetx commented 4 years ago

Please give a try to the recent grate-kernel update, I reverted the offending commits.

emulti commented 4 years ago

Yes, the reverted patches (5.8.0-rc6...) put things back as they were (host1xdmar: DMA blocked on first cold boot). In testing with the previous kernel 5.8.0-rc5... I found that the results are not consistent from cold boot. It depends on whether the system was powered off by software (holding down the power button after 'poweroff', because the system is left in an unresponsive state with the power LED still on) or by removing the battery. So the condition of the EC varies. This incomplete powerdown issue has been reported on kernels since at least 5.4, but is not present on 5.1.1, which I have used previously and tested again today. Perhaps this is linked to the incorrect initialisation of host1x. I will try to find out what changed in the nvec code between 5.1 and 5.4, but it doesn't look like much. The issue "sysfs: cannot create duplicate filename '/devices/virtual/devlink/regulator.5:cpu0'" is not present any more in rc6. I changed the nvec driver to built-in rather than a module, so it is loaded much earlier. The uvcvideo driver is also now compiled in.

digetx commented 4 years ago

Thank you for the report! I'll add some debug messages to the host1x driver that may help to figure out what's wrong, and will ping you once the debugging is ready for testing.

digetx commented 4 years ago

@emulti Could you please fetch a recent grate-kernel update and post kernel boot log? I added some debug messages which may shed some light on the host1x problem.

emulti commented 4 years ago

Updated- here are cold and warm boot dmesg

I am tracking down the issue with incorrect power-off using git bisect. It takes a while... Somewhere between 5.4.8 (good) and 5.4.10 (bad) on the linux-stable tree. But the indication is that these issues are not linked, after a 'good' shutdown, the 'Host1x DMA blocked' issue still occurs with linux-grate after a cold boot.
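For reference, a bisect between two stable tags is usually started along these lines. The wrapper name start_bisect is hypothetical; the git bisect subcommands are standard, and the tags in the example are the ones mentioned above. Testing each step (build, boot, try 'poweroff') remains manual.

```shell
# Hypothetical wrapper around the standard git-bisect commands.
# $1 = path to the kernel tree, $2 = known-good rev, $3 = known-bad rev
start_bisect() {
    git -C "$1" bisect start &&
    git -C "$1" bisect bad "$3" &&
    git -C "$1" bisect good "$2"
}

# Example: start_bisect ~/linux-stable v5.4.8 v5.4.10
# Then build and boot each commit that git checks out, test 'poweroff',
# and mark the result with 'git bisect good' or 'git bisect bad' until
# git names the first bad commit. 'git bisect reset' restores the
# original checkout afterwards.
```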

dmesg-grate-warm.txt dmesg-grate-cold.txt

digetx commented 4 years ago

Thank you! I pushed another update to the grate-kernel; the host1x driver now resets the memory client state and there are a couple more messages. Please give it a try and post the cold-boot log.

emulti commented 4 years ago

Here are the cold and warm boot dmesg logs from the master branch as of 4 Aug. One patch has been applied to fix the 'incomplete power off' issue with the AC100: a revert of the offending upstream commit 43cf75d96409a20ef06b756877a2e72b10a026fc, "exit: panic before exit_mm() on global init exit" (21 Dec 2019).

The cold boot log was taken after sudo poweroff following a warm boot; it shows the errors:

tegra-mc 7000f000.memory-controller: host1xdmar: DMA blocked
tegra-mc 7000f000.memory-controller: host1xdmar: read @0xb560fa30: EMEM address decode error (EMEM decode error)

There is also an error tegra2-devfreq: memory controller has no timings which I think is because this unit has the type of DRAM for which no timing info is available for the device tree.

dmesg-grate-cold-0408.txt dmesg-grate-warm-0408.txt poweroff first bad commit.txt

digetx commented 4 years ago

Thanks for the testing!

I don't have any good comments regarding the power-off issue, could be that the offending change unmasks some other problem.

The host1x trouble remains mysterious for now.

@thierryreding @cyndis do you have any idea why host1x isn't idling on AC100 on a cold boot? That's likely to be a bug in the grate-kernel host1x driver, but it's not apparent to me what's wrong.. although the problem is known to exist only on AC100. Does host1x have any register-writes buffering (not mentioned in TRM) that needs to be flushed? The upstream host1x driver also uses a different DMA usage scheme and it could be that the problem isn't visible in upstream because the CDMA limits are set to the push buffer's start/end, while in grate-kernel the addressing is unlimited, hence host1x should silently stop on fetching from a wrong memory address in upstream.

emulti commented 4 years ago

The bad commit causing the power-off issue was identified with a git bisect on the linux-stable tree, which I assume means it is definitely the culprit.

I also built 5.7.9 from the stable tree. Reverting that commit also restores power-off behaviour. When the nvec keyboard is probed an oops in the clk driver is reliably caused, as in attached dmesg. Sorry, I don't know how to analyze the stack traces.

dmesg-5.7.9.txt

The logging is different and includes:

[ 0.085113] tegra20-emc 7000f400.memory-controller: no memory timings for RAM code 1 found in device tree
[ 0.085151] tegra20-emc: probe of 7000f400.memory-controller failed with error -22

linux-grate has:

[ 1.866723] tegra20-emc 7000f400.memory-controller: no memory timings for RAM code 1 found in device tree
[ 1.877055] tegra20-devfreq tegra20-devfreq: memory controller has no timings

Should I build again with DVFS disabled?

digetx commented 4 years ago

Looking at the NVEC driver, I think it's the source of the problem because interrupts might be disabled at the kernel's power-off stage.

https://elixir.bootlin.com/linux/v5.8/source/drivers/staging/nvec/nvec.c#L760 https://elixir.bootlin.com/linux/v5.8/source/drivers/staging/nvec/nvec.c#L273

The Tegra I2C suffered from a similar problem until it got support for atomic transfers, which allow I2C transfers to be made with disabled interrupts.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=ede2299f7101a79fe8610ca0000734c9887ad4b2

Somebody needs to implement the polling transfer mode for NVEC driver and use it for nvec_power_off() in order to resolve the problem.

The backtrace in the 5.7 log is harmless, it's a known problem that eventually should be fixed by backporting fix from 5.8.

The EMC / devfreq messages are also okay, the EMC driver improvement which will silence the error is pending to be upstreamed.

emulti commented 4 years ago

Thanks for your advice! The EC does shut down reliably on power-off (battery doesn't drain, or very slowly) once the commit mentioned is reverted. Because, I assume, the shutdown message gets through (by luck) to the EC before interrupts are disabled. The commit maybe makes that happen earlier, so the EC remains on and isn't listening for a power-button-press to power up again.

I read up on atomic transfers enough to understand that implementing them in the nvec driver is way beyond my capabilities... Would it be enough to implement them just for nvec_power_off and presumably nvec_suspend, and leave the other transfers interrupt-driven?

I do have a trivial but useful patch to nvec.c that unmutes the AC100 internal speakers on Resume. I have no idea how to get that into the kernel though.

I have surprised myself with how much functionality and simultaneous tasks can be done on an AC100 with a dual-core 1Ghz CPU and 512MB of RAM, once bloat is carefully removed.

digetx commented 4 years ago

Yeah, it probably happened to work by luck before. If NVEC driver could re-enable interrupts on power-off, then it could become a one-line fix. Somebody should check the kernel's shutdown code path in order to see if it's a safe thing to do.

Adding a polling alternative to the interrupt-driven code shouldn't be much work; I may try to type up a draft patch sometime later. I don't know whether it's possible to change only nvec_power_off(), because I'm not very familiar with the NVEC. Maybe @paulfertser could help?

Regarding submitting patches upstream, please see https://www.kernel.org/doc/html/latest/process/submitting-patches.html; you may also find video tutorials on YouTube. And of course please feel free to ask any questions on IRC, I'll be glad to help.

Short example of submitting a kernel patch:

# git format-patch -v1 -1 df8476db20b7

# ./scripts/checkpatch.pl --strict v1*

# ./scripts/get_maintainer.pl v1*

# git send-email --smtp-server=smtp.gmail.com --smtp-user=digetx@gmail.com --smtp-encryption=tls --smtp-server-port=587 --suppress-cc=all --to 'Thierry Reding <thierry.reding@gmail.com>' --to 'Jonathan Hunter <jonathanh@nvidia.com>'  --cc 'linux-tegra@vger.kernel.org' --confirm=always v1*

Could you please try the recent grate-kernel update? I added a DMA addressing limit for Host1x https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/channel_hw.c#L288, maybe it will help. Although, in the best case it will put the grate-kernel driver on par with the upstream driver; the real origin of the problem will remain unknown.

emulti commented 4 years ago

Thanks for info on submitting patches etc. After building grate-kernel update, the "tegra-mc 7000f000.memory-controller: host1xdmar: DMA blocked" is still present after cold boot. I also noticed the Xorg X-video extension is most often not initialised correctly, ("opentegra(0) xv.c: Xorg.0.log_bad_0807.txt

This after a warm reboot. Maybe one time in ten it will initialise correctly: Xorg.0.log_good_0807.txt

On one occasion Xorg failed to start ("VGA arbiter: cannot open kernel arbiter, no multi-card support") but I think this could be a systemd or configuration issue ((EE) systemd-logind: failed to get session: PID 475 does not belong to any known session) Xorg.0.log_fail.txt

I noticed a new Staging driver is in WIP for an Acer Iconia A500 EC, an ENE KB930 with custom firmware according to the file. This is actually broadly similar to the AC100 EC, which is an ENE KB926 with Toshiba (Compal?) firmware. In the Toshiba case the KB926 handles the keyboard and touchpad rather than the T250. But maybe some parts of the code can point to how to improve the Nvec driver?

digetx commented 4 years ago

Thanks, it's a good sign that the DMA isn't fixed, meaning that it should be a local driver issue. I pushed an update to the grate-kernel which dumps the channels' hardware state on boot; could you please show the cold boot log?

The Xv problem is odd, I have never seen it and am not sure how it could happen. Could you please give detailed steps to reproduce it?

The Acer EC uses the Tegra I2C driver for the I2C transfers, hence it uses an atomic transfer for sending the power-off command. The NVEC driver should do the same thing as Tegra I2C, i.e. poll the interrupt status on shutdown instead of waiting for an interrupt event.

emulti commented 4 years ago

I'll rebuild tomorrow and send the new log. By local driver issue, do you mean a config mistake in the kernel? The config I am using is based on tegra_defconfig as attached.

The XV initialization issue occurs just by starting Xorg using xinit/.xinitrc/startx. It is the same whether Xorg is running as root or as the unprivileged user. Just sometimes, it inits correctly, with no other changes in configuration, maybe a warm reboot. 'xvinfo' shows 'no adaptors present' for screen 0.

I fixed the session issue with systemd-logind by adding '-keeptty' to the Xserver config file. It was linked to the communication between systemd-logind and Xorg-server introduced in 1.16 to allow Xorg to run as an unprivileged user. The user is in the 'video' group, and permissions on /dev/dri/card0 are 660, owner root, group video.

config.txt

I'll see if I can figure out how to adapt the polling method used in the Acer EC driver to work in NVEC.

digetx commented 4 years ago

I meant it should be a grate-kernel driver bug.

Could you reproduce the Xv problem by starting Xorg without using startx? Just by running Xorg from root.

Unprivileged Xorg doesn't work well on many Linux distros, I only managed to load unprivileged Xorg on Debian and even then it doesn't work from ssh session.

emulti commented 4 years ago

Running Xorg from root with sudo Xorg: the XVideo adaptor is initialised correctly every time. But if -keeptty is appended (to allow systemd-logind to control the session) it fails. Same with a DM like lxdm: XVideo is OK every time. But from startx or xinit it fails (almost) every time, both with Xorg running as root and as a user. I haven't been able to find out the reason. The environment seems to be the same, and the same files (.xinitrc, .xserverrc and .xprofile) are sourced with a DM and with startx/xinit. I am guessing it is to do with systemd-logind and permissions. Under the DM systemd-logind complains it can't keep track of the session: "PID xxx does not belong to any known session"

emulti commented 4 years ago

Taking the various init, profile files etc. out of the equation:

sudo Xorg
...
(II) systemd-logind: logind integration requires -keeptty and -keeptty was not provided, disabling logind integration
...

(II) opentegra(0): XV adaptor initialized

But

sudo Xorg -keeptty (or sudo Xorg vt$XDG_VTNR) : 
...
systemd-logind: took control of session /org/freedesktop/login1/session/c1
...

(EE) opentegra(0): xv.c:1749/TegraXvGetDrmPlaneProperty(): Failed to get "CRTC_ID" property
(EE) opentegra(0): xv.c:1750/TegraXvGetDrmPlaneProperty(): Available properties:
(EE) opentegra(0): xv.c:1757/TegraXvGetDrmPlaneProperty():         "type"
(EE) opentegra(0): xv.c:1757/TegraXvGetDrmPlaneProperty():         "IN_FORMATS"
(EE) opentegra(0): xv.c:1757/TegraXvGetDrmPlaneProperty():         "zpos"
(EE) opentegra(0): xv.c:1757/TegraXvGetDrmPlaneProperty():         "YUV to RGB CSC"
(EE) opentegra(0): xv.c:1757/TegraXvGetDrmPlaneProperty():         "rotation"
(EE) opentegra(0): xv.c:1757/TegraXvGetDrmPlaneProperty():         "colorkey.plane_mask"
(EE) opentegra(0): xv.c:1757/TegraXvGetDrmPlaneProperty():         "colorkey.mode"
(EE) opentegra(0): xv.c:1757/TegraXvGetDrmPlaneProperty():         "colorkey.mask"
(EE) opentegra(0): xv.c:1757/TegraXvGetDrmPlaneProperty():         "colorkey.min"
(EE) opentegra(0): xv.c:1757/TegraXvGetDrmPlaneProperty():         "colorkey.max"
(EE) opentegra(0): xv.c:2011/TegraXvScreenInit(): XV initialization failed
(II) opentegra(0): VBLANK initialized
(II) opentegra(0): [DRI2] Setup complete
(II) opentegra(0): [DRI2]   DRI driver: tegra
(II) opentegra(0): [DRI2]   VDPAU driver: tegra
(II) opentegra(0): DRI2 initialized
(EE) opentegra(0): failed to set mode: Permission denied
(II) Initializing extension Generic Event Extension

It appears that systemd-logind and its partner systemd-pam are somehow restricting access to the hardware if they grab control of the session.
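Since the result differs between launch methods, it can help to classify each Xorg log automatically rather than by eye. The sketch below relies only on the two opentegra messages quoted above; the function name xv_status is made up, and the log location is an assumption (typically /var/log/Xorg.0.log for a root server, or ~/.local/share/xorg/Xorg.0.log for a rootless one).

```shell
# Hypothetical helper: report the Xv state recorded in an Xorg log.
# Prints "ok", "failed", or "unknown" if neither message is present.
xv_status() {
    if grep -q 'XV adaptor initialized' "$1"; then
        echo ok
    elif grep -q 'XV initialization failed' "$1"; then
        echo failed
    else
        echo unknown
    fi
}

# Example: xv_status /var/log/Xorg.0.log
```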

emulti commented 4 years ago

Cold and warm boot logs with linux-grate kernel from 20200810 are attached. dmesg_cold_5.8.0-20200810.txt dmesg_warm_5.8.0-20200810.txt

edit, adding a third log of warm boot after cold boot+reboot, the 'random' values for 'dmaget' become cleaned up. dmesg_warm2_5.8.0-20200810.txt

digetx commented 4 years ago

Thank you for the testing! I still don't know why Host1x is misbehaving on cold boot on the AC100, but I pushed another change to the grate-kernel that stops CDMA after initialization. Could you please give it a try?

I couldn't reproduce the Xv problem on Ubuntu 20.04. Could you please tell what distro you're using?

(II) systemd-logind: took control of session /org/freedesktop/login1/session/c1
...
(0): XV adaptor initialized

emulti commented 4 years ago

I have Arch Linux Arm installed to emmc. By default, Xorg is run rootless when starting from xinit or startx. But I have tried overriding this (/etc/X11/Xwrapper.config) and it makes no difference whether root or rootless. It seems when systemd takes over the session then there is an issue with opentegra reading card properties.

I corrected an error in directory permissions on /usr/share/polkit-1/rules.d that was preventing polkit loading default rules, (logged as a bug on Arch site) but I don't think it's linked with the XV issue. I'll keep investigating. DRI devices are tagged as 'uaccess' in udev so should be fully accessible, but there is some method that systemd-logind uses to revoke that access when a user session is not 'Active'. Does the opentegra driver run as group 'video'? This is the contents of /dev/dri:

drwxr-xr-x  2 root root         80 Aug 11 19:56 by-path
crw-rw----+ 1 root video  226,   0 Aug 11 19:56 card0
crw-rw-rw-  1 root render 226, 128 Aug 11 19:56 renderD128

It's interesting that /dev/dri/card0 has no user-level access.
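One detail in the listing is worth a second look: the trailing + on card0's mode string means the node carries an ACL in addition to the classic mode bits, and ACLs are how the udev 'uaccess' tag normally grants access to the user of the active session. The ACL itself can be viewed with getfacl /dev/dri/card0 (from the acl package). A small sketch for spotting the marker follows; the helper name is made up for illustration.

```shell
# Hypothetical helper: succeed if an 'ls -l' mode string ends in '+',
# which ls uses to flag that the file has an ACL.
has_acl_marker() {
    case "$1" in
        *+) return 0 ;;
        *)  return 1 ;;
    esac
}

# Example:
#   has_acl_marker "$(ls -ld /dev/dri/card0 | cut -d' ' -f1)" && getfacl /dev/dri/card0
```

If the ACL lists the logged-in user only while the session is active, that would be consistent with logind revoking access for sessions it does not consider active.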

Here are the boot logs with latest update. On cold boot the tegra-mc: DMA blocked message is gone and Xorg can be started. Warm boot is also OK. The debug info from the last update isn't shown now. dmesg_cold_5.8.0-20200811.txt dmesg_warm_5.8.0-20200811.txt

digetx commented 4 years ago

Xv needs access to the DRM atomic UAPI; I guess the permission is getting dropped somehow. It could be a systemd or Xorg issue, or some configuration problem. For starters I need to reproduce the problem and will try Arch; meanwhile you may check whether the problem exists on Ubuntu or Debian.

I dropped the boot logs because enough logs have been collected, thank you very much! It's great that the remedy has been found!

digetx commented 4 years ago

@emulti Hello! I got around to trying Arch and managed to reproduce and fix the Xv problem! It's fixed now in the Opentegra driver https://github.com/grate-driver/xf86-video-opentegra/commit/3c22a038c175e0e117371379c2d40041f931cf6f

I also fixed the real root cause of the "DMA blocked" bug after finding that the AC100 fix broke Nexus 7 in a similar way. If you try a recent grate-kernel and the AC100 fails again on a cold boot, then please let me know!

emulti commented 4 years ago

Thanks digetx, I will test this out in the next few days. Looks promising from the commit. As long as a cold reboot is not done, my AC100 has been very stable for the last month, with patch to unmute speakers after resume.