vpelletier / python-functionfs

Pythonic API for linux's functionfs
GNU General Public License v3.0
40 stars 13 forks source link

Raspberry Pi 4 freeze after running device.py #16

Closed forderud closed 2 years ago

forderud commented 3 years ago

I'm experiencing problems with my Raspberry Pi 4 freezing after running the device.py sample. The script itself runs fine and completes with a Gadget exiting. message to the console. However, the OS usually freezes a few seconds later, with the following symptoms:

My setup: Latest Raspberry Pi OS (32bit) that is based on Linux 5.4.83-v7l+. The script is interpreted by the in-built python3 with functionfs 0.5.0 installed using pip3.

I'm not observing any problems when playing with the https://randomnerdtutorials.com/raspberry-pi-zero-usb-keyboard-hid sample, so I have a feeling that the problem is somehow tied to functionfs.

vpelletier commented 3 years ago

What would help most would be to get a kernel panic (which is likely what's happening here) message.

Getting them from a graphical desktop may not be easy, as the panic may halt everything before the UI had time to output. So a first step would be to switch to a text terminal, ideally configured with a large enough resolution so the panic message beginning does not scroll out of view. Then, you can trigger the bug and see what the kernel will be complaining about.

The next possibility (which would not require switching to a text terminal) if you have another PI (or other serial TTL capable computer, like a FTDI USB-to-serial cable) is to connect to the TTL serial port which should be on the GPIO header, to tell the kernel of the PI4 to output its kernel messages there if it is not already the case, and to have the connected machine listen to it (I am partial to screen /dev/ttyUSB0 115200, but there are plenty of others).

This should be enough to see what is going wrong. Which will certainly be a kernel bug (on the principle that no userland program, however buggy, should be able to crash a machine...), but it may also tell me if python-functionfs could do things differently to avoid upsetting the kernel (it has been my experience that UDCs have quite a temperament... I blame this on not enough people experimenting with the wonders of USB gadgetry).

I would also recommend trying to then reproduce with a more recent kernel, which will likely mean either building your own (which I remember being a bit painful with raspbian, but this was several years ago - I remember not being able to get a working .config) or switching to another distribution. I can recommend raspi Debian sid, as this is what I've been using to work on my smartcard implementation, which has involved a lot of run/exit cycles without having to reboot. But it will certainly lack all the turn-key goodness of Raspbian.

If you want to go one level deeper in the analysis over TTL serial, you can enable kgdb on the PI and live-debug the kernel with gdb on the connected machine. Here the challenge (as I remember it) is initially to get a linux image that gdb is happy with (so a non-gzipped image).

DBlauhut commented 3 years ago

Hello, i'm working with an orange pi 4 with a yocto based linux 4.19 system. i have a similar issue. my c application works fine with functionfs, but the python3-functionfs application triggers a kernel panic after the termination of the application via the signal SIGINT. i have started an investigation and found a difference in the kernel logs. The c application generated the following messages in the log:

[ 531.443132] unloading [ 531.502274] ffs_data_put(): freeing

those entries are swapped in the log of the python3-functionfs application which triggers the panic:

[ 722.635668] ================================================================== [ 722.635733] BUG: KASAN: use-after-free in refcount_sub_and_test_checked+0x1c/0xa8 [ 722.635751] Read of size 4 at addr ffffffc0be374680 by task python3/1374 [ 722.635763] [ 722.635785] CPU: 5 PID: 1374 Comm: python3 Tainted: G O 4.19.193-rockchip-standard #1 [ 722.635800] Hardware name: OrangePi 4 (DT) [ 722.635813] Call trace: [ 722.635834] dump_backtrace+0x0/0x244 [ 722.635852] show_stack+0x24/0x30 [ 722.635873] dump_stack+0x9c/0xc8 [ 722.635893] print_address_description+0x60/0x248 [ 722.635913] kasan_report+0x140/0x190 [ 722.635931] kasan_report+0xc/0x14 [ 722.635948] asan_load4+0x24/0x9c [ 722.635970] refcount_sub_and_test_checked+0x1c/0xa8 [ 722.636016] refcount_dec_and_test_checked+0x14/0x1c [ 722.636035] eventfd_ctx_put+0x20/0x3c [ 722.636070] ffs_data_clear+0x164/0x1bc [usb_f_fs] [ 722.636104] ffs_data_reset+0x20/0x110 [usb_f_fs] [ 722.636130] ffs_data_closed+0xd0/0x100 [usb_f_fs] [ 722.636163] ffs_fs_kill_sb+0x34/0x40 [usb_f_fs] [ 722.636181] deactivate_locked_super+0x58/0x90 [ 722.636197] deactivate_super+0x6c/0x78 [ 722.636216] cleanup_mnt+0x78/0xb0 [ 722.636246] cleanup_mnt+0x20/0x2c [ 722.636263] task_work_run+0xa4/0xe0 [ 722.636294] do_notify_resume+0x110/0x13c [ 722.636310] work_pending+0x8/0x10 [ 722.636332] [ 722.636345] Allocated by task 1379: [ 722.636366] kasan_kmalloc.isra.0.part.0+0x4c/0xfc [ 722.636393] kasan_kmalloc.isra.0+0x80/0x9c [ 722.636410] kasan_kmalloc+0xc/0x14 [ 722.636436] kmem_cache_alloc_trace+0x1c4/0x224 [ 722.636453] do_eventfd+0x54/0xec [ 722.636471] arm64_sys_eventfd2+0x38/0x48 [ 722.636498] el0_svc_common.constprop.0+0x138/0x1e4 [ 722.636515] el0_svc_handler+0x8c/0xb0 [ 722.636539] el0_svc+0x8/0xc [ 722.636549] [ 722.636563] Freed by task 1389: [ 722.636598] __kasan_slab_free+0x13c/0x1e8 [ 722.636617] kasan_slab_free+0x10/0x18 [ 722.636641] kfree+0x19c/0x23c [ 722.636658] eventfd_ctx_put+0x30/0x3c [ 722.636686] eventfd_release+0x40/0x50 [ 722.636701] fput+0x17c/0x280 [ 722.636715] __fput+0x20/0x2c [ 722.636730] task_work_run+0xa4/0xe0 [ 722.636757] do_exit+0x5f0/0xd6c [ 722.636775] do_group_exit+0x84/0x108 [ 722.636802] get_signal+0x324/0xa10 [ 722.636820] do_signal+0x120/0x338 [ 722.636846] do_notify_resume+0xe4/0x13c [ 722.636861] work_pending+0x8/0x10 [ 722.636880] [ 722.636895] The buggy address belongs to the object at ffffffc0be374680 [ 722.636895] which belongs to the cache kmalloc-128 of size 128 [ 722.636932] The buggy address is located 0 bytes inside of [ 722.636932] 128-byte region [ffffffc0be374680, ffffffc0be374700) [ 722.636949] The buggy address belongs to the page: [ 722.636967] page:ffffffbf02f8dd00 count:1 mapcount:0 mapping:ffffffc000202280 index:0x0 compound_mapcount: 0 [ 722.636999] flags: 0x10200(slab|head) [ 722.637022] raw: 0000000000010200 dead000000000100 dead000000000200 ffffffc000202280 [ 722.637050] raw: 0000000000000000 0000000000150015 00000001ffffffff 0000000000000000 [ 722.637064] page dumped because: kasan: bad access detected [ 722.637084] [ 722.637095] Memory state around the buggy address: [ 722.637112] ffffffc0be374580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 722.637137] ffffffc0be374600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 722.637154] >ffffffc0be374680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 722.637175] ^ [ 722.637190] ffffffc0be374700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 722.637207] ffffffc0be374780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 722.637219] ================================================================== [ 722.637245] Disabling lock debugging due to kernel taint [ 722.637259] ffs_data_put(): freeing [ 722.637932] unloading

i will continue my investigation.

mcuee commented 3 years ago

I can reproduce the issue with my OrangePi Zero but it is with an old kernel. I will try with a new kernel to see how it goes.

mcuee@orangepizero:~$ uname -a
Linux orangepizero 4.14.70-sunxi #265 SMP Wed Sep 19 10:01:19 CEST 2018 armv7l GNU/Linux
mcuee commented 3 years ago

My temporary workaround is to have another terminal and kill the processes using kill.

mcuee commented 3 years ago

I believe it is a kernel version issue. I just set up a newer version of Armbian Linux on another board -- OrangePi Zero Plus and it works fine.


mcuee@orangepizeroplus:~/python-functionfs$ uname -a
Linux orangepizeroplus 5.10.60-sunxi64 #21.08.1 SMP Wed Aug 25 18:29:57 UTC 2021 aarch64 GNU/Linux

mcuee@orangepizeroplus:~/python-functionfs$ sudo python3 ./examples/usbcat/device.py
epoll: fd 3 got event 1
onBind
epoll: fd 3 got event 1
onEnable
epoll: fd 3 got event 1
aio read completion received 100 bytes
^Cmcuee@orangepizeroplus:~/python-functionfs$
mcuee@orangepizeroplus:~/python-functionfs$ python3 --version
Python 3.9.2
mcuee@orangepizeroplus:~/python-functionfs$ sudo python3 ./examples/hid/device.py
Gadget ready, waiting for function to exit.
onEnable called

^CGadget exiting.
mcuee@orangepizeroplus:~/python-functionfs$ uname -a
Linux orangepizeroplus 5.10.60-sunxi64 #21.08.1 SMP Wed Aug 25 18:29:57 UTC 2021 aarch64 GNU/Linux
mcuee commented 3 years ago

I can reproduce the issue with my OrangePi Zero but it is with an old kernel. I will try with a new kernel to see how it goes.

mcuee@orangepizero:~$ uname -a
Linux orangepizero 4.14.70-sunxi #265 SMP Wed Sep 19 10:01:19 CEST 2018 armv7l GNU/Linux

Once I upgrade to the latest Armbian build for my OragePi Zero, the issue goes away.

mcuee@orangepizero:~/gadget/python-functionfs$ sudo python3 ./examples/usbcat/device.py 
epoll: fd 3 got event 1
onBind
epoll: fd 3 got event 1
onEnable
epoll: fd 3 got event 1
aio read completion received 100 bytes
^C
mcuee@orangepizero:~/gadget/python-functionfs$ sudo python3 ./examples/hid/device.py 
Gadget ready, waiting for function to exit.
onEnable called

^CGadget exiting.
mcuee@orangepizero:~/gadget/python-functionfs$ uname -a
Linux orangepizero 5.10.60-sunxi #21.08.1 SMP Wed Aug 25 18:19:32 UTC 2021 armv7l GNU/Linux
vpelletier commented 3 years ago

In my experience, it is indeed best to be on as recent a kernel as possible. Just by writing the code in this project I have triggered a good share of kernel bugs on UDCs which are (I believe) rather common (DWC2 and DWC3), and a few more when I used AIO instead of threads & blocking IO.

Which is not to say that my code is perfect, of course (my current work on python-libusb1 and the symptom of an unexpected log line order is making my "python garbage-collector" alarm blink), just that it does not surprise me that the experience differs a lot depending on the exact kernel version.

mcuee commented 3 years ago

I will try out on my Raspberry Pi 400 as well to see if I can reproduce the issue on my Raspberry Pi 400 with latest release of Raspbian 32bit, or other 64bit Linux.

mcuee commented 3 years ago

Very strange that it does not work under Ubuntu 64bit for Raspberry Pi 400.


mcuee in rpi400Ubuntu in python-functionfs on  master via 🐍 v3.9.5 ❯ sudo python3 ./examples/usbcat/device.py
Traceback (most recent call last):
  File "/home/mcuee/build/gadget/python-functionfs/./examples/usbcat/device.py", line 284, in <module>
    main()
  File "/home/mcuee/build/gadget/python-functionfs/./examples/usbcat/device.py", line 254, in main
    with GadgetSubprocessManager(
  File "/usr/local/lib/python3.9/dist-packages/functionfs-0.5.0+1.g848e79b-py3.9.egg/functionfs/gadget.py", line 481, in __enter__
  File "/usr/local/lib/python3.9/dist-packages/functionfs-0.5.0+1.g848e79b-py3.9.egg/functionfs/gadget.py", line 254, in __enter__
  File "/usr/local/lib/python3.9/dist-packages/functionfs-0.5.0+1.g848e79b-py3.9.egg/functionfs/gadget.py", line 316, in __enter
  File "/usr/local/lib/python3.9/dist-packages/functionfs-0.5.0+1.g848e79b-py3.9.egg/functionfs/gadget.py", line 267, in mkdir
FileNotFoundError: [Errno 2] No such file or directory: '/sys/kernel/config/usb_gadget/g_bh_52il2/functions/ffs.usb0'
vpelletier commented 3 years ago
  File "/usr/local/lib/python3.9/dist-packages/functionfs-0.5.0+1.g848e79b-py3.9.egg/functionfs/gadget.py", line 267, in mkdir
FileNotFoundError: [Errno 2] No such file or directory: '/sys/kernel/config/usb_gadget/g_bh_52il2/functions/ffs.usb0'

This error during mkdir makes me think that it is actually the automated loading of g_ffs.ko which would be failing.

mcuee commented 3 years ago

This error during mkdir makes me think that it is actually the automated loading of g_ffs.ko which would be failing.

Thanks. For this paticular kernel, there is no g_ffs.ko. So I guess python-functionfs will not work.

mcuee in rpi400Ubuntu in drivers/usb/gadget🔒 ❯ ls function 
u_audio.ko    usb_f_ecm.ko         usb_f_mass_storage.ko  usb_f_obex.ko     usb_f_rndis.ko   usb_f_tcm.ko   u_serial.ko
u_ether.ko    usb_f_ecm_subset.ko  usb_f_midi.ko          usb_f_phonet.ko   usb_f_serial.ko  usb_f_uac2.ko
usb_f_acm.ko  usb_f_hid.ko         usb_f_ncm.ko           usb_f_printer.ko  usb_f_ss_lb.ko   usb_f_uvc.ko
mcuee in rpi400Ubuntu in drivers/usb/gadget🔒 ❯ ls legacy 
g_acm_ms.ko  g_audio.ko  g_dbgp.ko   g_hid.ko           g_midi.ko   g_ncm.ko    g_printer.ko  g_webcam.ko  raw_gadget.ko
gadgetfs.ko  g_cdc.ko    g_ether.ko  g_mass_storage.ko  g_multi.ko  g_nokia.ko  g_serial.ko   g_zero.ko    tcm_usb_gadget.ko
mcuee in rpi400Ubuntu in drivers/usb/gadget🔒 ❯ ls
function  legacy  libcomposite.ko  udc
mcuee in rpi400Ubuntu in drivers/usb/gadget🔒 ❯ pwd
/lib/modules/5.11.0-1016-raspi/kernel/drivers/usb/gadget
mcuee in rpi400Ubuntu in drivers/usb/gadget🔒 ❯ uname -a
Linux rpi400Ubuntu 5.11.0-1016-raspi #17-Ubuntu SMP PREEMPT Thu Jul 29 15:33:06 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
mcuee commented 3 years ago

Going back to Raspbian, it does not seem to work either. I will need to try again. Looks like there is an issue in this system with regard to udc (no udc available). I will upgrade and try again.

vpelletier commented 3 years ago

Ah, sorry, I mistyped: the functionfs module is drivers/usb/gadget/function/usb_f_fs.ko, not drivers/usb/gadget/legacy/g_ffs.ko. Both exist in the kernel source, but neither seem present in your rpi400 listing.

vpelletier commented 3 years ago

I improved error reporting in b3b1dcb4f423f9614545170dd81b5c8868ba7b44 .

mcuee commented 3 years ago

Going back to Raspbian, it does not seem to work either. I will need to try again. Looks like there is an issue in this system with regard to udc (no udc available). I will upgrade and try again.

It works once I managed to enable udc.

  1. Add dtoverlay=dwc2 to the /boot/config.txt
  2. Add modules-load=dwc2 to the end of /boot/cmdline.txt
  3. Add libcomposite to /etc/modules
pi in raspberrypi400 in python-functionfs on  master via 🐍 v2.7.16 ❯ sudo python3 ./examples/usbcat/device.py 
epoll: fd 3 got event 1
onBind
epoll: fd 3 got event 1
onEnable
epoll: fd 3 got event 1
aio read completion received 100 bytes
^C%                                                                                                                    
pi in raspberrypi400 in python-functionfs on  master via 🐍 v2.7.16 took 14s ❯ uname -a
Linux raspberrypi400 5.10.60-v7l+ #1449 SMP Wed Aug 25 15:00:44 BST 2021 armv7l GNU/Linux

I tesed on the host side (my Mac Mini M1) using the python-libusb codes mentioned here. https://github.com/vpelletier/python-libusb1/issues/72#issuecomment-913252143

python_usb on  master [?] via 🐍 v2.7.16 ❯ python3 test_usbcat.py
Number of bytes written:  100
write: True
read: bytearray(b'')
Closing USB
Close handle successfully
Calling closing method to delete self
Closing USB
Close handle successfully
mcuee commented 3 years ago

@forderud Please try again with a more updated Raspberry Pi OS. I believe the issue is no longer there based on my testing.

mcuee commented 3 years ago

I improved error reporting in b3b1dcb .

Nice. This is from another OS for my Raspberry Pi 400 (Manjaro ARM 64bit) after I enabled the udc. The error message is useful to fix the issue.

mcuee in manjaropi400 in python-functionfs on  master via 🐍 v3.9.4 ❯ sudo python3 ./examples/usbcat/device.py 
Traceback (most recent call last):
  File "/home/mcuee/build/libusb/python-functionfs/./examples/usbcat/device.py", line 284, in <module>
    main()
  File "/home/mcuee/build/libusb/python-functionfs/./examples/usbcat/device.py", line 254, in main
    with GadgetSubprocessManager(
  File "/usr/lib/python3.9/site-packages/functionfs/gadget.py", line 505, in __enter__
    super(GadgetSubprocessManager, self).__enter__()
  File "/usr/lib/python3.9/site-packages/functionfs/gadget.py", line 262, in __enter__
    self.__enter()
  File "/usr/lib/python3.9/site-packages/functionfs/gadget.py", line 280, in __enter
    name = tempfile.mkdtemp(
  File "/usr/lib/python3.9/tempfile.py", line 359, in mkdtemp
    _os.mkdir(file, 0o700)
FileNotFoundError: [Errno 2] /sys/kernel/config/usb_gadget/ does not exist, is libcomposite module loaded ?: '/sys/kernel/config/usb_gadget/g_bcj4kams'

mcuee in manjaropi400 in python-functionfs on  master via 🐍 v3.9.4 ❯ sudo modprobe libcomposite
mcuee in manjaropi400 in python-functionfs on  master via 🐍 v3.9.4 ❯ sudo python3 ./examples/usbcat/device.py
epoll: fd 3 got event 1
onBind
epoll: fd 3 got event 1
onEnable
epoll: fd 3 got event 1
aio read completion received 100 bytes
^C%             
mcuee in manjaropi400 in python-functionfs on  master via 🐍 v3.9.4 took 10s ❯ uname -a
Linux manjaropi400 5.10.35-1-MANJARO-ARM #1 SMP PREEMPT Fri May 7 13:57:33 UTC 2021 aarch64 GNU/Linux                                          
forderud commented 3 years ago

Thanks for all the helpful feedback. Please feel free to close the issue as assumed fixed in more recent Linux versions.

DBlauhut commented 2 years ago

hello! i'm still investigating this issue. i tried multiple systems.

the older kernel 4.19 and 5.4 got an ooops after using python3-functionfs. thanks to the enabled kernel feature kasan on arm64 architecture i was able to log the error.

The same issue exists on both 5.10 kernel, but the kernel are able to handle this issue and print a single kernel warning and continue to work. i will attach those warnings and i will continue my investigation

kernel_warning_raspbian_5_10.log kernel_warning_yocto_imx7_5_10.log

vpelletier commented 2 years ago

On a pi zero w with kernel 5.14 I am getting a similar-looking warning when the process exits (visibly when unmounting ffs). Both traces mention eventfd_ctx_put in the late stages of the detected error. The only eventfd I knowingly use is for AIO events. And checking the code I realise I am not closing it explicitly anywhere, which is not nice but at least should rule out a bug cause like the kernel writing (for whatever "write" may technically mean in the context of an eventfd... trigger a file event ?) to a closed eventfd.

------------[ cut here ]------------
WARNING: CPU: 0 PID: 304 at lib/refcount.c:28 refcount_warn_saturate+0x110/0x15c
refcount_t: underflow; use-after-free.
Modules linked in: usb_f_ncm(E) u_ether(E) usb_f_fs(E) hci_uart(E) btqca(E) btrtl(E) btbcm(E) btintel(E)
bluetooth(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) bcm2835_v4l2(CE) bcm2835_mmal_vchiq(CE)
videobuf2_vmalloc(E) videobuf2_memops(E) videobuf2_v4l2(E) videobuf2_common(E) sha512_generic(E) videodev(E)
snd_bcm2835(CE) mc(E) bsoc_core(E) snd_pcm_dmaengine(E) snd_pcm(E) ansi_cprng(E) snd_timer(E) snd(E) soundcore(E)
raspberrypi_hwmon(E) cec(E) cfg80211(E) max17040_battery(OE) rc_core(E) regmap_i2c(E) ecdh_generic(E) rfkill(E)
ecc(E) bcm2835_rng(E) vchiq(CE) rng_core(E) leds_gpio(E) libcomposite(E) fuse(E) configfs(E) ip_tables(E)
x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) crc32c_generic(E) sdhci_iproc(E) sdhci_pltfm(E)
sdhci(E)
CPU: 0 PID: 304 Comm: smartcard-openp Tainted: G         C OE     5.14.0-3-rpi #1  Debian 5.14.12-1
Hardware name: BCM2835
Backtrace:
[<c08ed880>] (dump_backtrace) from [<c08edc28>] (show_stack+0x20/0x24)
 r7:00000009 r6:0000001c r5:c04a84f4 r4:c0a6600c
[<c08edc08>] (show_stack) from [<c08f1520>] (dump_stack+0x28/0x30)
[<c08f14f8>] (dump_stack) from [<c01231e8>] (__warn+0xe8/0x154)
 r5:c04a84f4 r4:c0a72e40
[<c0123100>] (__warn) from [<c08ee30c>] (warn_slowpath_fmt+0xa0/0xe4)
 r7:c04a84f4 r6:0000001c r5:c0a72e40 r4:c0a72e7c
[<c08ee270>] (warn_slowpath_fmt) from [<c04a84f4>] (refcount_warn_saturate+0x110/0x15c)
 r8:c0100244 r7:c0e00b84 r6:ffffffff r5:c33d3800 r4:c35faba0
[<c04a83e4>] (refcount_warn_saturate) from [<c0381518>] (eventfd_ctx_put+0x48/0x74)
[<c03814d0>] (eventfd_ctx_put) from [<bf3734e0>] (ffs_data_clear+0xd0/0x118 [usb_f_fs])
 r5:c33d3800 r4:db9e6100
[<bf373410>] (ffs_data_clear [usb_f_fs]) from [<bf375634>] (ffs_data_closed+0x9c/0x150 [usb_f_fs])
 r5:bf37a014 r4:db9e6100
[<bf375598>] (ffs_data_closed [usb_f_fs]) from [<bf375714>] (ffs_fs_kill_sb+0x2c/0x30 [usb_f_fs])
 r7:c0e00b84 r6:c3260dc0 r5:bf37a014 r4:c33cdc00
[<bf3756e8>] (ffs_fs_kill_sb [usb_f_fs]) from [<c0325548>] (deactivate_locked_super+0x54/0x9c)
 r5:bf37a014 r4:c33cdc00
[<c03254f4>] (deactivate_locked_super) from [<c03255f0>] (deactivate_super+0x60/0x64)
 r5:c30b9180 r4:c33cdc00
[<c0325590>] (deactivate_super) from [<c034861c>] (cleanup_mnt+0xe4/0x14c)
 r5:c30b9180 r4:00000000
[<c0348538>] (cleanup_mnt) from [<c03486fc>] (__cleanup_mnt+0x1c/0x20)
 r7:c0e00b84 r6:c3260dc0 r5:c3261254 r4:00000000
[<c03486e0>] (__cleanup_mnt) from [<c014390c>] (task_work_run+0x84/0xb8)
[<c0143888>] (task_work_run) from [<c010bfb0>] (do_work_pending+0x464/0x560)
 r7:5ac3c35a r6:c0d0424c r5:c32dbfb0 r4:c32da000
[<c010bb4c>] (do_work_pending) from [<c01000d0>] (slow_work_pending+0xc/0x20)
Exception stack(0xc32dbfb0 to 0xc32dbff8)
bfa0:                                     00000000 00000000 7fffffdb 000c99ac
bfc0: beac5e78 00000001 beac5f18 00000034 beac5f10 beac5e68 beac5e6c beac5e78
bfe0: 00000000 beac5e70 b67d0a1c b6d9c6a4 60000010 b5a3a418
 r10:00000000 r9:c32da000 r8:c0100244 r7:00000034 r6:beac5f18 r5:00000001
 r4:beac5e78
---[ end trace f66b55895bf9849a ]---
ffs_data_put(): freeing
unloading
vpelletier commented 2 years ago

(let's reopen this issue, just to make it visible that investigations are still ongoing)

vpelletier commented 2 years ago

Still happening on Debian (raspi port) 5.15.0-1 (this time from syslog to have the systemd lines related to service stopping):

systemd[1]: Stopping Behave like a CCID + smartcard combo USB device...
kernel: [  140.791681] ------------[ cut here ]------------
kernel: [  140.796674] WARNING: CPU: 0 PID: 308 at lib/refcount.c:28 refcount_warn_saturate+0x110/0x15c
kernel: [  140.805609] refcount_t: underflow; use-after-free.
kernel: [  140.810671] Modules linked in: usb_f_ncm(E) u_ether(E) usb_f_fs(E) hci_uart(E) btqca(E) btrtl(E)
btbcm(E) btintel(E) bluetooth(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) bcm2835_v4l2(CE) bcm2835_mmal_vchiq(CE)
videobuf2_vmalloc(E) videobuf2_memops(E) videobuf2_v4l2(E) sha512_generic(E) sha512_arm(E) videobuf2_common(E)
videodev(E) cpufreq_dt(E) snd_bcm2835(CE) brcmfmac(E) mc(E) ctr(E) brcmutil(E) vc4(E) snd_soc_core(E)
snd_pcm_dmaengine(E) drbg(E) snd_pcm(E) snd_timer(E) snd(E) soundcore(E) drm_kms_helper(E) ansi_cprng(E) cec(E)
rc_core(E) raspberrypi_cpufreq(E) raspberrypi_hwmon(E) syscopyarea(E) cfg80211(E) sysfillrect(E) sysimgblt(E)
fb_sys_fops(E) max17040_battery(OE) regmap_i2c(E) ecdh_generic(E) rfkill(E) ecc(E) vchiq(CE) bcm2835_rng(E)
rng_core(E) leds_gpio(E) libcomposite(E) fuse(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E)
mbcache(E) jbd2(E) crc32c_generic(E) sdhci_iproc(E) sdhci_pltfm(E) sdhci(E)
kernel: [  140.927046] CPU: 0 PID: 308 Comm: smartcard-openp Tainted: G         C OE     5.15.0-1-rpi #1  Debian 5.15.3-1
kernel: [  140.945271] Hardware name: BCM2835
kernel: [  140.952831] Backtrace:
kernel: [  140.959295] [<c08d60a0>] (dump_backtrace) from [<c08d62ec>] (show_stack+0x20/0x24)
kernel: [  140.975320]  r7:00000009 r6:0000001c r5:c04a948c r4:c0a64e2c
kernel: [  140.985567] [<c08d62cc>] (show_stack) from [<c08d9ae0>] (dump_stack+0x28/0x30)
kernel: [  140.997582] [<c08d9ab8>] (dump_stack) from [<c0123500>] (__warn+0xe8/0x154)
kernel: [  141.009240]  r5:c04a948c r4:c0a71dc8
kernel: [  141.017385] [<c0123418>] (__warn) from [<c08d6948>] (warn_slowpath_fmt+0xa0/0xe4)
kernel: [  141.033989]  r7:00000009 r6:0000001c r5:c0a71dc8 r4:c0a71e04
kernel: [  141.044294] [<c08d68ac>] (warn_slowpath_fmt) from [<c04a948c>] (refcount_warn_saturate+0x110/0x15c)
kernel: [  141.062494]  r8:c0100224 r7:c0dfcb84 r6:ffffffff r5:c29da000 r4:c33ee620
kernel: [  141.073861] [<c04a937c>] (refcount_warn_saturate) from [<c0380134>] (eventfd_ctx_put+0x48/0x74)
kernel: [  141.091651] [<c03800ec>] (eventfd_ctx_put) from [<bf54b4e8>] (ffs_data_clear+0xd0/0x118 [usb_f_fs])
kernel: [  141.109872]  r5:c29da000 r4:c308a300
kernel: [  141.117928] [<bf54b418>] (ffs_data_clear [usb_f_fs]) from [<bf54ccc0>] (ffs_data_closed+0x9c/0x150 [usb_f_fs])
kernel: [  141.136962]  r5:bf552014 r4:c308a300
kernel: [  141.144873] [<bf54cc24>] (ffs_data_closed [usb_f_fs]) from [<bf54cda0>] (ffs_fs_kill_sb+0x2c/0x30 [usb_f_fs])
kernel: [  141.163550]  r7:c0dfcb84 r6:c3226940 r5:bf552014 r4:c1f5e800
kernel: [  141.173596] [<bf54cd74>] (ffs_fs_kill_sb [usb_f_fs]) from [<c0326d50>] (deactivate_locked_super+0x54/0x9c)
kernel: [  141.192250]  r5:bf552014 r4:c1f5e800
kernel: [  141.200306] [<c0326cfc>] (deactivate_locked_super) from [<c0326df8>] (deactivate_super+0x60/0x64)
kernel: [  141.218145]  r5:c30ab000 r4:c1f5e800
kernel: [  141.226087] [<c0326d98>] (deactivate_super) from [<c0349a28>] (cleanup_mnt+0xe4/0x14c)
kernel: [  141.242943]  r5:c30ab000 r4:00000000
kernel: [  141.250965] [<c0349944>] (cleanup_mnt) from [<c0349b08>] (__cleanup_mnt+0x1c/0x20)
kernel: [  141.267377]  r7:c0dfcb84 r6:c3226940 r5:c3226ddc r4:00000000
kernel: [  141.277529] [<c0349aec>] (__cleanup_mnt) from [<c0143d10>] (task_work_run+0x84/0xb8)
kernel: [  141.294022] [<c0143c8c>] (task_work_run) from [<c010bdc8>] (do_work_pending+0x470/0x56c)
kernel: [  141.310943]  r7:5ac3c35a r6:c0d0424c r5:c29d5fb0 r4:c29d4000
kernel: [  141.321089] [<c010b958>] (do_work_pending) from [<c01000c0>] (slow_work_pending+0xc/0x20)
kernel: [  141.338116] Exception stack(0xc29d5fb0 to 0xc29d5ff8)
kernel: [  141.347689] 5fa0:                                     00000000 00000000 b5b7a5c0 000000e4
kernel: [  141.364654] 5fc0: befbfe40 00000001 befbfee0 00000034 befbfed8 befbfe30 befbfe34 befbfe40
kernel: [  141.381655] 5fe0: 00000000 befbfe38 b67b6a1c b6d816a4 60000010 b5a22468
kernel: [  141.392819]  r10:00000000 r9:c29d4000 r8:c0100224 r7:00000034 r6:befbfee0 r5:00000001
kernel: [  141.409416]  r4:befbfe40
kernel: [  141.416318] ---[ end trace 756cccb3dea8b1bd ]---
kernel: [  141.425258] ffs_data_put(): freeing
systemd[1]: tmp-ffs.usb0_e5zmgiph.mount: Deactivated successfully.
kernel: [  141.657398] unloading
systemd[1]: smartcard-gadget.service: Deactivated successfully.
systemd[1]: Stopped Behave like a CCID + smartcard combo USB device.
systemd[1]: smartcard-gadget.service: Consumed 25.199s CPU time.
DBlauhut commented 2 years ago

Hello, i think, that there is a problem with the single eventfd in combination with the usb driver. i modified the library to use select and multiple eventfds instead epoll on a single eventfd and the kernel panic seems to be gone. i have documented my changes in https://github.com/DBlauhut/python-functionfs/commit/34c7fd837eb5a5f3eb04f1d7ae13cb890ff778f3 but i'm not sure if i created other issues with the patch :)

vpelletier commented 2 years ago

I think you are on to something. I completely forgot that I am not using eventfd only for AIO blocks, but also for f_fs itself: to be notified of EP0 events - which are a bit of a misnomer, they are events which happen on EP0 but which are not USB protocol-level events, but rather kernel gadget subsystem events: "function enabled/disabled", ...

This made me realise that the bug is maybe not in AIO, but indeed in f_fs.

And I think I may have found the next level of this issue:

root@sushi:/sys/kernel/debug/tracing# echo ffs_data_clear > set_ftrace_filter
root@sushi:/sys/kernel/debug/tracing# echo function > current_tracer
root@sushi:/sys/kernel/debug/tracing# echo 1 > tracing_on

In another shell, run and kill a gadget, then:

root@sushi:/sys/kernel/debug/tracing# echo 0 > tracing_on
root@sushi:/sys/kernel/debug/tracing# cat trace
# tracer: function
#
# entries-in-buffer/entries-written: 3/3   #P:1
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
 smartcard-openp-434     [000] .....   793.274928: ffs_data_clear <-ffs_data_closed
 smartcard-openp-429     [000] .....   793.347753: ffs_data_clear <-ffs_data_closed
 smartcard-openp-429     [000] .n...   793.354921: ffs_data_clear <-ffs_data_put

ffs_data_clear is called 3 times, and somehow twice from ffs_data_closed. Note also that it is called for two PIDs: 434 will have been the function-level process, and 429 its gadget-level parent. The latter does the mounting, then spawns the subprocess, then waits for the subprocess to exit, and finally unmounts the FS. While tearing down, the subprocess will have closed the endpoints, which is expected to trigger the kernel-side teardown of the gadget.

I suspect the fix may be something like this (minus tabs-vs-spaces shenanigans):

diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c
index e20c19a0f106..9574f3231b0d 100644
--- a/drivers/usb/gadget/function/f_fs.c
+++ b/drivers/usb/gadget/function/f_fs.c
@@ -1776,8 +1776,10 @@ static void ffs_data_clear(struct ffs_data *ffs)
        if (ffs->epfiles)
                ffs_epfiles_destroy(ffs->epfiles, ffs->eps_count);

-       if (ffs->ffs_eventfd)
+       if (ffs->ffs_eventfd) {
                eventfd_ctx_put(ffs->ffs_eventfd);
+               ffs->ffs_eventfd = NULL;
+       }

        kfree(ffs->raw_descs_data);
        kfree(ffs->raw_strings);

I have not tested this (not even compile-tested it yet). I have no idea if this is the right place for such logic. I have no idea why there is not a similar complaint about epfiles despite it using a similar pattern (although I do see a caller NULLifying that struct member...).

vpelletier commented 2 years ago

I submitted a patch to the kernel USB subsystem list which should fix this bug.

DBlauhut commented 2 years ago

i created a pull request that may solve the issue in python-functionfs directly https://github.com/vpelletier/python-functionfs/pull/24

vpelletier commented 2 years ago

Good news everyone !

The bug is fixed in 5.16, with backports to:

As far as I can see, it is not possible to access these versions - at least not without some fuzzy parsing of a distro-dependent version string, like this:

$ uname -v
#1 SMP Debian 5.15.3-1 (2021-11-18)

The only easily accessible value lacks the minor version:

$ uname -r
5.15.0-1-amd64

So I think I will stick with the heuristic I implemented in f5d40a2a2cf3bab3f17df1fe6faf61167d012395: a simple cutoff at 5.16, and the ability to control the quirk via a class property.

vpelletier commented 2 years ago

0.9.0 released with the workaround for older kernels. Hopefully this should actually resolve this issue...