raspberrypi / firmware

This repository contains pre-compiled binaries of the current Raspberry Pi kernel and modules, userspace libraries, and bootloader/GPU firmware.
5.16k stars 1.68k forks source link

mmal_vc_port_disable() stuck in mmal_vc_sendwait_message() #1428

Open janus926 opened 4 years ago

janus926 commented 4 years ago

Describe the bug I was working with the RPi camera using picamera v1.13 and found the python doesn't return from stop_recording() sometimes. It is blocked in mmal_vc_sendwait_message() from mmal_port_disable(). So I tried to reproduce with the default image and raspivid, and found it is also reproducible.

To reproduce Run command while true; do raspivid -w 640 -h 480 -b 450000 -fps 24 -t 5000 -v -o tcp://x.x.x.x:yyyy; sleep 3; done, then just leave it there for a while (usually it takes hours to reproduce).

Expected behaviour mmal_port_disable() to return.

Actual behaviour mmal_port_disable() stuck in mmal_vc_sendwait_message(). If I stop raspivid (Ctrl+C) when the issue occurs and rerun it, it will be blocked in the first call to mmal_vc_sendwait_message(). A reboot is required to recover.

System Copy and paste the results of the raspinfo command in to this section. Alternatively, copy and paste a pastebin link, or add answers to the following questions:

Logs

 mmal: mmal_port_disconnect: vc.ril.camera:out:0(OPQV)(0xaec1b0)
 mmalipc: mmal_vc_sendwait_message: wait 0xb6efd3c4, reply to 0xbebb5538
 mmalipc: mmal_vc_vchiq_callback: reason 2
 mmalipc: mmal_vc_vchiq_callback: waking up waiter at 0xb6efd3c4
 mmalipc: mmal_vc_vchiq_callback: copying payload @0xbebb5538 to 0xb4b08718 len 28
 mmalipc: mmal_vc_sendwait_message: got reply (len 28/28)
 mmalipc: release_waiter: at 0xb6efd3c4
 mmal: mmal_connection_destroy: 0xb03490, vc.ril.camera:out:1/vc.ril.video_encode:in:0
 mmal: mmal_connection_disable: 0xb03490, vc.ril.camera:out:1/vc.ril.video_encode:in:0
 mmal: mmal_port_disable: vc.ril.camera(3:1) port 0xaec890
 mmalipc: mmal_vc_sendwait_message: wait 0xb6efd3c4, reply to 0xbebb54f0

Additional context I added some logs to vchiq_2835_arm.c in the linux kernel, and found vchiq_doorbell_irq() for BELL0 wasn't triggered.

[17452.138027] vchiq: vchiq_ioctl - instance b52957a8, cmd AWAIT_COMPLETION, arg b6c62d7c
[17452.143248] vchiq: vchiq_ioctl - instance b52957a8, cmd QUEUE_MESSAGE, arg be8e1b64
[17452.143712] vchiq: 0: qm DATA@338f3751,64 (12->90)
[17452.143755] vchiq: Sent: 00000000: 6d 6d 61 6c 0a 00 00 00 e8 05 00 00 dc c1 f1 b6 mmal............
[17452.143768] vchiq: Sent Msg DATA(5) to mmal s:12 d:90 len:100
[17452.143778] vchiq: remote_event_signal BELL2 a=1 e=525227796
[17452.143789] vchiq:   ioctl instance d7879000, cmd QUEUE_MESSAGE -> status 0, 0
# 
6by9 commented 4 years ago

Can you try adding start_debug=1 to config.txt, and then run sudo vcdbg log assert when it locks up?

janus926 commented 4 years ago

I have been trying for more than a week, but haven't succeeded to reproduce it with start_debug=1. Note I can still reproduce it without start_debug=1.

janus926 commented 4 years ago

I haven't succeeded reproducing this with start_debug=1 on RPi using the official image. However I reproduced once with fixup_db.dat and start_db.elf using buildroot image (rpi-firmware-1.20200601, rpi-userland-188d3bfe4a0ac36b119a2cee35a6be8d0c68e09e, Linux 4.19.118 PREEMPT). The first four asserts were existed after boot up. Not sure if this helps.

# vcdbg log assert
001514.142: assert( source ) failed; ../../../../../middleware/confzilla/cp_front_fdt.c::cp_front_fdt_load_builtin line 115 rev 82f9bb9
vcdbg_ctx_get_dump_stack: dump_stack failed
----------------
001550.508: assert( edid_state[hdmi_core].i2c_driver ) failed; ../../../../../middleware/hdmi/edid.c::hdmi_edid_shutdown line 736 rev 82f9bb9
vcdbg_ctx_get_dump_stack: dump_stack failed
----------------
006614.091: assert( card not initialised ) failed; ../../../../../filesystem/media/sdcard/sdcard.c::sdhost_close line 1045 rev 82f9bb9
vcdbg_ctx_get_dump_stack: dump_stack failed
----------------
011162.790: assert( camera_subsystem_state.power_enable_ref_count[port] > 0 ) failed; ../../../../../vcfw/drivers/device/camera_subsystem/camera_subsystem.c::camera_disable_ctrl line 1141 rev 82f9bb9
vcdbg_ctx_get_dump_stack: dump_stack failed
----------------
647886.630: assert( st == 0 ) failed; ../../../../../middleware/openmaxil/components/ve_worker.c::ve_worker_close line 480 rev 82f9bb9
vcdbg_ctx_get_dump_stack: dump_stack failed
----------------
# 
6by9 commented 4 years ago

That last assert in ve_worker_close is that it's asked part of the codec to disable, and it hasn't done so within a second, so something would appear to have gone fairly badly wrong.

Can I confirm that this is H264 encoding (not MJPEG)? How "busy" a scene are you encoding?

Sorry I'm busy on other things at present, but could you try dropping the bitrate lower still and see if that increases the reproduction rate? I'm just wondering if things go weird if the bitrate control decided to drop the last frame submitted as we'd exceeded the target bitrate, so that there isn't a corresponding output frame for the input. Unfortunately that'll be down in the bowels of the codec so quite an involved one to debug.

janus926 commented 4 years ago

Can I confirm that this is H264 encoding (not MJPEG)? How "busy" a scene are you encoding?

Yes, this is H264 encoding. I just leave the camera on my desk (in the office), which is not busy.

Sorry I'm busy on other things at present, but could you try dropping the bitrate lower still and see if that increases the reproduction rate?

Sure, I'll try that and update later.

janus926 commented 4 years ago

I shorten the sleep time from 3s to 1s between raspivid invocations, and try to make a "busy" scene by shaking a paper card in front of the camera. The tests run for 5 minutes each.

bitrate reproduce rate
450000 0/5
200000 0/5
50000 4/5
6by9 commented 4 years ago

Thanks. It sounds like it is the codec getting upset should the last frame be discarded due to bitrate control. It's an area I knew could sometimes complain somewhat, but I hadn't seen a full lockup. I'll see if I can find some time to investigate this week.

HughWarrington commented 4 years ago

I seem to be seeing the same problem on the RPi4. It looks like @janus926 has already got a lot further, but here's my details anyway. Let me know if I can provide more useful info.

Today the video reader process on one of our devices froze. Nothing obviously unique about the device we saw it on; this is the first time I have seen this problem in ~10k hours of the same code running across a small fleet of identical devices. There was a partly written video file still open with no further bytes being appended. Restarting the process didn't help but rebooting the device fixed it.

I'm not familiar with how to debug this kind of problem but I ran the following, which led me via Google to this ticket.

$ gdb python 17

(gdb) bt
#0  0xf793b524 in __libc_do_syscall () from /lib/arm-linux-gnueabihf/libpthread.so.0
#1  0xf79387d8 in do_futex_wait.constprop () from /lib/arm-linux-gnueabihf/libpthread.so.0
#2  0xf79388b8 in __new_sem_wait_slow.constprop.1 () from /lib/arm-linux-gnueabihf/libpthread.so.0
#3  0xf6a08da4 in mmal_vc_sendwait_message () from /opt/vc/lib/libmmal_vc_client.so
#4  0xf6a0bbd4 in mmal_vc_port_disable () from /opt/vc/lib/libmmal_vc_client.so
#5  0xf69b793c in mmal_port_disable_internal () from /opt/vc/lib/libmmal_core.so
#6  0xf69b8530 in mmal_port_disable () from /opt/vc/lib/libmmal_core.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(gdb) info threads
  Id   Target Id                                    Frame 
* 1    Thread 0xf7bf1010 (LWP 17) "python3"         0xf793b524 in __libc_do_syscall () from /lib/arm-linux-gnueabihf/libpthread.so.0
  2    Thread 0xf68b1460 (LWP 19) "VCHIQ completio" 0xf7825f08 in ioctl () from /lib/arm-linux-gnueabihf/libc.so.6
  3    Thread 0xf5eff460 (LWP 20) "HDispmanx Notif" 0xf793b524 in __libc_do_syscall () from /lib/arm-linux-gnueabihf/libpthread.so.0
  4    Thread 0xf56fe460 (LWP 21) "HTV Notify"      0xf793b524 in __libc_do_syscall () from /lib/arm-linux-gnueabihf/libpthread.so.0
  5    Thread 0xf4efd460 (LWP 22) "HCEC Notify"     0xf793b524 in __libc_do_syscall () from /lib/arm-linux-gnueabihf/libpthread.so.0
  6    Thread 0xf46fc460 (LWP 24) "vc.ril.camera"   0xf793b524 in __libc_do_syscall () from /lib/arm-linux-gnueabihf/libpthread.so.0
  7    Thread 0xf3efb460 (LWP 25) "vc.null_sink"    0xf793b524 in __libc_do_syscall () from /lib/arm-linux-gnueabihf/libpthread.so.0
  8    Thread 0xf36fa460 (LWP 26) "vc.ril.video_sp" 0xf793b524 in __libc_do_syscall () from /lib/arm-linux-gnueabihf/libpthread.so.0
  9    Thread 0xf24ff460 (LWP 28) "vc.ril.video_en" 0xf793b524 in __libc_do_syscall () from /lib/arm-linux-gnueabihf/libpthread.so.0
HughWarrington commented 4 years ago

Seen again today, I think. I had the same symptom of the Python process getting stuck, running the same code and hardware type as yesterday but on a different physical device.

The stack trace is slightly different. It does involve mmal_vc_sendwait_message but not mmal_vc_port_disable which is why I thought it would be worth posting here.

Please let me know if this turns out to be irrelevant, and I will file a separate ticket.

(gdb) bt
#0  0xf7cb6524 in __libc_do_syscall () from /lib/arm-linux-gnueabihf/libpthread.so.0
#1  0xf7cb37d8 in do_futex_wait.constprop () from /lib/arm-linux-gnueabihf/libpthread.so.0
#2  0xf7cb38b8 in __new_sem_wait_slow.constprop.1 () from /lib/arm-linux-gnueabihf/libpthread.so.0
#3  0xf6d43da4 in mmal_vc_sendwait_message () from /opt/vc/lib/libmmal_vc_client.so
#4  0xf6d465f0 in mmal_vc_component_create () from /opt/vc/lib/libmmal_vc_client.so
#5  0xf6cf5928 in mmal_component_supplier_create () from /opt/vc/lib/libmmal_core.so
#6  0xf6cf63b4 in mmal_component_create_core () from /opt/vc/lib/libmmal_core.so
#7  0xf6e0abd8 in ffi_call_VFP () from /usr/lib/arm-linux-gnueabihf/libffi.so.6
#8  0xf6e0b26c in ffi_call () from /usr/lib/arm-linux-gnueabihf/libffi.so.6

NB there were another 57 (!) stack frames but I've left them out since they were all in Python.

Omccormack-blip commented 3 years ago

Hi all - this problem is holding back our machine vision project. @6by9 I wonder if you've had the chance to look into it? I could pay for your/someone's time to bring it up the priority list if you think you can solve it!

6by9 commented 3 years ago

@Omccormack-blip have you confirmed that it is exactly the same assert firing in your case? start_debug=1 in config.txt, and then sudo vcdbg log assert after it has locked up.

There's the ugly way of solving this (just shut down always), and the clean way (find the right flags and set them appropriately when RC asks for a skip). I'll see what I can come up with.

HughWarrington commented 3 years ago

I have tried putting start_debug=1 in config.txt on my Pi 4, but the device will then no longer boot.

HughWarrington commented 3 years ago

Any update on this? We have gone to some lengths to write a watchdog to detect this problem and reboot on our fleet of ~100 devices. This seems to work except just now and then the problem appears to spill over into stuck kernel workers and other general carnage so even our watchdog can't reboot the device.

janus926 commented 3 years ago

It sounds like it is the codec getting upset should the last frame be discarded due to bitrate control. It's an area I knew could sometimes complain somewhat, but I hadn't seen a full lockup. I'll see if I can find some time to investigate this week.

@6by9

It has been a while, are there any chances to move this forward? Thanks.

rolandjitsu commented 2 years ago

I think I've stumbled upon the same issue. After a number of restarts of the binary that uses the camera and invokes the mmal_port_disable func, the function gets stuck at:

mmal: mmal_port_disable: vc.ril.rawcam(3:0) port 0x6b200db0

Upon the next start of the binary, the camera is stuck and the only way to resolve this is to reboot the pi.