raspberrypi / firmware

This repository contains pre-compiled binaries of the current Raspberry Pi kernel and modules, userspace libraries, and bootloader/GPU firmware.

mmal: Get component port state debug information #377

Closed julianscheel closed 7 years ago

julianscheel commented 9 years ago

Is there a similar mechanism for mmal, like OMX_GetDebugInformation for OpenMAX, which would let us see the state of the active mmal components and their ports? I am looking into an issue where mmal_port_flush() does not seem to work properly on the image_fx component - my internal refcounting says that one buffer is still in use by the input port. It would be quite helpful to have a way to see the internal port state.

julianscheel commented 9 years ago

vcsm is still disabled in my kernel; I can enable it and set zero_copy - I just assumed it would not be relevant for the opaque configuration anyway, because the buffers are tiny (see the message size of 144 bytes here). vcdbg log assert is indeed always empty. Would it be possible to provide a debug firmware?

julianscheel commented 9 years ago

I am just waiting to hit the deadlock again and am following the vcdbg log in parallel. Interestingly, I frequently get errors on the image_fx input port which are not reported back to userspace. mmal_port_send_buffer in the deinterlace filter always returns MMAL_SUCCESS, but in the log I see:

2200833.819: mmal: mmal_opaque_image_get: image 0x1f4bf618 (0xd00b7d00/0xe) has invalid refcount (0)
2200833.881: mmal-opaque: mmal_ril_tunnel_process_input_buffer: ril.image_fx:in:0(OPQV) could not unpack opaque data (EINVAL): 0xd00b7d00/0xe flags 0x0 pts 2198568869 seq 82
2200833.910: mmal: mmal_port_send_buffer: ril.image_fx:in:0(OPQV): send failed: EINVAL
2200833.935: mmalsrv: on_bulk_event: could not deliver buffer: 3

The error seems to occur close to codec destruction/construction cycle in my test loop, so I assume it's about component lifecycle... Waiting for the deadlock to occur now to post related logs then...

popcornmix commented 9 years ago

Looks like VCOS_RELEASE_ASSERTS will allow asserts to appear in release firmware. I'll include that in next firmware update.

luked99 commented 9 years ago

@julianscheel They might be configured in opaque mode, but at least one of them (image_fx) doesn't support opaque and is probably using bulk transfers, which is why you've got vchiq_queue_bulk_transmit() in the callstack....

luked99 commented 9 years ago

@popcornmix does vcdbg tx -v work?

popcornmix commented 9 years ago

No. I'll ask if it's possible to enable.

julianscheel commented 9 years ago

@luked99 Why would image_fx not support opaque mode? We had this issue for integrating opaque support into image_fx: https://github.com/raspberrypi/firmware/issues/278 - and since then we have used it in opaque mode, which means we push in buffers with ENCODING_OPAQUE and get buffers back out with ENCODING_OPAQUE.

6by9 commented 9 years ago

image_fx does support opaque mode on both input and output.

mmal_opaque_image_get failing probably means that the images have been released from the opaque buffer before presenting to the input port. I'd need to check how MMAL arranged to do the bypass of the bulk receive for small buffers - it may still call on_bulk_event to the handler direct from the control message context.

vcdbg tx -v requires a non-stripped ELF file, and those are not released for Pi. The version of vcdbg on the Pi branch also doesn't actually have the command (I had asked for my own use as we can provide the ELF as an extra file, and could then process ram dumps if there was an easy way to produce them).

Do we really want to enable VCOS_RELEASE_ASSERTS? I suspect there may be a fair few firing (probably linked to sysman being disabled - there are already vcos_log_errors logged in vcdbg log msg due to H264 power control going amiss). Airing dirty laundry and all that :-o

julianscheel commented 9 years ago

@6by9 I would have expected the images to be valid until the component which owns the pool (the codec in this case) is finally destroyed. Is that assumption correct? If so, it seems that my approach of locking the component via mmal_component_acquire isn't working yet, which would be odd. Anyway, it's still questionable whether these invalid pictures are actually causing the havoc later on (still waiting for the stall... had it like 10 times in the morning, but right now it is taking its time...)

6by9 commented 9 years ago

Hang on,

#2  0xb4d3d518 in mmal_vc_send_message (client=0xb4d4d444 <client>, msg_header=0xadaf5534, size=296, 
    data=0xacbf0e50 "", data_size=144, msgid=11)

data size 144? That sounds a tad large for an opaque buffer, though it may include headers.

Source for the ARM side is in userland: https://github.com/raspberrypi/userland/blob/master/interface/mmal/vc/mmal_vc_api.c#L560. MMAL_VC_SHORT_DATA is defined as 128, but line 622 has obviously picked up 144 as buffer->length, which is therefore too long. Where has that buffer header come from? It seems wrong for an opaque buffer.

@julianscheel could you check for me what value you typically get back for buffer->length on an opaque buffer? I haven't got a setup or source to hand.

julianscheel commented 9 years ago

@6by9 just added a debug print before mmal_port_send_buffer - the buffer length is 144 all the time. So it wouldn't switch to short mode for opaque...

luked99 commented 9 years ago

Thread 4 (Thread 0xb62ed460 (LWP 1296)):
#0  0xb6cbd11c in ioctl () from /lib/libc.so.6
#1  0xb4da065c in vchiq_queue_bulk_transmit (handle=2887, data=0xacbf0e50, size=144, userdata=0xadaf5534)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/interface/vchiq_arm/vchiq_lib.c:412

You wouldn't get that call stack if opaque images were being used.

julianscheel commented 9 years ago

@luked99 But 144 bytes is surely too short for a non-opaque image, isn't it? And as 144 is more than 128, a bulk transfer is used for it.

julianscheel commented 9 years ago

Ok, I just ran into a stall - slightly different this time, as it was not caused by sending an input buffer simultaneously with teardown, but by applying a new format (an aspect ratio change occurred). Relevant stack parts:

Thread 4 (Thread 0xb04ff460 (LWP 12559)):
#0  0xb6dc804c in do_futex_wait () from /lib/libpthread.so.0
#1  0xb6dc8108 in sem_wait@@GLIBC_2.4 () from /lib/libpthread.so.0
#2  0xb483be78 in vcos_semaphore_wait (sem=0xb484e484 <client+64>)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/build/inc/interface/vcos/vcos_platform.h:254
#3  0xb483e184 in mmal_vc_sendwait_message (client=0xb484e444 <client>, msg_header=0xb04fe790, size=304, msgid=9, 
    dest=0xb04fe654, destlen=0xb04fe650, send_dummy_bulk=0)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/interface/mmal/vc/mmal_vc_client.c:593
#4  0xb4842540 in mmal_vc_port_info_set (port=0xae8f0a00)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/interface/mmal/vc/mmal_vc_api.c:1015
#5  0xb484289c in mmal_vc_port_set_format (port=0xae8f0a00)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/interface/mmal/vc/mmal_vc_api.c:1067
#6  0xb48705a0 in mmal_port_format_commit (port=0xae8f0a00)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/interface/mmal/core/mmal_port.c:323

#7  0xb095e074 in configure_display (vd=0xb484e484 <client+64>, cfg=0x80, fmt=0x0)
    at ../../../../modules/hw/mmal/vout.c:439

and

Thread 6 (Thread 0xb637d460 (LWP 12555)):
#0  0xb6dc804c in do_futex_wait () from /lib/libpthread.so.0
#1  0xb6dc8108 in sem_wait@@GLIBC_2.4 () from /lib/libpthread.so.0
#2  0xb483be78 in vcos_semaphore_wait (sem=0xb484e468 <client+36>)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/build/inc/interface/vcos/vcos_platform.h:254
#3  0xb483e184 in mmal_vc_sendwait_message (client=0xb484e444 <client>, msg_header=0xb637cb2c, size=28, msgid=7, 
    dest=0xb637cb48, destlen=0xb637cb28, send_dummy_bulk=0)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/interface/mmal/vc/mmal_vc_client.c:593
#4  0xb4841584 in mmal_vc_component_disable (component=0xb610f3a0)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/interface/mmal/vc/mmal_vc_api.c:714
#5  0xb4877a34 in mmal_component_disable (component=0xb610f3a0)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/interface/mmal/core/mmal_component.c:429
#6  0xb48f5150 in CloseDecoder (dec=<error reading variable: value has been optimized out>)
    at ../../../../modules/hw/mmal/codec.c:284

From the moment where the stall occurs the vcdbg log msg loops with:

debug_sym: AccessVideoCoreMemory: mmap failed: Invalid argument(22)
Unable to read log message header from 0xdffffff9

I assume this is just vcdbg failing to talk to the core?

luked99 commented 9 years ago

That probably means the VideoCore has decided to go scribbling over memory.

6by9 commented 9 years ago

Yes, something squiffy there. Opaque buffers must have grown and now exceed the limit. I know on other branches we had increased MMAL_VC_SHORT_DATA to either 256 or 512, so we should be able to do something similar if it has grown.

@popcornmix Could you do a quick check for me. Opaque buffers did gain cropping info or something similar as you needed it for XBMC. Could you check whether we've got a simple goof here and the opaque buffer is now > 128? Thanks.

luked99 commented 9 years ago

@julianscheel In the meantime, if you make the buffers zero-copy (yes, you can have zero-copy opaque buffers) then it's possible it might just work (no promises though).

6by9 commented 9 years ago

vcdbg doesn't need to talk to VC for most of the commands. It opens VC's memory and extracts as much as it can, but does rely on finding some symbols from memory. It does sound like VC has exploded and trampled memory. I'm afraid one of us getting JTAG onto VC is going to be the only solution here, but time is the limited resource :-(

@luked99 You're obviously busy if you're responding to Pi Github issues within minutes ;-)

julianscheel commented 9 years ago

I just tried to enable zero copy. My kernel has CONFIG_BCM_VC_SM=y now, and right before port_enable I added:

    MMAL_PARAMETER_BOOLEAN_T zero_copy = {
        { MMAL_PARAMETER_ZERO_COPY, sizeof(MMAL_PARAMETER_BOOLEAN_T) },
        1
    };

    status = mmal_port_parameter_set(sys->input, &zero_copy.hdr);
    if (status != MMAL_SUCCESS) {
       msg_Err(filter, "Failed to set zero copy on port %s (status=%"PRIx32" %s)",
                sys->input->name, status, mmal_status_to_string(status));
       goto out;
    }

I did this for all participating input and output ports, but now all I get are frequent MMAL_EVENT_ERROR buffers on control_port_cb, and vcdbg log msg shows

239615.322: mmalsrv: mmal_server_do_buffer_from_host: ril mem ril.video_decod-1:failed 3

Any thoughts about that?

julianscheel commented 9 years ago

@6by9 could I just increase MMAL_VC_SHORT_DATA in the mmal userland code? Or would that require some firmware tweak as well?

6by9 commented 9 years ago

@julianscheel No, the firmware needs to match, and IIRC the VCHI message buffers need enlarging too, to carry the larger max buffer size. I still want to know why it has grown, as I had remembered them being around 64 bytes on this branch.

VCSM really has not been tested on Pi. It was the only way of doing things on a later project branch, but due to a weird error condition within VCHI (VC service wasn't running but the kernel attempted to connect - boom) it was disabled on Pi until about October/November. It's on the list of things that I want to look into, but not had time for yet. Any use is potentially premature (unlike other things which are now overdue!)

luked99 commented 9 years ago

"You're obviously busy if you're responding to Pi Github issues within minutes ;-)"

Compiling, innit.... :-)

julianscheel commented 9 years ago

@6by9 Ok, I think increasing it makes sense anyway, because messages like the one used for mmal_port_format_commit are bigger than 128 bytes as well. See the last stack, where a mmal_port_format_commit used 304 bytes. So if we were able to transmit 512 bytes it would probably be enough to avoid all bulk copies when running in opaque mode.

julianscheel commented 9 years ago

@luked99 Maybe you should use a slower computer, so you get more compiling time ;)

6by9 commented 9 years ago

One thought on zero copy is that the buffer pool has to be allocated with mmal_port_pool_create rather than a straight mmal_pool_create (not that I can find the implementation of that via github's web interface at the moment, even though it is called from mmal_port_connection_enable). It's all linked to buffer association with ports to ensure they get cleaned up on VC if a client dies.

It won't make a difference to anything other than opaque buffers: https://github.com/raspberrypi/userland/blob/master/interface/mmal/vc/mmal_vc_api.c#L624

if (length <= MMAL_VC_SHORT_DATA && !port->priv->module->is_zero_copy &&
    (port->format->encoding == MMAL_ENCODING_OPAQUE ||
     port->type == MMAL_PORT_TYPE_CLOCK))
{
   memcpy(msg->short_data, buffer->data + buffer->offset, buffer->length);
   msg->payload_in_message = length;
   length = 0;
}
else
{
   msg->payload_in_message = 0;
}

@luked99 Wrong activity for compiling! http://xkcd.com/303/

julianscheel commented 9 years ago

@6by9 switching to mmal_port_pool_create is not helping. Still getting the same errors when zero copy is requested.

6by9 commented 9 years ago

@julianscheel Ack - I'll see if I can have a play tonight to see if I can add zero copy to raspistill/vid.

julianscheel commented 9 years ago

@6by9 Thanks, could @popcornmix provide a testing firmware with bigger short data maybe?

6by9 commented 9 years ago

Opaque buffers grew back in about August to add some extra information for deinterlacing and similar.

Changing MMAL_VC_SHORT_DATA will break compatibility between random userland builds and firmware - bad news. We're looking at trimming the opaque buffer size back down to <128 - that shouldn't be too bad, and will get us back to the payload_in_message case.

julianscheel commented 9 years ago

Ok, I understand that breaking compatibility between userland/firmware revisions should be avoided where possible. But if short data is not increased, the cases where mmal_port_format_commit triggers the issue wouldn't be covered, right? What could we do there? Get zero_copy to work properly? Or could we possibly have some dynamic mode-switch in the firmware, where a recent userland queries the firmware to activate bigger buffers?

6by9 commented 9 years ago

As per my comment above, port_format_commit will never use this method anyway. The only place this code is present is as part of mmal_vc_port_send(), and then it is only active if the encoding is opaque.

if (length <= MMAL_VC_SHORT_DATA && !port->priv->module->is_zero_copy &&
    (port->format->encoding == MMAL_ENCODING_OPAQUE ||
     port->type == MMAL_PORT_TYPE_CLOCK))
{
   memcpy(msg->short_data, buffer->data + buffer->offset, buffer->length);
   msg->payload_in_message = length;
   length = 0;
}
else
{
   msg->payload_in_message = 0;
}

Zero copy is also not used on anything other than buffer passing.

julianscheel commented 9 years ago

Ah ok, got that now. Then shrinking it down below 128 bytes should be fine, of course. Would port_format_commit use the shared memory approach once it's actually working? Or are we stuck with a bulk transfer in that case anyway?

luked99 commented 9 years ago

I wonder if it would be possible to get userland to try to read the short data length from VideoCore, and use that value if it can, or default to the current value if it can't.


popcornmix commented 9 years ago

Latest rpi-update should include asserts in log in release builds.

julianscheel commented 9 years ago

I just reproduced a deadlock with the new firmware. The assert log shows this:

$ /tmp/vcdbg log assert
4018892.215: assert( object->magic == OBJECT_MAGIC ) failed; ../../../../../helpers/vc_pool/vc_pool.c::vc_pool_offsetof line 461
4018892.244: assert( is_valid_vc_image_buf(src, src->type) ) failed; ../../../../../helpers/vc_image/vc_image.c::vc_image_convert_blt line 1322
4018892.280: assert( is_valid_vc_image_buf(src, src->type) ) failed; ../../../../../helpers/vc_image/vc_image.c::vc_image_convert line 3403
4018893.500: assert( is_valid_vc_image_buf(img, img->type) ) failed; ../../../../../helpers/vc_image/vc_image.c::vc_image_unlock line 6922
4018893.526: assert( is_aligned_32(g_mgr.handles[handle].header) ) failed; ../../../../../vcfw/rtos/common/rtos_common_mem.c::get_header line 882
4018893.548: assert( h ) failed; ../../../../../vcfw/rtos/common/rtos_common_mem.c::mem_unlock line 5169
4018893.578: assert( is_aligned_32(g_mgr.handles[handle].header) ) failed; ../../../../../vcfw/rtos/common/rtos_common_mem.c::get_header line 882
4018893.602: assert( mem_get_size(src->mem_handle) >= src->size ) failed; ../../../../../helpers/vc_image/vc_image.c::vc_image_lock line 6789
4018893.637: assert( is_aligned_32(g_mgr.handles[handle].header) ) failed; ../../../../../vcfw/rtos/common/rtos_common_mem.c::get_header line 882
4018893.671: assert( h ) failed; ../../../../../vcfw/rtos/common/rtos_common_mem.c::lock_big_internal line 4880
4018893.692: assert( h && h->ref_count != 0 ) failed; ../../../../../vcfw/rtos/common/rtos_common_mem.c::lock_big_internal line 4881
4018893.712: assert( object->magic == OBJECT_MAGIC ) failed; ../../../../../helpers/vc_pool/vc_pool.c::vc_pool_offsetof line 461

This time the locking threads are calling mmal_component_disable and mmal_picture_unlock, so it seems that this time not even a bulk transfer is involved. See the related stack parts:

Thread 2 (Thread 0xb09ff460 (LWP 3668)):
#0  0xb6df1bd8 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0xb6dec544 in pthread_mutex_lock () from /lib/libpthread.so.0
#2  0xb50d7154 in vcos_mutex_lock (latch=0xab8f0b74)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/build/inc/interface/vcos/vcos_platform.h:476
#3  0xb50d810c in mmal_component_release_internal (component=0xab8f0a00)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/interface/mmal/core/mmal_component.c:275
#4  0xb50d870c in mmal_component_release (component=0xab8f0a00)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/interface/mmal/core/mmal_component.c:355
#5  0xb184c94c in mmal_picture_unlock (picture=<optimized out>) at ../../../../modules/hw/mmal/mmal_picture.c:67
#6  0xb6587e14 in picture_pool_ReleasePicture (picture=<optimized out>) at ../../src/misc/picture_pool.c:91
#7  0xb658753c in picture_Release (p_picture=p_picture@entry=0xabb637c8) at ../../src/misc/picture.c:303
#8  0xb18e1874 in input_port_cb (port=<optimized out>, buffer=0xaf5f3988)
    at ../../../../modules/hw/mmal/deinterlace.c:581
#9  0xb50d3650 in mmal_port_buffer_header_callback (port=0xa94f0a00, buffer=0xaf5f3988)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/interface/mmal/core/mmal_port.c:913
#10 0xb50a1a30 in mmal_vc_do_callback (component=0xab3f0cc0)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/interface/mmal/vc/mmal_vc_api.c:512
#11 0xb50a1a5c in mmal_vc_do_callback_loop (component=0xab3f0cc0)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/interface/mmal/vc/mmal_vc_api.c:518
#12 0xb50d90d0 in mmal_component_action_thread_func (arg=0xab3f0cc0)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/interface/mmal/core/mmal_component.c:564
#13 0xb50ec1cc in vcos_thread_entry (arg=0xab3f0d20)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/interface/vcos/pthreads/vcos_pthreads.c:144
#14 0xb6dea0a0 in start_thread () from /lib/libpthread.so.0
#15 0xb6cf40f0 in ?? () from /lib/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 6 (Thread 0xb63e5460 (LWP 3657)):
#0  0xb6df104c in do_futex_wait () from /lib/libpthread.so.0
#1  0xb6df1108 in sem_wait@@GLIBC_2.4 () from /lib/libpthread.so.0
#2  0xb509ce78 in vcos_semaphore_wait (sem=0xb50af468 <client+36>)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/build/inc/interface/vcos/vcos_platform.h:254
#3  0xb509f184 in mmal_vc_sendwait_message (client=0xb50af444 <client>, msg_header=0xb63e4b2c, size=28, msgid=7, 
    dest=0xb63e4b48, destlen=0xb63e4b28, send_dummy_bulk=0)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/interface/mmal/vc/mmal_vc_client.c:593
#4  0xb50a2584 in mmal_vc_component_disable (component=0xab8f0a00)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/interface/mmal/vc/mmal_vc_api.c:714
#5  0xb50d8a34 in mmal_component_disable (component=0xab8f0a00)
    at /home/julian/dev/nv/buildroot/build/nvr-101/rootfs-debug/build/rpi-userland-6f122099ca3ef1115c740f927a64ccf54ca576ba/interface/mmal/core/mmal_component.c:429
#6  0xb5362154 in CloseDecoder (dec=<error reading variable: value has been optimized out>)
    at ../../../../modules/hw/mmal/codec.c:286
#7  0xb6588ed0 in generic_stop (func=<optimized out>, ap=...) at ../../src/modules/modules.c:359
#8  0xb6589998 in vlc_module_unload (module=module@entry=0xad2f0b48, deinit=0xb6588ec0 <generic_stop>)
    at ../../src/modules/modules.c:340
6by9 commented 9 years ago

Ouch. That's attempting to manipulate an invalid image, and may well be corrupting memory (the stack, or more likely the reloc heap). Is the VC side dead after this happens? It may well be linked to the mmal_opaque_image_get error you had commented on before - buffer management gone wrong :-( I know we did have a way of getting the call stack on each assert, but that needs all the symbols from an unstripped ELF file. I wonder if we can get a suitable assert dump that can be analysed offline against a matching full ELF. Or I'll just sort out running your test case (eventually).

Distraction has now arrived, so I am at home but will be sleep deprived :-)

julianscheel commented 9 years ago

@6by9 First of all my congratulations and all the best to you and your family :) - Have a lot of fun with the new distraction!

When I had the lockup today the VC was pretty much dead, yes. Actually vcgencmd version was working this time, but starting another decoder instance stalled immediately again. So if it is manipulating an invalid image, could this be due to a race on component destruction? Is there anything more I can do on the userspace side to avoid it, maybe?

Have you had success with shrinking the opaque buffer handles below 128 bytes again? I wonder if this would at least reduce the error cases again.

Is there anything else I can provide for debugging?

luked99 commented 9 years ago

Sorry - can you remind me where the code is (git url) that you're using, and what you're running?

Thanks.

julianscheel commented 9 years ago

@luked99 This comment contains the test package: https://github.com/raspberrypi/firmware/issues/377#issuecomment-78267842 - the tarball contains the source tree along with a precompiled build for raspbian in the build-cross subdir. If you want I can push my testing tree to a public git as well, although I think it can even be reproduced with stock vlc 3.0 git head without my latest patches.

6by9 commented 9 years ago

Thank you - all going very well at the moment.

I had thrown Dom a patch on Friday that did reduce opaque buffers back down to 128 bytes exactly. I'll ping him again over it (.... oops - stupid mail client replied to me instead of Dom. Resent now.) Particularly with all those asserts firing, I doubt it is linked. I suspect VC has trashed memory and is in a bad state (or at least one or two of the critical threads).

Unlikely to be component destruction itself. Buffer pool destruction, possibly. If the opaque buffer has been mishandled and presented after having been destroyed, then whilst it should all behave correctly and reject the buffer, I wonder if it has managed to do something bad. I'd also want to know what is calling vc_image_convert_blt (hence my desire to know the callstack that triggered the assert). There shouldn't be any real need to call it as everything should be handling the buffers as all the hardware blocks are able to handle the video in the native format. Possibly deinterlacing, but I didn't think you were using it here.

julianscheel commented 9 years ago

@6by9 Deinterlacing is in fact in use in the testcase. So it's likely the image_fx plugin which calls vc_image_convert_blt.

julianscheel commented 9 years ago

@6by9 In fact I think it ran stably without image_fx after I added the component guards. But let me verify this with a test run tonight, just to be sure.

luked99 commented 9 years ago

The deinterlacer is using opaque for its output port, right?

@6by9 - won't that mean it has its own VC_POOL_T of VC_IMAGE_T's? Is it possible it has created that VC_POOL itself, but has not asked for it to have the auto-destroy-when-empty flag set on it? So if it just gets destroyed when the deinterlace component is destroyed, anyone on the other end of that tunnel will crash if they've still got a handle to that pool (although as you say, that should be protected).

Just a guess.

julianscheel commented 9 years ago

@luked99 The deinterlacer is using opaque for output and input ports - and yes, it creates its own picture pool. This was quite painful to handle in VLC ;)

6by9 commented 9 years ago

@luked99 Yes, image_fx will create a new image pool for the output frames. The individual deinterlacer plugin is doing conversions from codec YUV to planar YUV as the algo needs planar. That is where the convert_blt is coming from. Image_fx appears to set up all the pools correctly with lazy destroy flags.

@julianscheel Deborah did tweak the advanced deinterlace algo not so long back to reduce the number of frames the codec had to allocate (long story). It might be worth just trying the fast algo instead to see if that helps as it can work natively on codec YUV. Otherwise I suspect it is getting an invalid source image somehow and trying to convert it, so that points to codec to deinterlace as the problem area. (I really will try to get your test image fired up tomorrow)

julianscheel commented 9 years ago

@6by9 Alright. I will keep the test with no deinterlace at all running over night. If it survives without crash I will start a test with using the fast algorithm on deinterlace. I can ramp up the number of test units tomorrow morning as well. Just have only one available remotely right now.

luked99 commented 9 years ago


If while it's running you do:

# vcdbg pools

then it should say if the "LAZY_DESTROY" flag is set to confirm this.

julianscheel commented 9 years ago

@luked99 Just checked it, it's set:

# /tmp/vcdbg pools
video_decodeRIL:image pool:
video_decodeRIL:image pool:
   pool_flags                   : 0x00000062: DIRECT LAZY_DESTROY
   magic                        : pOOl
   object_size                  : 648K (663712)
   nobjects                     : 23
   allocated                    : true
   overhead_size                : 64
   destroying                   : false
   refcount                     : 2
   alloc_fails                  : 18

   object 0
      magic                     : p00J
      refcount                  : 1
      mem                       : 47
      offset                    : 0
      last_thread.acquire       : ILVDecode
      last_thread.release       : MPEG#0 Outer

   object 1
      magic                     : p00J
      refcount                  : 1
      mem                       : 147
      offset                    : 0
      last_thread.acquire       : ILVDecode
      last_thread.release       : MPEG#0 Outer

   object 2
      magic                     : p00J
      refcount                  : 1
      mem                       : 97
      offset                    : 0
      last_thread.acquire       : ILVDecode
      last_thread.release       : MPEG#0 Outer

   object 3
      magic                     : p00J
      refcount                  : 1
      mem                       : 95
      offset                    : 0
      last_thread.acquire       : ILVDecode
      last_thread.release       : MPEG#0 Outer

   object 4
      magic                     : p00J
      refcount                  : 1
      mem                       : 109
      offset                    : 0
      last_thread.acquire       : ILVDecode
      last_thread.release       : MPEG#0 Outer

   object 5
      magic                     : p00J
      refcount                  : 1
      mem                       : 198
      offset                    : 0
      last_thread.acquire       : VCHIQ-0
      last_thread.release       : ILImageFX

   object 6
      magic                     : p00J
      refcount                  : 2
      mem                       : 44
      offset                    : 0
      last_thread.acquire       : ILVDecode
      last_thread.release       : ILVDecode

   object 7
      magic                     : p00J
      refcount                  : 2
      mem                       : 76
      offset                    : 0
      last_thread.acquire       : ILVDecode
      last_thread.release       : ILVDecode

   object 8
      magic                     : p00J
      refcount                  : 1
      mem                       : 218
      offset                    : 0
      last_thread.acquire       : VCHIQ-0
      last_thread.release       : mmal worker

   object 9
      magic                     : p00J
      refcount                  : 1
      mem                       : 187
      offset                    : 0
      last_thread.acquire       : VCHIQ-0
      last_thread.release       : ILImageFX

   object 10
      magic                     : p00J
      refcount                  : 1
      mem                       : 165
      offset                    : 0
      last_thread.acquire       : VCHIQ-0
      last_thread.release       : ILImageFX

   object 11
      magic                     : p00J
      refcount                  : 1
      mem                       : 96
      offset                    : 0
      last_thread.acquire       : VCHIQ-0
      last_thread.release       : ILImageFX

   object 12
      magic                     : p00J
      refcount                  : 1
      mem                       : 74
      offset                    : 0
      last_thread.acquire       : VCHIQ-0
      last_thread.release       : ILImageFX

   object 13
      magic                     : p00J
      refcount                  : 2
      mem                       : 123
      offset                    : 0
      last_thread.acquire       : VCHIQ-0
      last_thread.release       : MPEG#0 Outer

   object 14
      magic                     : p00J
      refcount                  : 1
      mem                       : 223
      offset                    : 0
      last_thread.acquire       : ILVDecode
      last_thread.release       : MPEG#0 Outer

   object 15
      magic                     : p00J
      refcount                  : 1
      mem                       : 133
      offset                    : 0
      last_thread.acquire       : ILVDecode
      last_thread.release       : MPEG#0 Outer

   object 16
      magic                     : p00J
      refcount                  : 1
      mem                       : 154
      offset                    : 0
      last_thread.acquire       : ILVDecode
      last_thread.release       : MPEG#0 Outer

   object 17
      magic                     : p00J
      refcount                  : 1
      mem                       : 209
      offset                    : 0
      last_thread.acquire       : ILVDecode
      last_thread.release       : MPEG#0 Outer

   object 18
      magic                     : p00J
      refcount                  : 1
      mem                       : 128
      offset                    : 0
      last_thread.acquire       : ILVDecode
      last_thread.release       : MPEG#0 Outer

   object 19
      magic                     : p00J
      refcount                  : 1
      mem                       : 219
      offset                    : 0
      last_thread.acquire       : ILVDecode
      last_thread.release       : MPEG#0 Outer

   object 20
      magic                     : p00J
      refcount                  : 1
      mem                       : 160
      offset                    : 0
      last_thread.acquire       : ILVDecode
      last_thread.release       : MPEG#0 Outer

   object 21
      magic                     : p00J
      refcount                  : 1
      mem                       : 72
      offset                    : 0
      last_thread.acquire       : ILVDecode
      last_thread.release       : MPEG#0 Outer

   object 22
      magic                     : p00J
      refcount                  : 1
      mem                       : 203
      offset                    : 0
      last_thread.acquire       : ILVDecode
      last_thread.release       : MPEG#0 Outer

image_fxRIL: image pool:
   pool_flags                   : 0x00000022: DIRECT LAZY_DESTROY
   magic                        : pOOl
   object_size                  : 621K (635904)
   nobjects                     : 9
   allocated                    : true
   overhead_size                : 64
   destroying                   : false
   refcount                     : 2
   alloc_fails                  : 10

   object 0
      magic                     : p00J
      refcount                  : 1
      mem                       : 242
      offset                    : 0
      last_thread.acquire       : ILImageFX
      last_thread.release       : ILImageFX

   object 1
      magic                     : p00J
      refcount                  : 1
      mem                       : 253
      offset                    : 0
      last_thread.acquire       : ILImageFX
      last_thread.release       : ILImageFX

   object 2
      magic                     : p00J
      refcount                  : 1
      mem                       : 258
      offset                    : 0
      last_thread.acquire       : ILImageFX
      last_thread.release       : ILImageFX

   object 3
      magic                     : p00J
      refcount                  : 1
      mem                       : 238
      offset                    : 0
      last_thread.acquire       : VCHIQ-0
      last_thread.release       : ILVRender

   object 4
      magic                     : p00J
      refcount                  : 1
      mem                       : 235
      offset                    : 0
      last_thread.acquire       : VCHIQ-0
      last_thread.release       : ILVRender

   object 5
      magic                     : p00J
      refcount                  : 0
      mem                       : 241
      offset                    : 0
      last_thread.acquire       : VCHIQ-0
      last_thread.release       : mmal worker

   object 6
      magic                     : p00J
      refcount                  : 1
      mem                       : 263
      offset                    : 0
      last_thread.acquire       : VCHIQ-0
      last_thread.release       : ILVRender

   object 7
      magic                     :
      refcount                  : 0
      mem                       : 240
      offset                    : 0
      last_thread.acquire       : VCHIQ-0
      last_thread.release       : ILImageFX

   object 8
      magic                     : p00J
      refcount                  : 2
      mem                       : 262
      offset                    : 0
      last_thread.acquire       : VCHIQ-0
      last_thread.release       : ILImageFX

julianscheel commented 9 years ago

The test without deinterlacing survived the night. I have now started testing with the fast algorithm.

julianscheel commented 9 years ago

I just experienced a deadlock with the fast algorithm. This time a bulk transfer (sending an opaque handle to the deinterlacer) and mmal_port_disable() on the codec are blocking each other. The assert log shows:

2901612.302: assert( 0 ) failed; ../../../../../helpers/vc_pool/image/vc_pool_image.c::vc_pool_image_release_ex line 490

And the VideoCore is completely dead (vcgencmd version does not return).