openbmc / btbridge

Apache License 2.0
6 stars 9 forks source link

btbridge crashes after 5-100 host power cycles #10

Closed nkskjames closed 8 years ago

nkskjames commented 8 years ago

btbridged crashes after 5-100 host power cycles. After crash, if I manually restart, everything is fine again.

Code path:

    if (bt_msg->call) {
        r = sd_bus_message_new_method_return(bt_msg->call, &msg);        <---- this is failing
        if (r < 0) {
            MSG_ERR("Couldn't create response message\n");             
            goto out;                                   <----bt_msg->call is still valid
        }
        r = 0; /* Just to be explicit about what we're sending back */
        r = sd_bus_message_append(msg, "x", r);
        if (r < 0) {
            MSG_ERR("Couldn't append result to response\n");
            goto out;
        }

    }
}
   // out: label is here.  removed so github would format properly
if (bt_msg->call) {
    if (sd_bus_send(context->bus, msg, NULL) < 0)                          <--- and this is failing and msg is probably null from previous error
        MSG_ERR("Couldn't send response message\n");                <--- I see this message print out then seg fault happens
    sd_bus_message_unref(msg);                          
}

} ...

btbridged --verbose

[BTBRIDGED] 1446346408: Timeout on msg with seq: 0x0d

    <This happens 1000's of times>

[BTBRIDGED] 1446346408: Timeout on msg with seq: 0x0d [BTBRIDGED] 1446346408: Timeout on msg with seq: 0x0d [BTBRIDGED] 1446346408: Timeout on msg with seq: 0x0d [BTBRIDGED] 1446346408: Timeout on msg with seq: 0x0d [BTBRIDGED] 1446346408: Timeout on msg with seq: 0x0d [BTBRIDGED] 1446346408: Timeout on msg with seq: 0x0d [BTBRIDGED] 1446346408: Timeout on msg with seq: 0x0d [BTBRIDGED] 1446346408: Timeout on msg with seq: 0x0d [BTBRIDGED] 1446346408: Sending dbus signal with seq 0x0e, netfn 0x00, lun 0x00, cmd 0x09 [BTBRIDGED] 1446346408: Timeout on msg with seq: 0x0d [BTBRIDGED] 1446346408: Received a dbus response for msg with seq 0x0e [BTBRIDGED] 1446346408: Processed 1 dbus events [BTBRIDGED] 1446346408: Timeout on msg with seq: 0x0d [BTBRIDGED] 1446346408: Sending dbus signal with seq 0x0e, netfn 0x00, lun 0x00, cmd 0x09 [BTBRIDGED] 1446346408: Adjusting timer for next element [BTBRIDGED] 1446346408: Successfully wrote 5 of 5 bytes to /dev/bt-host [BTBRIDGED] 1446346408: Completed request with seq 0x0d, netfn 0x07, lun 0x00, cmd 0x08, cc 0xce [BTBRIDGED] 1446346408: Message with seq 0x0e is being timed out despite appearing to have been responded to. Slow BT? [BTBRIDGED] 1446346408: Timeout on msg with seq: 0x0e [BTBRIDGED] 1446346408: Received a dbus response for msg with seq 0x0e [BTBRIDGED] 1446346408: Processed 1 dbus events [BTBRIDGED] 1446346408: Sending dbus signal with seq 0x0e, netfn 0x00, lun 0x00, cmd 0x09 [BTBRIDGED] 1446346408: Adjusting timer for next element [BTBRIDGED] 1446346408: Successfully wrote 12 of 12 bytes to /dev/bt-host [BTBRIDGED] 1446346408: Turning off POLLOUT for the BT in poll() [BTBRIDGED] 1446346408: Completed request with seq 0x0e, netfn 0x01, lun 0x00, cmd 0x09, cc 0x00 [BTBRIDGED] 1446346408: Received a dbus response for msg with seq 0x0e [BTBRIDGED] 1446346408: Processed 1 dbus events [BTBRIDGED] 1446346408: Adjusting timer for next element [BTBRIDGED] 1446346408: Successfully wrote 12 of 12 bytes to /dev/bt-host [BTBRIDGED] 1446346408: Turning off POLLOUT for the BT in poll() [BTBRIDGED] 1446346408: Completed request with seq 0x0e, netfn 0x01, lun 0x00, cmd 0x09, cc 0x00 [BTBRIDGED] 1446346408: Sending dbus signal with seq 0x0f, netfn 0x00, lun 0x00, cmd 0x09 [BTBRIDGED] 1446346408: Received a dbus response for msg with seq 0x0f [BTBRIDGED] 1446346408: Processed 1 dbus events [BTBRIDGED] 1446346408: Sending dbus signal with seq 0x0f, netfn 0x00, lun 0x00, cmd 0x09 [BTBRIDGED] 1446346408: Successfully wrote 12 of 12 bytes to /dev/bt-host [BTBRIDGED] 1446346408: Completed request with seq 0x0f, netfn 0x01, lun 0x00, cmd 0x09, cc 0x00 [BTBRIDGED] 1446346408: Adjusting timer for next element [BTBRIDGED] 1446346408: Successfully wrote 12 of 12 bytes to /dev/bt-host [BTBRIDGED] 1446346408: Couldn't create response message [BTBRIDGED] 1446346408: Couldn't send response message * Error in `./btbridged_msg': double free or corruption (fasttop): 0x00027878 *

nkskjames commented 8 years ago

After further investigation, it seems petitboot is sending the same sequence number twice in a row and since btbridged uses the sequence number as an identifier in the queue, can it possibly lookup wrong entry? This trace has a custom printf: printf("bt_host_write: 0x%08X,%d,%d, 0x%08X\n",bt_msg,bt_msg->req.seq,bt_msg->rsp.seq,bt_msg->call);

BTBRIDGED] Successfully wrote 12 of 12 bytes to /dev/bt-host bt_host_write: 0x000256F0,15,15, 0x00027BB8 [BTBRIDGED] Successfully wrote 12 of 12 bytes to /dev/bt-host bt_host_write: 0x000257D8,15,15, 0x00027BB8 [BTBRIDGED] Successfully wrote 12 of 12 bytes to /dev/bt-host bt_host_write: 0x000257D8,16,16, 0x00027BB8 [BTBRIDGED] Successfully wrote 12 of 12 bytes to /dev/bt-host bt_host_write: 0x00026870,16,16, 0x00027BB8 [BTBRIDGED] Successfully wrote 12 of 12 bytes to /dev/bt-host bt_host_write: 0x00026870,17,17, 0x000255B8 [BTBRIDGED] Successfully wrote 12 of 12 bytes to /dev/bt-host bt_host_write: 0x00027CF8,17,17, 0x000255B8 [BTBRIDGED] Successfully wrote 12 of 12 bytes to /dev/bt-host

jk-ozlabs commented 8 years ago

@cyrilbur-ibm , could you check this out?

cyrilbur-ibm commented 8 years ago

@nkskjames, but that shouldn't cause a double free, when btbridge receives a request it enqueues onto the end of its list, so it will have a list which might look like: [x]->[y]->[z] then say it receives another message with a sequence number z (which I'll identify as z') so: [x]->[y]->[z]->[z'] Anyone who does a lookup for z should get reference for z and from there they'll deal with it and dequeue it, next time around a reference for z would get z' and the same thing would happen.

I HAVE found a bug in bt_q_drop()!

Could you test with: https://github.com/cyrilbur-ibm/btbridge/tree/issues_10 ? Thanks!

nkskjames commented 8 years ago

I have rebooted about 200 times and no crashes. I then put old code on to make sure it still crashes. It does. So problem solved. Please do pull request. Thanks!!