votingworks / vxsuite

https://voting.works
30 stars 6 forks source link

Understand and fix sporadic PDI driver failures / timeouts on startup #4755

Open arsalansufi opened 6 months ago

arsalansufi commented 6 months ago

Because we retry automatically, this currently doesn't have any affect on UX - users won't notice when this happens. But would still be good to understand why this is happening under the hood.

Update: There may be something more going on here per https://votingworks.slack.com/archives/CEL6D3GAD/p1722286479538139

eventualbuddha commented 2 months ago

I checked that the libusb packaged with Debian 12.6 (libusb-1.0.26.11724) exhibits the timeout behavior described in Slack. I compiled libusb-1.0.27.11882 instead (since v1.0.27 seems not to exhibit the timeout behavior on macOS), but it behaves the same way. I ran this command to get my built-and-installed-to-/usr/local/lib version of libusb to be loaded:

D_LIBRARY_PATH=/usr/local/lib cargo run --example get_test_string --release -- --times 5

I also tried adding a 2s sleep in between each retry and that does seem to help it recover:

01: get_test_string result: Err(RecvTimeout(Timeout))
02: get_test_string result: Ok(" Test Message USB 1.1/2.0 Communication")
03: get_test_string result: Ok(" Test Message USB 1.1/2.0 Communication")
04: get_test_string result: Ok(" Test Message USB 1.1/2.0 Communication")
05: get_test_string result: Ok(" Test Message USB 1.1/2.0 Communication")

Whether this behavior will carry over to the actual binary I'm not sure, but it does fit with an observation that the official PDI driver does seem to take a little while to report a successful connection (O(seconds)).

eventualbuddha commented 2 months ago

I've tried modifying get_test_string to alternate between the GetTestString command and the GetFirmwareVersion commands. If I run it long enough it always eventually times out on one of the requests (maybe always GetTestString?). Most of the time it takes just a few milliseconds to get a response, but when it times out it apparently doesn't receive anything for a full second.

I tried to go back to a more basic reproduction using a simple C program. Here's what I ended up with:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <libusb-1.0/libusb.h>
#include <pthread.h>
#include <unistd.h>

/// Vendor ID for the PDI scanner.
#define VENDOR_ID 0x0bd7

/// Product ID for the PDI scanner.
#define PRODUCT_ID 0xa002

int send_command_and_print_response(
    libusb_context *context,
    struct libusb_transfer *transfer_command_out,
    struct libusb_transfer *transfer_command_in,
    unsigned char *command_out,
    unsigned char *command_in)
{
    // have incoming transfer ready before sending
    int result = libusb_submit_transfer(transfer_command_in);
    printf("transfer_command_in->status = %d\n", transfer_command_in->status);

    if (result != LIBUSB_SUCCESS)
    {
        printf("libusb_submit_transfer(transfer_command_in) failed with %d\n", result);
        return result;
    }

    result = libusb_submit_transfer(transfer_command_out);

    if (result != LIBUSB_SUCCESS)
    {
        printf("libusb_submit_transfer(transfer_command_out) failed with %d\n", result);
        return result;
    }

    struct timeval tv;
    tv.tv_sec = 1;
    tv.tv_usec = 0;

    // wait for outgoing command to complete
    result = libusb_handle_events_timeout(context, &tv);

    if (result != LIBUSB_SUCCESS)
    {
        printf("libusb_handle_events failed with %d\n", result);
        return result;
    }

    if (transfer_command_out->status != LIBUSB_TRANSFER_COMPLETED)
    {
        printf("expected transfer_command_out->status == LIBUSB_TRANSFER_COMPLETED, got %d\n", transfer_command_out->status);
    }

    printf("Command sent\n");

    // receive response

    printf("Receiving response…\n");

    // wait for incoming command to complete
    result = libusb_handle_events_timeout(context, &tv);

    if (result != LIBUSB_SUCCESS)
    {
        printf("libusb_handle_events failed with %d\n", result);
        return result;
    }

    if (transfer_command_in->status != LIBUSB_TRANSFER_COMPLETED)
    {
        printf("expected transfer_command_in->status == LIBUSB_TRANSFER_COMPLETED, got %d\n", transfer_command_in->status);
    }

    // print response
    printf("Response received (length=%d): ", transfer_command_in->actual_length);
    for (int i = 0; i < transfer_command_in->actual_length; i++)
    {
        printf("%02x ", command_in[i]);
    }

    printf("\n");

    // print response up to a given length
    int max_length = transfer_command_in->actual_length;
    printf("Response as string: %.*s\n", max_length, command_in);

    return LIBUSB_SUCCESS;
}

int main()
{
    const struct libusb_version *version = libusb_get_version();
    printf("libusb v%d.%d.%d.%d\n",
           version->major, version->minor, version->micro, version->nano);

    libusb_context *context = NULL;
    int result = libusb_init(&context);

    if (result < 0)
    {
        printf("libusb_init failed with %d\n", result);
        return 1;
    }

    libusb_device **list = NULL;
    ssize_t count = libusb_get_device_list(context, &list);

    if (count < 0)
    {
        printf("libusb_get_device_list failed with %ld\n", count);
        libusb_exit(context);
        return 1;
    }

    printf("Found %ld devices\n", count);

    libusb_device *scanner_device = NULL;

    for (ssize_t i = 0; i < count; i++)
    {
        libusb_device *device = list[i];
        struct libusb_device_descriptor descriptor;
        result = libusb_get_device_descriptor(device, &descriptor);

        if (result < 0)
        {
            printf("libusb_get_device_descriptor failed with %d\n", result);
            libusb_exit(context);
            return 1;
        }

        if (descriptor.idVendor == VENDOR_ID && descriptor.idProduct == PRODUCT_ID)
        {
            scanner_device = device;
            break;
        }
    }

    libusb_free_device_list(list, 1);

    if (scanner_device == NULL)
    {
        printf("Scanner device not found\n");
        libusb_exit(context);
        return 1;
    }

    libusb_device_handle *handle = NULL;
    result = libusb_open(scanner_device, &handle);

    if (result < 0)
    {
        printf("libusb_open failed with %d\n", result);
        libusb_exit(context);
        return 1;
    }

    printf("Scanner device opened\n");

    result = libusb_claim_interface(handle, 0);

    if (result < 0)
    {
        printf("libusb_claim_interface failed with %d\n", result);
        libusb_close(handle);
        libusb_exit(context);
        return 1;
    }

    struct libusb_transfer *transfer_command_out = libusb_alloc_transfer(0);
    struct libusb_transfer *transfer_command_in = libusb_alloc_transfer(0);

    // allocate buffers
    unsigned char *command_out = malloc(5000);
    unsigned char *command_in = malloc(12000000);

    assert(transfer_command_out != NULL);
    assert(transfer_command_in != NULL);

    libusb_fill_bulk_transfer(transfer_command_out, handle, 0x05, command_out, 0, NULL, NULL, 0);
    transfer_command_out->status = LIBUSB_TRANSFER_NO_DEVICE;
    libusb_fill_bulk_transfer(transfer_command_in, handle, 0x85, command_in, 12000000, NULL, NULL, 0);
    transfer_command_in->status = LIBUSB_TRANSFER_NO_DEVICE;

    // result = libusb_reset_device(handle);

    // if (result < 0)
    // {
    //     printf("libusb_reset_device failed with %d\n", result);
    //     goto EXIT;
    // }

    while (1)
    {
        // send command "D": 0x02 0x44 0x03 0xb4
        command_out[0] = 0x02;
        command_out[1] = 0x44;
        command_out[2] = 0x03;
        command_out[3] = 0xb4;
        transfer_command_out->length = 4;

        result = send_command_and_print_response(context, transfer_command_out, transfer_command_in, command_out, command_in);

        if (result != LIBUSB_SUCCESS)
        {
            goto EXIT;
        }

        usleep(2000); // 2 ms

        // send command "V": 0x02 0x56 0x03 0xb7
        command_out[0] = 0x02;
        command_out[1] = 0x56;
        command_out[2] = 0x03;
        command_out[3] = 0xb7;
        transfer_command_out->length = 4;

        result = send_command_and_print_response(context, transfer_command_out, transfer_command_in, command_out, command_in);

        if (result != LIBUSB_SUCCESS)
        {
            goto EXIT;
        }

        usleep(1000000); // 1 s
    }

EXIT:

    libusb_release_interface(handle, 0);
    libusb_close(handle);
    libusb_exit(context);

    return 0;
}

It fails after a little while on Debian Linux in a VM on Parallels on macOS with error code -6 (LIBUSB_ERROR_BUSY). When run directly on macOS (exact same hardware) it runs without error even after hundreds of request/response pairs.

eventualbuddha commented 2 months ago

Running this on Ubuntu 20 seems to have essentially the same behavior as Debian 12:

libusb v1.0.23.11397
Found 7 devices
Scanner device opened
transfer_command_in->status = 5
Command sent
Receiving response…
Response received (length=42): 02 44 20 54 65 73 74 20 4d 65 73 73 61 67 65 20 55 53 42 20 31 2e 31 2f 32 2e 30 20 43 6f 6d 6d 75 6e 69 63 61 74 69 6f 6e 03 
Response as string: D Test Message USB 1.1/2.0 Communication
transfer_command_in->status = 0
Command sent
Receiving response…
Response received (length=12): 02 56 39 33 35 38 37 30 33 31 28 03 
Response as string: V93587031(
transfer_command_in->status = 0
Command sent
Receiving response…
Response received (length=42): 02 44 20 54 65 73 74 20 4d 65 73 73 61 67 65 20 55 53 42 20 31 2e 31 2f 32 2e 30 20 43 6f 6d 6d 75 6e 69 63 61 74 69 6f 6e 03 
Response as string: D Test Message USB 1.1/2.0 Communication
transfer_command_in->status = 0
Command sent
Receiving response…
Response received (length=12): 02 56 39 33 35 38 37 30 33 31 28 03 
Response as string: V93587031(
transfer_command_in->status = 0
Command sent
Receiving response…
Response received (length=42): 02 44 20 54 65 73 74 20 4d 65 73 73 61 67 65 20 55 53 42 20 31 2e 31 2f 32 2e 30 20 43 6f 6d 6d 75 6e 69 63 61 74 69 6f 6e 03 
Response as string: D Test Message USB 1.1/2.0 Communication
transfer_command_in->status = 0
Command sent
Receiving response…
Response received (length=12): 02 56 39 33 35 38 37 30 33 31 28 03 
Response as string: V93587031(
transfer_command_in->status = 0
Command sent
Receiving response…
Response received (length=42): 02 44 20 54 65 73 74 20 4d 65 73 73 61 67 65 20 55 53 42 20 31 2e 31 2f 32 2e 30 20 43 6f 6d 6d 75 6e 69 63 61 74 69 6f 6e 03 
Response as string: D Test Message USB 1.1/2.0 Communication
transfer_command_in->status = 0
Command sent
Receiving response…
Response received (length=12): 02 56 39 33 35 38 37 30 33 31 28 03 
Response as string: V93587031(
transfer_command_in->status = 0
Command sent
Receiving response…
Response received (length=42): 02 44 20 54 65 73 74 20 4d 65 73 73 61 67 65 20 55 53 42 20 31 2e 31 2f 32 2e 30 20 43 6f 6d 6d 75 6e 69 63 61 74 69 6f 6e 03 
Response as string: D Test Message USB 1.1/2.0 Communication
transfer_command_in->status = 0
Command sent
Receiving response…
Response received (length=12): 02 56 39 33 35 38 37 30 33 31 28 03 
Response as string: V93587031(
transfer_command_in->status = 0
Command sent
Receiving response…
Response received (length=42): 02 44 20 54 65 73 74 20 4d 65 73 73 61 67 65 20 55 53 42 20 31 2e 31 2f 32 2e 30 20 43 6f 6d 6d 75 6e 69 63 61 74 69 6f 6e 03 
Response as string: D Test Message USB 1.1/2.0 Communication
transfer_command_in->status = 0
Command sent
Receiving response…
Response received (length=12): 02 56 39 33 35 38 37 30 33 31 28 03 
Response as string: V93587031(
transfer_command_in->status = 0
Command sent
Receiving response…
Response received (length=42): 02 44 20 54 65 73 74 20 4d 65 73 73 61 67 65 20 55 53 42 20 31 2e 31 2f 32 2e 30 20 43 6f 6d 6d 75 6e 69 63 61 74 69 6f 6e 03 
Response as string: D Test Message USB 1.1/2.0 Communication
transfer_command_in->status = 0
Command sent
Receiving response…
Response received (length=12): 02 56 39 33 35 38 37 30 33 31 28 03 
Response as string: V93587031(
transfer_command_in->status = 0
Command sent
Receiving response…
Response received (length=42): 02 44 20 54 65 73 74 20 4d 65 73 73 61 67 65 20 55 53 42 20 31 2e 31 2f 32 2e 30 20 43 6f 6d 6d 75 6e 69 63 61 74 69 6f 6e 03 
Response as string: D Test Message USB 1.1/2.0 Communication
transfer_command_in->status = 0
Command sent
Receiving response…
Response received (length=42): 02 44 20 54 65 73 74 20 4d 65 73 73 61 67 65 20 55 53 42 20 31 2e 31 2f 32 2e 30 20 43 6f 6d 6d 75 6e 69 63 61 74 69 6f 6e 03 
Response as string: D Test Message USB 1.1/2.0 Communication
transfer_command_in->status = 0
libusb_submit_transfer(transfer_command_in) failed with -6

The double test string response at the end is weird.

eventualbuddha commented 2 months ago

Debian 12 VM running under UTM (via QEMU) instead of Parallels does not exhibit the buggy behavior. May be time to say goodbye to Parallels.