OS X: Unable to find previously closed device / random crashes when closing device

davidaramantSEP commented 11 years ago

OS X 10.8.3, latest version from git, USB device is known to be good (we have been using it for 7 years on Windows).

Our application has a talks to a proprietary USB IR dongle (our 'transport') which in turn talks to a variety of IR devices. Our connect loop looks something like this in pseudo code:

scan for and open all supported USB transports    
    for each transport
        for each device protocol we support
            write to transport (this instructs it to send IR commands)
            read with timeout
close all open devices

The above sequence happens in a loop with a timeout (around 30 seconds). There are usually multiple writes/reads for each device protocol; the end result being that we send tons of these requests to the HID layer in a short period of time.

We are seeing problems with the hidapi library that I think are related:

If a device is unplugged and replugged, the library will never find it again when returning the attached USB devices. I just put in a hack that activates/deactivates the library before/after each connection attempt - with this hack, it does find the device again.
When closing a device, there is a fair chance it will crash and bring down the entire process. We have resorted to sleeping the thread that accesses hidapi for 50ms after each read/write/close. This helps, but does not eliminate the problem.
After a while, writes will start taking 15s to execute. It will most likely crash when closing the device if this happens.

It feels to me like the device is not being properly closed or there is a race condition. The stack traces following a crash always include something about HID callbacks, which seems related:

Thread 1 (process 60420):
#0  0x99ce5095 in __wait4 ()
#1  0x95a28996 in waitpid$UNIX2003 ()
#2  0x000bc972 in mono_handle_native_sigsegv (signal=11, ctx=0x9dafe0) at mini-exceptions.c:2331
#3  0x0005da05 in mono_arch_handle_altstack_exception (sigctx=0x9dafe0, fault_addr=0x70000020, stack_ovf=0) at exceptions-x86.c:1135
#4  0x000e90d1 in mono_sigsegv_signal_handler (_dummy=11, info=0x9dafa0, context=0x9dafe0) at mini.c:6379
#5  <signal handler called>
#6  0x970dda87 in objc_msgSend ()
#7  0x00000014 in ?? ()
#8  0x940e2516 in __IOHIDDeviceReportCallbackRegistered ()
#9  0x09a3d80b in IOHIDDeviceClass::_hidReportHandlerCallback ()
#10 0x09a3fcf1 in IOHIDQueueClass::queueEventSourceCallback ()
#11 0x09a3b941 in IOHIDDeviceClass::_cfmachPortCallback ()
#12 0x9602847f in __CFMachPortPerform ()
#13 0x96028335 in __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE1_PERFORM_FUNCTION__ ()
#14 0x96027f62 in __CFRunLoopDoSource1 ()
#15 0x9605dcfa in __CFRunLoopRun ()
#16 0x9605d02a in CFRunLoopRunSpecific ()
#17 0x9605ce9b in CFRunLoopRunInMode ()
#18 0x98f37f5a in RunCurrentEventLoopInMode ()
#19 0x98f37cc9 in ReceiveNextEventCommon ()
#20 0x98f37b44 in BlockUntilNextEventMatchingListInMode ()
#21 0x963779aa in _DPSNextEvent ()
#22 0x963771dc in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] ()
#23 0x9636d63c in -[NSApplication run] ()
#24 0x96310666 in NSApplicationMain ()
#25 0x043871e3 in ?? ()
#26 0x04387008 in ?? ()
#27 0x00972fcc in ?? ()
#28 0x00973117 in ?? ()
#29 0x000f251d in mono_jit_runtime_invoke (method=0x7c211c1c, obj=0x0, params=0xbfff7758, exc=0x0) at mini.c:6261
#30 0x001d90fa in mono_runtime_invoke (method=0x7c211c1c, obj=0x0, params=0xbfff7758, exc=0x0) at object.c:2830
#31 0x001dbeec in mono_runtime_exec_main (method=0x7c211c1c, args=0x99bf30, exc=0x0) at object.c:4072
#32 0x001db0ac in mono_runtime_run_main (method=0x7c211c1c, argc=1, argv=0xbfff7940, exc=0x0) at object.c:3688
#33 0x000566e5 in mono_jit_exec (domain=0x50ae00, assembly=0x7ab38d60, argc=2, argv=0xbfff793c) at driver.c:954
#34 0x0000a43e in main ()

With the hack to activate/deactivate hidapi I saw a different crash which also seems to be related to threading:

Attaching to process 59957.
/tmp/mono-gdb-commands.E7Hq30:1: Error in sourced command file:
error on line 986 of "/SourceCache/gdb/gdb-1824/src/gdb/macosx/macosx-nat-mutils.c" in function "thread_t macosx_primary_thread_of_task(task_t)": (ipc/send) invalid destination port (0x10000003)

There are a number of complications with us attempting to debug this issue:

The application is a port of a Windows .NET application. We are using Xamarin.Mac to run the app using Mono. The only native code we are directly using is hidapi.
I cannot reproduce any of these issues using the Test GUI. I might try later to hack it by automating a bunch of writes/reads so it more closely mimics the behavior of our app.
Since we are using a proprietary USB device, there's probably no way we could provide code for anyone to reproduce the issue.
We know virtually nothing about native Mac development.

We're kind of stuck. I think with a combination of hacks we might be able to limp along, but it would be better if we got the bottom of what is happening. Any ideas or suggestions you have would be appreciated.

signal11 commented 11 years ago

Sounds similar to #103 . Apple changed something in Mountain Lion. They break stuff related to HID in every major release. On top of that, there are races that I'm not sure there are any good ways to work around.

I don't have 10.8, and I don't know what it is that changed, unfortunately. dantulurisudheer says he got it working. I asked him for a patch, and haven't heard from him (I just pinged him again).

If you can try to reproduce the problem with a simple command line program (or some testgui hacks), it might be easier to diagnose the problem.

mrpippy commented 11 years ago

When you say that it happens in a loop with a 30 second timeout, do you mean the timeout on the read is 30 seconds? (so it waits for 30 seconds in case the device sends data, if not then it moves to the next protocol/transport?)

At what point in the sequence are you disconnecting/reconnecting the device? During the timeout? Also, do you have multiple 'transports'/matching devices connected at the same time? Or just one device?

signal11 commented 11 years ago

Hey mrpippy. It's been a while.

Are you using 10.8 without problems?

davidaramantSEP commented 11 years ago

mrpippy: 30 seconds refers to the entire connection attempt. We will execute that pseudocode loop for roughly 30 seconds (we don't interrupt the loop, which is the "roughly" 30 seconds part). The read timeout is 2.5 seconds but the transport returns data much faster than that (I would guess 100 ms). Multiple transports is probably something I shouldn't even have mentioned, since I'm only testing with one for now.

I'm not sure at what point I unplugged the device since the connection attempt it is pretty speedy.

signal11: I saw issue 103 earlier and it looked promising... Hopefully you will hear back about it.

The Apple dev world is proving to be quite an experience for us so far, coming from MS land.

I'll work on making a simpler program to reproduce the issue. I'll have to brush up on my C++ :)

mrpippy commented 11 years ago

signal11: When 10.8 came out I tested my app and was happy with how it worked. It's using an older/slightly hacked-up version of hidapi though, and if any users are having problems the complaints might not make it back to me.

Also I just compared the HID Manager source code (it's hid.subproj, part of IOKitUser) between 10.7 and 10.8, and there's almost no changes to the HID Manager itself. The IOKit HID layer underneath definitely had changes though.

davidaramantSEP: Ah ok, so the time between opening the device and closing the device is around 30 seconds (and almost all that time is spent calling hid_read/write())? You said it often crashes when closing a device--is it actually crashing in hid_close()? Does that happen without any connection/reconnection going on? Would you have a stack trace?

davidaramantSEP commented 11 years ago

I think I made things more confusing by trying to simplify things :)

The 30 second timeout can also be ignored. The application has two modes where you either 'manually' scan for devices or it continually polls in the background (forever and ever and ever). Each manual attempt is capped at 30 seconds before we give up. We haven't gotten around to the auto-detect portion yet, so the 30 seconds happened to be on my mind.

The pseudo code I posted is for one connection attempt, which takes around 5-7 seconds. In the "manual" case we get around 5-6 attempts done; in the auto-detect case it is unbounded. We close all open USB devices at the end of each unsuccessful connection attempt. The user could plug-in or unplug a transport at any time, so we have to rescan to pick up any new ones.

There are probably over a hundred write/read calls during each connection attempt. The IR dongle is programmable; we send it scripts to run that in turn operate the IR emitter to search for devices. Each script we send has a response (even if it's "no data received") so theoretically we would never hit a read timeout in normal operation.

The crash is very frustrating. hid_close() actually returns fine (we have log statements after it that show up) but it crashes shortly thereafter. Something similar to the big snippet in my original post is always present in the stack trace when it crashes - references to callbacks and runloops. If we sleep for "a while" (lets say 100 ms) after each read/write/close, it doesn't crash (but our communication speed obviously tanks). This is why it seems like a threading issue to me.

mrpippy commented 11 years ago

Is your code doing something like:

hid_close()
print(statements that show up)
hid_open()
CRASH

And it crashes less often if you put a 50ms delay between hid_close() and hid_open()? Or does it crash after hid_close() without any other hid function being called?

davidaramantSEP commented 11 years ago

Sometimes it crashes after hid_close completes, sometimes it gets far enough to start hid_open, sometimes hid_open completes...

The sleeps are after close, read, and write. I'm not sure all three of those sleeps are needed; I haven't tested all the permutations of it.

mrpippy commented 11 years ago

So I hacked up hidtest last night to do roughly the same loop as your code:

while(1)
{
enumerate devices
open device
2000 write/read operations (60 byte packets)
close device
}

And it seemed to run ok. I'll do more testing tonight (including more disconnect testing), but I never had a segfault. I also ran hid_enumerate() in a loop and tested plugging/unplugging my device, and the number of devices detected was also correct.

I'm using 10.8.3 on a 2011 MacBook Air. What type/year Mac are you using? Also, do you have any other USB devices connected?

davidaramantSEP commented 11 years ago

Mid 2011 Mac Mini, i7 2.7GHz, 10.8.3. There's a mouse, keyboard, and a cheap hub that the transport is plugged into.

Another wrinkle I just realized is that our app has lots of threads all over the place. I've made sure that an opened handle is only accessed by one and only one thread, but I'm wondering if there is a problem with shutting down that thread right after hid_close completes. I'm still fuzzy on the run loop concept in Core Foundation.

I think all managed .NET threads turn into pthreads under the hood on Mono (on Windows there is not a 1:1 mapping between managed & native threads).

A closer match to our application would be to do everything inside that while loop in a new thread. Parts of our application is over 7 years old, and the threading design is a bit overzealous to put it mildly.

BTW, in hid.c, what does line 980 accomplish? IOHIDDeviceScheduleWithRunLoop(dev->device_handle, CFRunLoopGetMain(), kCFRunLoopDefaultMode); The previous line unschedules the device from the read thread run loop, but I don't get why it needs to be put back on the main one.

mrpippy commented 11 years ago

Ah ok, I'll try it with more threads.

As for hid.c line 980, I don't know why the device needs to be put onto the main runloop. The callbacks have just been unregistered, and there shouldn't be any more I/O going to the device.

Also, IOHIDDeviceClose() unregisters the callbacks itself, so I'm wondering if hid_close() could be simplified to look like this:

void HID_API_EXPORT hid_close(hid_device *dev)
{
    if (!dev)
        return;

    /* Close the OS handle to the device, but only if it's not
       been unplugged. If it's been unplugged, then calling
       IOHIDDeviceClose() will crash. */
    if (!dev->disconnected) {
        IOHIDDeviceClose(dev->device_handle, kIOHIDOptionsTypeSeizeDevice);
    }

    /* Cause read_thread() to stop. */
    dev->shutdown_thread = 1;

    /* Wake up the run thread's event loop so that the thread can exit. */
    CFRunLoopSourceSignal(dev->source);
    CFRunLoopWakeUp(dev->run_loop);

    /* Notify the read thread that it can shut down now. */
    pthread_barrier_wait(&dev->shutdown_barrier);

    /* Wait for read_thread() to end. */
    pthread_join(dev->thread, NULL);

    /* Clear out the queue of received reports. */
    pthread_mutex_lock(&dev->mutex);
    while (dev->input_reports) {
        return_data(dev, NULL, 0);
    }
    pthread_mutex_unlock(&dev->mutex);
    CFRelease(dev->device_handle);

    free_hid_device(dev);
}

davidaramantSEP commented 11 years ago

I tried your new close method without any of my hacks. It actually ran for a few minutes before crashing (this is much better than before). The crash is also much more interesting:

objc[70400]: Object 0x7b9056d0 of class __NSCFString autoreleased with no pool in place - just leaking - break on objc_autoreleaseNoPool() to debug
2013-04-09 18:21:05.958 Device Link[70400:234f] -[__NSCFType bytes]: unrecognized selector sent to instance 0x7b905630
2013-04-09 18:21:05.959 Device Link[70400:234f] An uncaught exception was raised
2013-04-09 18:21:05.959 Device Link[70400:234f] -[__NSCFType bytes]: unrecognized selector sent to instance 0x7b905630
2013-04-09 18:21:06.415 Device Link[70400:234f] (
    0   CoreFoundation                      0x96161e9b __raiseError + 219
    1   libobjc.A.dylib                     0x970e952e objc_exception_throw + 230
    2   CoreFoundation                      0x96165b0d -[NSObject(NSObject) doesNotRecognizeSelector:] + 253
    3   CoreFoundation                      0x960ade97 ___forwarding___ + 487
    4   CoreFoundation                      0x960adc42 _CF_forwarding_prep_0 + 50
    5   CoreFoundation                      0x9603d110 CFDataGetBytePtr + 80
    6   IOKit                               0x940e2516 __IOHIDDeviceReportCallbackRegistered + 29
    7   IOHIDLib                            0x09b1d80b _ZN16IOHIDDeviceClass25_hidReportHandlerCallbackEPviS0_ + 283
    8   IOHIDLib                            0x09b1fcf1 _ZN15IOHIDQueueClass24queueEventSourceCallbackEP12__CFMachPortP17mach_msg_header_tlPv + 43
    9   IOHIDLib                            0x09b1b941 _ZN16IOHIDDeviceClass19_cfmachPortCallbackEP12__CFMachPortP17mach_msg_header_tlPv + 77
    10  CoreFoundation                      0x9602847f __CFMachPortPerform + 303
    11  CoreFoundation                      0x96028335 __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE1_PERFORM_FUNCTION__ + 53
    12  CoreFoundation                      0x96027f62 __CFRunLoopDoSource1 + 146
    13  CoreFoundation                      0x9605dcfa __CFRunLoopRun + 2154
    14  CoreFoundation                      0x9605d02a CFRunLoopRunSpecific + 378
    15  CoreFoundation                      0x9605ce9b CFRunLoopRunInMode + 123
    16  libUsbHid.dylib                     0x0919c2b2 read_thread + 290
    17  libsystem_c.dylib                   0x9599e5b7 _pthread_start + 344
    18  libsystem_c.dylib                   0x95988d4e thread_start + 34
)
2013-04-09 18:21:06.416 Device Link[70400:234f] *** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '-[__NSCFType bytes]: unrecognized selector sent to instance 0x7b905630'
*** Call stack at first throw:
(
    0   CoreFoundation                      0x96161e9b __raiseError + 219
    1   libobjc.A.dylib                     0x970e952e objc_exception_throw + 230
    2   CoreFoundation                      0x96165b0d -[NSObject(NSObject) doesNotRecognizeSelector:] + 253
    3   CoreFoundation                      0x960ade97 ___forwarding___ + 487
    4   CoreFoundation                      0x960adc42 _CF_forwarding_prep_0 + 50
    5   CoreFoundation                      0x9603d110 CFDataGetBytePtr + 80
    6   IOKit                               0x940e2516 __IOHIDDeviceReportCallbackRegistered + 29
    7   IOHIDLib                            0x09b1d80b _ZN16IOHIDDeviceClass25_hidReportHandlerCallbackEPviS0_ + 283
    8   IOHIDLib                            0x09b1fcf1 _ZN15IOHIDQueueClass24queueEventSourceCallbackEP12__CFMachPortP17mach_msg_header_tlPv + 43
    9   IOHIDLib                            0x09b1b941 _ZN16IOHIDDeviceClass19_cfmachPortCallbackEP12__CFMachPortP17mach_msg_header_tlPv + 77
    10  CoreFoundation                      0x9602847f __CFMachPortPerform + 303
    11  CoreFoundation                      0x96028335 __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE1_PERFORM_FUNCTION__ + 53
    12  CoreFoundation                      0x96027f62 __CFRunLoopDoSource1 + 146
    13  CoreFoundation                      0x9605dcfa __CFRunLoopRun + 2154
    14  CoreFoundation                      0x9605d02a CFRunLoopRunSpecific + 378
    15  CoreFoundation                      0x9605ce9b CFRunLoopRunInMode + 123
    16  libUsbHid.dylib                     0x0919c2b2 read_thread + 290
    17  libsystem_c.dylib                   0x9599e5b7 _pthread_start + 344
    18  libsystem_c.dylib                   0x95988d4e thread_start + 34
)

mrpippy commented 11 years ago

Interesting, you can see that it crashed in read_thread() there. Do you know what was happening when it crashed? (I assume the device was open and it was reading/writing?)

davidaramantSEP commented 11 years ago

Unfortunately I didn't check the dev log on that crash before it got erased. It was executing our connection loop, so it was most likely reading/writing.

I haven't been able to reproduce that exact crash again. So far it bombs with much like it did before: 1 - 3 minutes of running the connection loop, usually inside of a hid_read or hid_close call.

The crash info starts with this:

Native stacktrace:

    0   Device Link                         0x00140154 mono_handle_native_sigsegv + 292
    1   Device Link                         0x0016bf85 mono_sigsegv_signal_handler + 197
    2   libsystem_c.dylib                   0x9598a8cb _sigtramp + 43
    3   ???                                 0xffffffff 0x0 + 4294967295
    4   IOKit                               0x940e2516 __IOHIDDeviceReportCallbackRegistered + 29
    5   IOHIDLib                            0x09c7980b _ZN16IOHIDDeviceClass25_hidReportHandlerCallbackEPviS0_ + 283
    6   IOHIDLib                            0x09c7bcf1 _ZN15IOHIDQueueClass24queueEventSourceCallbackEP12__CFMachPortP17mach_msg_header_tlPv + 43
    7   IOHIDLib                            0x09c77941 _ZN16IOHIDDeviceClass19_cfmachPortCallbackEP12__CFMachPortP17mach_msg_header_tlPv + 77
    8   CoreFoundation                      0x9602847f __CFMachPortPerform + 303
    9   CoreFoundation                      0x96028335 __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE1_PERFORM_FUNCTION__ + 53
    10  CoreFoundation                      0x96027f62 __CFRunLoopDoSource1 + 146
    11  CoreFoundation                      0x9605dcfa __CFRunLoopRun + 2154
    12  CoreFoundation                      0x9605d02a CFRunLoopRunSpecific + 378
    13  CoreFoundation                      0x9605ce9b CFRunLoopRunInMode + 123
    14  libUsbHid.dylib                     0x091ec2b2 read_thread + 290
    15  libsystem_c.dylib                   0x9599e5b7 _pthread_start + 344
    16  libsystem_c.dylib                   0x95988d4e thread_start + 34

I can post the entire thing, but it's enormous since there are 18 threads active. None of them look particularly interesting except for a thread that has a lot of stuff about HID callbacks in it's stacktrace (you can see an example of a trace like that in my first post).

mrpippy commented 11 years ago

The main thing that I see in every one of your crash logs is __IOHIDDeviceReportCallbackRegistered + 29: __IOHIDDeviceReportCallbackRegistered() is the internal HID Manager function that calls the callback registered with IOHIDDeviceRegisterInputReportCallback().

I've looked closer at the IOHIDManager code, and I think there are race conditions that could cause this crash if input reports are received while a device is being closed, or possibly after an input report callback is unregistered. I'm gonna try to test this more tonight.

The HID Manager code gives me the impression that it's not designed to be thread-safe, i.e. calls to IOHIDDeviceClose() should not happen while a run loop (that could dispatch callbacks) is running. You said earlier that you access an opened device only on one thread, but does that mean you're doing hid_init() or hid_enumerate() on a different thread? Would it be possible to do all your hidapi calls from a single thread?

davidaramantSEP commented 11 years ago

Your investigation sounds promising!

I tried using a single thread for all API calls but I thought it had the same problem. I can try again, I suppose. Since we are supposed to support multiple transports, using a single thread for all communication would be a bit of a bottleneck.

Currently there is one thread that "owns" hid_init and hid_exit, one thread for enumeration, and each device gets a new thread for open/read/write/close.

mrpippy commented 11 years ago

I don't think that running hid_init() and hid_enumerate() on different threads is a good idea. hid_init() creates the IOHIDManager and schedules it on the current run loop (which is per-thread). hid_enumerate() later runs the current run loop quickly to process some events, but this won't work as intended if they're in separate threads.

I just hacked up hidtest to call hid_init() from a newly-created pthread, and it segfaults pretty quickly, either in hid_init() or later on. I'm not sure why, but I think it's because a raw pthread needs to have a run loop that's being run in order to work.

(I was trying to figure out exactly how mono/xamarin implements threads and what they do with run loops for user-created threads, but didn't see any info. I'll probably have to look at the source)

davidaramantSEP commented 11 years ago

I've re-confirmed that running all the API calls on the same thread still bombs with the same error. I'll also investigate what Mono does with threads.

davidaramantSEP commented 11 years ago

I found the code for the managed thread and some of the unmanaged code.

mrpippy commented 11 years ago

It looks like Mono's threads are built on top of GLib, which uses straight pthreads on the Mac. I'm not sure, but I doubt Mono is messing with the run loop at all for user-created threads.

Also it turns out I had some problems with my hidtest code, with those fixed running hid_init() from a separate thread works fine.

Does your device send input reports only in response to a report from the host, or does it send them all the time?

davidaramantSEP commented 11 years ago

The transport uses a strictly request-response protocol; the host is in charge of everything. There should be no unexpected reports from the transport.

So does your newly created pthread do anything with regards to run loops? Are you reading/writing on a separate thread as well?

If I do have to do something with run loops, is it enough to call CFRunLoopRun when the thread starts and CFRunLoopStop before it is shut down?

davidaramantSEP commented 11 years ago

Well, calling CFRunLoopStop before the thread stops seems to insta-kill the entire process without any type of error. CFRunLoopRun when it starts doesn't seem to fix anything either, but it does make the program hang when you unplug the transport.

signal11 commented 11 years ago

Hi Guys,

Sorry I've been unresponsive. Run Loops with respect to threading is not something I was ever able to find real clear authoritative documentation on. My understanding of how it works was mostly derived empirically, and also with the help of mrpippy back a couple years ago.

In reading some of the git log, I'm reminded that I came to the realization at one point that Apple intends for an application to have an IOHIDManager handle for each device. The way we currently do it is to have one IOHIDManager and then an IOHIDDevice for each device. I'm not sure that matters to this particular case, but my experience has been that when you don't do stuff exactly the way they want you to, you have problems (and it's not always clear what they want you to do).

I also wonder some about having a lot of threads. HIDAPI on mac should be threadsafe, but it doesn't mean there aren't bugs. hid_init() should be called one time at the start of execution, per the documentation, preferably from the main thread.

I'm also suspicious about the mono stuff. Is there a minimal test program which shows the symptoms? Maybe start with hidtest, and then build onto it (make it more complicated) until you see the bad behavior.

I also wonder, since you're calling hid_enumerate() from a lot of different threads, whether it needs to be mutexed, and then have the run loop moved to the current thread before process_pending_events() is called.

There's also an issue with dev->disconnected. When the device has been disconnected, it will full-up crash if certain functions are called, even if the handle is still open. The only way I know to get around this is to set dev->disconnected in the device removal callback, and then check for it all over the place. This seems racy, but I've yet to find a better way to do it. IIRC, mrpippy tested this pretty hard with open/read/write/close/unplug (in random order). Maybe that was someone else.

I hope maybe something I've said here will trigger a light bulb.

Alan.

mrpippy commented 11 years ago

davidaramantSEP: I resurrected a hidapi branch that used IOKit paths, give it a try and see if it helps. The main difference is that hid_open_path() doesn't call hid_enumerate(), so it might work better when hid_open() is called in a separate thread. The changes are all in mac/hid.c, you should be able to just swap it in to your code. (note that it doesn't have any of the hid_close() changes I suggested earlier) https://github.com/mrpippy/hidapi/tree/iokit_path

signal11: I think our use of IOHIDManager and IOHIDDevice is correct: one manager is used to enumerate devices, and an IOHIDDevice is created for each device. Our usage is a little different than Apple's intent, since we enumerate all devices and do VID/PID matching manually (rather than setting the matching criteria in the IOHIDManager), but it doesn't seem to be a problem. The documentation does leave out a lot of possible use cases though, which makes it hard to know which ones will work and which won't.

Also, HID Manager definitely has bugs! If you register an input report callback, then unregister it, it will segfault when the next input report comes in. I just added this as issue #116, but I think the only effect for hidapi is a possible race condition in hid_close()

signal11 commented 11 years ago

Also, HID Manager definitely has bugs! If you register an input report callback, then unregister it, it will segfault when the next input report comes in. I just added this as issue #116, but I think the only effect for hidapi is a possible race condition in hid_close()

Isn't that exactly what this issue is about? (well one of the things)

mrpippy commented 11 years ago

He said his device uses a request-response protocol, so there shouldn't be input reports coming in during hid_close(). Still, I suggested some hid_close() changes in https://github.com/signal11/hidapi/issues/114#issuecomment-16142620 and he said it lasted much longer without crashing, so I guess it is solving some problem!

davidaramantSEP commented 11 years ago

I tried the IOKit path hid.c with the new hid_close method you posted earlier. Unfortunately it still crashed with the __IOHIDDeviceReportCallbackRegistered + 29 error.

I'm going to dust off a console program that I was testing with earlier (still using Mono) and see if the behavior is any different. I think it worked better, but I don't know if I bothered leaving it running for very long.

In parallel to getting an actual fix, I was progressing with a series of hacks to work around the two problems we see but the workarounds stopped... working. I'll explain our terminology again: [Application] <- USB -> [Transport] <- IR -> [Device]

The transports are normally not very interesting to us; it's talking to the IR devices on the other end that represent 95% of our code base.

Not finding transport after unplugging/replugging: Lazily call hid_init before it's needed; use reference counting to call hid_exit when the last open USB device was closed. This used to work, but now it doesn't - the transport is never found by hid_enumerate after it's been replugged (identical behavior to not using this hack). Sleep thread after read/write/close to prevent crashing: This seems to work fine, but it utterly destroys communication speed. We've discovered that the sleeps are not needed after it finds an IR device and starts talking to it, so they are actually toggled off at runtime. There is a lot more processing that happens when talking to an actual IR device, so I doubt we send as many reports back and forth.

We can live with the second hack, but not be able to discover transports is pretty fatal.

davidaramantSEP commented 11 years ago

Correction: With the IOKit path patch, it does find the transport again, but the path returned is empty. Could that be because the USB device hasn't been fully closed yet?

mrpippy commented 11 years ago

In your latest tests, what thread are you calling hid_init() from, and what thread does hid_enumerate() get called from? hid_init() creates the IOHIDManager and attaches it to the current thread's run loop. If that thread's run loop never gets run again, the HID Manager's list of devices probably won't be updated correctly.

I think the path is empty because the HID Manager's device info is stale (because it's run loop isn't running), so the device being queried was already unplugged.

davidaramantSEP commented 11 years ago

You're right; the change I made to make sure Enumerate and Init happen on the same thread got lost in all the mess of applying/undoing hacks. When I redo that change, rediscovering a USB device works again without having to continually call exit/init (hooray).

signal11 commented 11 years ago

So what's the answer then? What needs to change in HIDAPI?

It sounds like you needed to call hid_enumerate() from the same thread as hid_init(). hid_enumerate() already calls process_pending_events(). Does the run loop need to get moved to the hid_enumerate() thread before the call to process_pending_events()? Does it then need to get moved back to the read_thread afterwards? In so doing, does it need to be mutexed with the read thread?

davidaramantSEP commented 11 years ago

I'm pretty sure that crash is still in there, so only one of the problems has been fixed so far. I'm currently running with the IOKit branch + new hid_close method.

Moving the run loop in hid_enumerate seems to make sense, although I'm not sure what would happen to hid_open which calls it internally. We use hid_open_path so we don't have to worry about it, but what would happen if you tried to open two devices at the same time from different threads?

I think the hid_init calls inside of hid_open_path and hid_enumerate complicate things as well. This seems like it would just introduce potentially weird behavior if you never call it explicitly and your program is as thread-happy as ours is - returning an error if it hasn't been called seems more appropriate.

I'm a bit confused by what you mean by the read thread. There should only be a read thread if a device is open, right? I don't know if scanning for devices is supposed to be supported when there is currently a handle open; we never do that either.

signal11 commented 11 years ago

You guys have mentioned code changes you've made. Are there any patches?

davidaramantSEP commented 11 years ago

I haven't made any. I just used hid.c from the IOKit branch mentioned in this thread with the new close method posted above.

We got HIDAPI into a usable state using those two changes so I've had to focus on other parts of the application. I did find out that the sleep after hid_read_timeout was not needed, but we still need a 50 ms sleep after hid_write and hid_close. However, this is only needed when we are looking for devices; once we find one, we can toggle the sleeps off without any crashes occurring.

I'm not very happy about having that hack in there and not getting to the bottom of the problem, but unfortunately we have to be a bit pragmatic about where we spend time in order to meet our deadline.

signal11 commented 11 years ago

For the hid_close() modifications, I see you took out the unregistering of the report callbacks. Everything that is in there was put in there for a reason, because it was necessary on some version or other of OSX (maybe those reasons are gone on newer versions though)). I seem to remember that when porting to Lion (and I use the term "port" because getting this to run on Lion was a disaster) the input report callback needed to be explicitly unregistered, or else it would keep getting called even after IOHIDDeviceClose(). Maybe that's not necessary anymore.

nyholku commented 8 years ago

Three years later... any news on this?

In my HIDAPI clone (PureJavaHidApi) I'm experiencing something similar on Yosemite. My lib is Java but the code logic was lifted from HIDAPI and is almost identical.

If I open a device and unplug/replug it device, then I cannot open it again.

I traced this to the fact that after unplugging IOHIDManagerCopyDevices still returns a list where the device is present even if it has been unplugged. After re-plugging the list then contains to instances of that device.

When the list is then searched the wrong ('old') device is found and naturally opening that fails.

So somehow the device is not properly closed (in my code) but left dangling. I've tried various ways to close the device in the removal call back but without success.

As a work around I changed my open code to use the last matching entry in the list returned by IOHIDManagerCopyDevices, this improves things but unfortunately sometimes the latest plugged in incarnation of the device is not the last in the list.

Anyway, posting this here just in case this issue is still unsolved or someone else can find some use for this titbit of debugging information. And also fishing for solution to my problem of course ;)

nyholku commented 8 years ago

Posting this here also for completeness, i.e. while my previous post was about PureJavaHidApi and not signal11/hidapi I'm posting the solution here for posterity in case someone ends up looking for solution to this kind of behaviour.

After spending all spare time for six days finally nailed it to this:

I did not call

IOHIDManagerSetDeviceMatching(hidmgr, null);

before (in device enumeration) calling:

IOHIDManagerCopyDevices(hidmgr);

What an insidious feature in IOKit, without that call everything seems to work, except if you unplug a device then something is left dangling inside IOKit and IOHIDManagerCopyDevices will return the unplugged device among other devices.

br Kusti

signal11 / hidapi

OS X: Unable to find previously closed device / random crashes when closing device #114