Potential race condition in lo_server_next_event_delay ?

radarsat1 / liblo

liblo is an implementation of the Open Sound Control protocol for POSIX systems

GNU Lesser General Public License v2.1

189 stars 60 forks source link

Potential race condition in lo_server_next_event_delay ? #69

Closed tuscland closed 5 years ago

tuscland commented 5 years ago

Hello,

I am using a fairly recent code base (2 commits behind master). A user using TouchDesigner on a Mac Pro reported a crash which seems related to liblo. Looking at the crash, I infer that there is a race condition, though I may be wrong.

The crash is a segmentation fault in lo_server_next_event_delay:

Exception Type:  SIGSEGV
Exception Codes: SEGV_MAPERR at 0x0

Thread 7 Crashed:
0    0x00000001049b814d lo_server_next_event_delay (server.c:1876)
1    0x00000001049b80cd lo_servers_wait (server.c:1448)
2    0x00000001049b7d50 lo_server_wait (server.c:1350)

Looking at server.c at line 1876:

    if (s->queued) {
        lo_timetag now;
        double delay;

        lo_timetag_now(&now);
->      delay = lo_timetag_diff(((queued_msg_list *) s->queued)->ts, now);

... while another thread runs dispatch_method.

My interpretation is that something is null. The server is running so it can't be s. If it was s->queued->ts, the crash would occur in lo_timetag_diff. So it must be s->queued, but the if statement before checks for this condition, so something must change between the condition and the crash location.

I can see dispatch_queued modifies s->queued while lo_servers_wait read it (using lo_server_next_event_delay).

I am not sure if this is the problem and how to solve it. Any idea?

Thanks for your help!

radarsat1 commented 5 years ago

Interesting, thanks for reporting. Maybe it's too late but it isn't possible to examine a memory dump? I would use gdb here.. i'll have to see if it's possible to reproduce.

I'm not sure why dispach_method would be called on a separate thread tbh. lo_server_wait is called by lo_servers_recv_noblock, which calls lo_server_recv after waiting, which calls dispatch_data.

tuscland commented 5 years ago

Maybe this is how I use liblo?

I have a thread that loops forever testing for incoming data using lo_server_wait:

        while (_shouldRun)
        {
            if (lo_server_wait(server, 50) > 0)
            {
                CFRunLoopSourceSignal(runLoopSource);
                CFRunLoopWakeUp(runLoop);
            }
        }

When the runloop is signalled, it schedules a call in the main thread, basically doing this:

    do
    {
        result = lo_server_recv_noblock(server, 0);
       // handle error
    }
    while (result > 0);

Am I doing something wrong?

Best, Cam

radarsat1 commented 5 years ago

Ah. Well, I can see that dispatch_data, eventually called by lo_server_recv_noblock, does of course mess with the s->queued pointer, so it's possibly your loop is calling into lo_server_wait before dispatch_data is done with it. Perhaps you should use a mutex instead of a signal? Just make sure lo_server_wait isn't called before the message is handled.

There is not really a lot of benefit of waiting in a separate thread if you're going to dispatch them somewhere else. Usually if I don't want to deal with the fact that liblo threads execute the methods on a separate thread, I just stick lo_server_recv_noblock in my main loop. Incoming messages should just pile up and get dispatched by lo_server_recv_noblock anyways, and if there is nothing there, it will not block. So the lo_server_wait serves no purpose I can see.

tuscland commented 5 years ago

Thank you for the pointers.

On macOS (CoreFoundation actually), runloop sources (ways to integrate with a run loop, more specifically here the main loop) are kind of lazy. The OS schedules calls to the loop in the most efficient manner. So, I am not guaranteed that lo_server_recv_noblock will be called as soon as there is something on the socket, which cause unwanted delays.

I could try the mutex, but I remember the performance was not so great, obviously.

Also, I think I already tried to register the lo_server's socket with the main runloop source, but had problems of socket configuration compatibility between liblo and CoreFoundation.

Shall I close this issue, or is it still a valid use-case?

radarsat1 commented 5 years ago

I think there must be a way to do this correctly on OS X, I would like to use this issue to discuss it. It's pretty normal for software to use select or poll to wait on a socket. It sounds like,

Also, I think I already tried to register the lo_server's socket with the main runloop source, but had problems of socket configuration compatibility between liblo and CoreFoundation.

is the real problem here. Can you describe this better? The sockets used by liblo are just normal fds created with socket.

By the way your other option is to just be sure to perform the dispatch in the wait thread using lo_server_recv and use signals within the method handler to trigger activity in the main loop -- that's how I would normally do it I suppose.

tuscland commented 5 years ago

Hey Steve,

I made a proof of concept using CFSocket to wrap the underlying liblo server socket using Apple's APIs. It looked like all was dandy but some testing revealed it was not this couldn't possibly work with TCP: liblo manages it's own set of server sockets. For example, when lo_send_message_from is used, it automatically creates a new socket that is adds to the server's list of managed sockets. We can't monitor the sockets created this way using the CFSocket API.

This begs the question of wether lo_server_get_socket_fd is any useful... maybe when using UDP.

I went on digging into my code repository to find an older revision of the server code that featured a mutex for threads coordination (embarrassing).

I will try your suggestion of dispatching to the main thread after a call to lo_server_recv. It seems I can even use the new lo_server_thread to do the heavy lifting.

Thank you for your help!

tuscland commented 5 years ago

Just wanted to add a note, you said:

By the way your other option is to just be sure to perform the dispatch in the wait thread using lo_server_recv and use signals within the method handler to trigger activity in the main loop -- that's how I would normally do it I suppose.

The problem I see with this approach is that this requires all modifications to the lo_server (e.g. add / remove methods) to happen in the server thread, thus adding complexity. If you do the dispatch in the main thread, everything can happen in the main thread.