unispeech / unimrcp

Open source cross-platform implementation of MRCP protocol
http://www.unimrcp.org
Apache License 2.0
379 stars 167 forks source link

ntp issue #257

Closed Mert84 closed 4 years ago

Mert84 commented 5 years ago

We are having an issue with ntp with a bad/slow ntp time source. An ntp time source is making us automatically adjust the machine-time several times a day and causes several seconds of time shifts each time. We do not want our server machines' time to be changing too much so we are working on ntp side. But I also want to also make sure we are doing all the correct things on our app.

The issue with unimrcp is that when the time of the server machine is adjusted to several seconds to the past, unimrcpserver stops processing the messages until the system clock reaches the latest observed time(again!), after this time is reached processing continues. This may cause several seconds of hangs.

I have checked unimrcp code and saw that in file apt_poller_task.c :function apt_poller_task_run(...) calls apt_timer_queue_advance(...) based on the values returned from apr_time_now() which happens to be the system time. This seems to cause this behavior.

Normally in C++ there is a new monotonically increasing clock named std::steady_clock. But I do not know the C way of fixing this. What would your recommendation be?

Would the following else condition be a proper workaround:

if(timeout != -1) {
    time_now = apr_time_now();
    if(time_now > time_last) {
        apt_timer_queue_advance(task->timer_queue,(apr_uint32_t)((time_now - time_last)/1000));
    }
    else
    {
        time_last = time_now;
    }
}

Regards Mert Büyüktuncay

achaloyan commented 5 years ago

Hi Mert,

I agree with your analysis.

For an application such as UniMRCP server, it is quite important to avoid clock drifts, as it may cause issues in processing of RTP streams as well.

The other question is how to make the software more tolerant to such problems, if they exist. Quickly looking through the code, I think the change you suggested would improve the behavior, and I do not see any drawbacks so far.

Regards

fatihkiralioglu commented 5 years ago

Hi Arsen, Do you expect us to send any log file or directly send a pull request? Best Regards.

achaloyan commented 5 years ago

By having a second look at the suggested change, while I still do not see any regression, the change does not seem to allow for any improvement either. Please note that time_last will be set/overridden on next around in the loop, anyway. Feel free to correct me, if I am missing anything obvious here being tired at the end of the day.

Otherwise, I think the attached patch would properly address the issue. ntp-clock-drift-patch.zip

Mert84 commented 4 years ago

Wow that's right, I have missed that. I think your patch makes more sense.

I am just back from vacation. I will have a little more time in the coming days.I think the best way for me to understand what's going on here will be by debugging this.

By the way with a quick search I had found the following uses of apr_time_now(): libs\apr-toolkit\src\apt_consumer_task.c(154): time_now = apr_time_now(); libs\apr-toolkit\src\apt_poller_task.c(272): time_now = apr_time_now(); libs\mpf\src\mpf_scheduler.c(207): time_now = apr_time_now(); libs\mpf\src\mpf_scheduler.c(227): time_now = apr_time_now();

The patch fixes apt_consumer_task and apt_poller_task. About the mpf_scheduler: As far as I understand from timer_thread_proc, mpf_scheduler tries to stabilize the time_drift here by adjusting the time to sleep. But it still depends on the system time as the reference time. So I guess we can also make an improvement on that part. Or am I misreading something there too?

Mert

achaloyan commented 4 years ago

I think there is not too much we can do in the mpf_scheduler as far as clock drifts are concerned. Everything should be properly handled as is.

BTW, this particular routine is used for Linux only. If I am not mistaken, you are on Windows.

Mert84 commented 4 years ago

All of our customers are currently on Windows, but we are moving our deployments to kubernetes and we are ditching Windows during that process. So both platforms have become important for us.

I've read mpf_scheduler. I think when the system_time is adjusted this will either stop sending rtp for some time or will try send too many frames in short time(depending on the clock, if it is adjusted to the future or to the past)

I think we may prevent this by sanitizing the clock in mpf_scheduler. So the following code

    time_now = apr_time_now();
    time_drift += time_now - time_last - timeout; 

becomes

    apr_interval_time_t time_passed;
    apr_interval_time_t resolution_msec = scheduler->resolution * 1000;
    apr_interval_time_t max_clock_error_threshold = resolution_msec * 10;

.....

    time_now = apr_time_now();
    time_passed= time_now - time_last;
    if(time_passed< 0)
    {
        time_passed= resolution_msec;
    }
    else if(time_passed>= max_clock_error_threshold )
    {
        time_passed= max_clock_error_threshold ; //or should I prefer resolution_msec for this sanitization? 
    }

    time_drift += time_passed- timeout; 

Edit: changed variable naming convention in the sample.

Mert84 commented 4 years ago

Hi Arsen, I've been debugging this for the last two days. As far as I understand clock drift may have caused some delays in the timeout durations, but this has not led to problems in the successful/fast operations. The patch also seems good for better handling of this issue.

Since I could not change my system time on the company machine (due to some policy at company management software). I've performed my observations by renaming apr_time_now to apr_timenow in apt_poller_task.c and introducing drifts myself at certain points. If I can disable the policies in the future I may also test by changing my system time and see if this changes anything.

Regards

achaloyan commented 4 years ago

Hi Mert,

Thanks for your input on this issue. The patch I provided should not introduce any regression at least. The problem is the change is in one of the core routines and I do not feel comfortable applying the change right now, but will consider revisiting this task before producing a new UniMRCP release, which is postponed and now scheduled for Q1, 2020.

Regards