monome / norns

norns is many sound instruments.
http://monome.org
GNU General Public License v3.0
633 stars 147 forks source link

sclang crash on system time sync #399

Open tehn opened 6 years ago

tehn commented 6 years ago

sclang crashes when time/date change by a substantial amount

samdoshi commented 6 years ago

What are you using to set the time on network connection? systemd-timesyncd or a full NTP client?

Does the hardware have an RTC?

tehn commented 6 years ago

the hardware does not have an RTC. it didn't occur to me that this would ever be an issue unfortunately.

samdoshi commented 6 years ago

Okay. Apparently you can use fake-hwclock to stop the time coming back up as the UNIX epoch (1st Jan 1970), instead it will come up with the time set the same as when Norns was last powered down.

Something like chrony (an NTP client) will slew the clock rather than step it, but that's more designed to handle differences of a few seconds.

I guess the question is what counts as a "substantial amount"? If the issue is that it crashes when going from 1st Jan 1970 to the current date, but not if jump is only a few weeks/months then fake-hwclock might be good enough.

Otherwise someone will need to fire up gdb and figure out why sclang is crashing.

This might be handy for setting up a reproducible crash on a desktop machine:

https://serverfault.com/questions/138325/faking-the-date-for-a-specific-shell-session

catfact commented 6 years ago

@artfwo has already done a lot of work on this including with fake-hwclock. He's out for the week at Linux audio conference. Maybe will have more to add. oddly, spoofing the clock doesn't crash sclang (necessarily) but does mess up various ugens in various ways

samdoshi commented 6 years ago

Bother. Tried using libfaketime to crash sclang:

$  FAKETIME_STOP_AFTER_SECONDS=10 faketime "1970-01-01" sclang 

(skipped some output)

*** Welcome to SuperCollider 3.9.3. *** For help type ctrl-c ctrl-h (Emacs) or :SChelp (vim) or ctrl-U (sced/gedit).
qt5ct: palette support is disabled
sc3> thisThread.seconds
-> 7.212063747
sc3> thisThread.seconds
-> 1528231717.044
sc3> 
Exiting sclang (ctrl-D)
sc3> main: waiting for input thread to join...
main: quitting...
cleaning up OSC

The 2 calls to thisThread.seconds are a few seconds apart. No crash.

catfact commented 6 years ago

did you have a server running though? the crash is in scsynth, not sclang

[now seems like i was wrong about that, sorry]

catfact commented 6 years ago

and it might be related to other stuff we're doing, like grabbing bus values with the shared memory interface

samdoshi commented 6 years ago

I did try booting the server, but not in depth. I've edited the bug title to say scsynth instead of sclang.

Just tried on my desktop again (no Norns yet).

$  FAKETIME_STOP_AFTER_SECONDS=30 faketime "1970-01-01" sclang

(skipped output)

sc3> thisThread.seconds;
-> 6.12997873
sc3> s.boot;
-> localhost
sc3> booting server 'localhost' on address: 127.0.0.1:57110

(skipped output)

sc3> thisThread.seconds;
-> 15.170008094
sc3> x = {SinOsc.ar}.play;
-> Synth('temp__0' : 1000)
sc3> thisThread.seconds;
-> 19.714078338
sc3> thisThread.seconds;
-> 1528310138.1781
sc3> x.free;
-> Synth('temp__0' : 1000)
sc3> x = {SinOsc.ar}.play;
-> Synth('temp__1' : 1001)
sc3> x.free;
-> Synth('temp__1' : 1001)
sc3> thisThread.seconds;
-> 1528310155.1541
sc3> 
Exiting sclang (ctrl-D)
sc3> main: waiting for input thread to join...
main: quitting...
server 'localhost' disconnected shared memory interface
'/quit' sent

cleaning up OSC

No crash with a SinOsc playing over the time jump.

I think someone that can observe a crash will need to try and minimise it. If you've got libfaketime available that should (hopefully) let you automate a crash. That would be my first step.

edit: I probably should say I can't guarantee that libfaketime is running properly inside the scsynth process that's started by sclang, I could start an external server, but I think it's easier for someone that can observe the existing crash to take over.

artfwo commented 6 years ago

this is reproducible only in part, even when you change the actual system time using date -s. i do not experience crashes, but the engine breaks in weird ways instead: doesn't react to commands, doesn't correctly report polls, etc. as @catfact described above.

and yeah, raspbian is already using fake-hwclock+systemd-timesyncd out of the box. we haven't touched those.

catfact commented 6 years ago

updated title since were not actually sure what the crash mechanism is, only that it can be (very sporadically) reproduced when enabling wifi

artfwo commented 6 years ago

it's related to timesync, this can be verified by watching syslog when the crash happens. it's also clear that the sporadicity of the bug decreases if you hold the box switched off for significant (hours) amount of time.

simonvanderveldt commented 6 years ago

Can confirm it's related to the time change, can reproduce this way:

# make sure you've synced time once, so systemd-timesyncd has a timestamp stored
sudo systemctl stop systemd-timesyncd.service
sudo date -s '2018-05-24 08:00:00'
sudo systemctl restart norns-crone.service
sudo systemctl start systemd-timesyncd.service
# now run top and see sclang take 100% CPU

This also happens why I just run sclang without the ws-wrapper, so it seems it is caused by SC and not by our code.

The odd thing is that manually changing the date doesn't seem to trigger this issue. I'm not sure why this yet. Maybe it's related to how systemd-timesync works/changes the time? There is mention of gradually adjusting the time here https://www.freedesktop.org/software/systemd/man/systemd-timesyncd.service.html

This minimalistic service will set the system clock for large offsets or slowly adjust it for smaller deltas Maybe that doesn't work too well with sclang?

One other odd thing, I've ran into the following a couple of times:

Warning: systemd-timesyncd.service changed on disk. Run 'systemctl daemon-reload

Not sure what's causing it. I did a diff and the unit that's loaded is no different than the unit on disk :confused:

samdoshi commented 6 years ago

Apparently systemd-timesyncd will perform a similar function to fake-hwclock

It also saves the local time to disk every time the clock has been synchronized and uses this to possibly advance the system realtime clock on subsequent reboots to ensure it monotonically advances even if the system lacks a battery-buffered RTC chip.

https://www.freedesktop.org/software/systemd/man/systemd-timesyncd.service.html

I'm wondering if there is some weird interplay between them.

simonvanderveldt commented 6 years ago

So, it seems this is a bug in supercollider where if the system time changes a large enough amount it'll consume 100% CPU for a longer period which will eventually result in a kernel panic.

I've been able to replicate this on my laptop as well, so it's not related to ARM/the CM3. How to replicate on norns:

sudo systemctl stop systemd-timesyncd
sudo date -s '2018-06-01 00:00:00'
sudo systemctl restart norns-crone
sudo date -s '2018-06-01 04:00:00'

When using smaller time delta's the following happens:

I'll file a bug for SuperCollider for this.

I think for now the simplest way to address this is to disable timesync/systemd-timesyncd. This does mean we won't have timesync of course, but I don't think the system time is used in many places. @artfwo mentioned system time is used for naming recordings, we should probably change that to use something else like an incrementing number for this.

catfact commented 6 years ago

forgive my ignorance, but would that be the same system timed used by posix clock_gettime()?

we use that in metronomes: https://github.com/monome/norns/blob/master/matron/src/metro.c#L218

we also use gettimeofday() for lua timing https://github.com/monome/norns/blob/master/matron/src/weaver.c#L1127

simonvanderveldt commented 6 years ago

@catfact Someone who knows C should probably answer that, but I think it sometimes is, depending on the first argument passed to it, see https://stackoverflow.com/a/12480485

catfact commented 6 years ago

thanks for checking. Ok, looks like metros should be fine (CLOCK_MONOTONIC) and get_time() will need adjustment (wall clock)

scztt commented 6 years ago

Can someone who has repro'd attach an actual crash stack to this issue?

catfact commented 6 years ago

i'll keep trying but haven't been able to get a nice stack trace or even a proper crash actually.

what i get is arbitrarily long hangs where the server is just totally unresponsive. sclang output (in syslog since we're running it with systemd) is silent except when s_new, &c is failing. nothing in kern.log or /var/crash.

next will try restructuring things to work with a remote server and launching scsynth explicitly, maybe can catch more of its output.

simonvanderveldt commented 6 years ago

@scztt @catfact the steps I posted above consistently cause a kernel panic for me on norns. Is that not the case for you?

catfact commented 6 years ago

no, not getting kernel panics, just hangs. like last time i tried, it was running the awake script - and i got

maybe i should take a snapshot of my filesystem since i haven't actually been using the update/sync process.

:/

tehn commented 6 years ago

i've seen this jam-up of events happen typically with WIFI activation or attempted activation.

this also happened with the old update method, where a long (20 seconds) os.execute was run, effectively stalling lua execution, and then all the metros would catch up and trigger at once. (os.execute should be run in a coroutine or the script should fork immediately, but the update routine has changed so this isn't an issue any longer)

tehn commented 6 years ago

@simonvanderveldt @artfwo is there a strong reason not to simply disable systemd-timesyncd so we can move past this in the short term?

simonvanderveldt commented 6 years ago

@simonvanderveldt @artfwo is there a strong reason not to simply disable systemd-timesyncd so we can move past this in the short term?

@tehn I've been running with it disable for the last two weeks. Haven't noticed any issues because of the time being out of sync. There might be some funky issues eventually, mainly looking at security/encryption related things which can depend on some form of a somewhat sane time. But for example installing deb packages worked fine.

@artfwo I believe you were mainly concerned about stuff like log ration, right?

ngwese commented 6 years ago

it would be worth testing the usb sync feature - iirc the rsync command being used is trying to do deltas in order to minimize the amount of data being synced. what i’m not sure of it whether rsync is using file modification timestamps as part of its selection criteria.

scztt commented 6 years ago

(Non-local) OSC bundles require time sync to work - if you ever have an external device sending timing-sensitive OSC messages (or, for that matter, any OSC bundles that have a timestamp at all, whether or not you really care about timing), it'll be broken if your clocks aren't synced. Maybe not an immediate need, but definitely something to be aware of.

tehn commented 6 years ago

usb sync should be fine given fake-hwclock is running?

good point re: OSC bundles

artfwo commented 6 years ago

perhaps we should figure out if supernova works on norns (or not). this will contribute data to making a decision about disabling timesync.

TheTechnobear commented 6 years ago

time is pretty vital for network activity, e.g. things like git pull will fail too due to invalid certs

could a simple solution be to

(or some variation on the above, trick here is to take manual control of when the clock is changing when a network is started vs crone/sc)

the only disadvantage of this approach , is net time will drift, but that's probably not a big issue. (you could actually start timesyncd, once you have net time, as then you know the time changes are small, due to drift)

p.s. ive not seen this on a rPI3 (sorry don't have a norns to test) , so id assuming some of this is perhaps dependent upon start up order/timing (e.g. how fast do you get your ntp response, is it before SC starts)

tehn commented 6 years ago

we were hoping switching to supernova would fix this problem but it seems to have the same symptom.

many devs have vocalized preference to keep timesyncd

i propose:

simonvanderveldt commented 6 years ago

we were hoping switching to supernova would fix this problem but it seems to have the same symptom.

Did someone test this already? I didn't see any comments about it in #464 and #435 Based on looking at the code it should not have the same issues since it uses a different timer object, but maybe there's something else causing issues.

tehn commented 6 years ago

yes, i've been testing. but further testing would certainly be appreciated, in case something was missed.

catfact commented 6 years ago

i want to seriously rule out the possibility that this could actually be a problem with nanomsg websockets. i've been trying pretty hard all weekend and have yet to encounter it without using them.

my test is to changestart.sh to use direct calls to matron and sclang. of course this means maiden doesn't work, but everything else does.

doing this, i've had no errors so far when turning on wifi or connecting an SSH session (which are the usual triggers for me, in addition to connecting to maiden.)

even if sclang isn't running under the WS wrapper, if nanomsg is behaving badly with timesync and blocking up ip_recv or other bits of the UDP stack, the consequences are far-ranging and will certainly mess up normal sclang/supernova operations.

simonvanderveldt commented 6 years ago

@catfact did you test this with the supernova build or with the old/existing scsynth build?

catfact commented 6 years ago

with supernova

part of my suspicion is that i just had to spend some time last week tearing out nanomsg from another project, since we were getting very occasional segfaults from it (reported on hockey.) so now i'm mad

catfact commented 6 years ago

more data points:

antonhornquist commented 6 years ago

allow matron to start/stop crone

it's only the sc server (scsynth or supernova) crashing, right? if so, won't eval of s.quit and s.boot in SCLang do, rather than a complete restart of crone?

EDIT: and perhaps thisProcess.recompile

simonvanderveldt commented 6 years ago

@catfact Ah, that's a good point, man clock_gettime says:

CLOCK_MONOTONIC
  Clock  that cannot be set and represents monotonic time since some unspecified starting point.  This clock is not affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the clock), but is affected by the incremental adjustments performed by adjtime(3) and NTP.

scsynth uses std::chrono::system_clock::now() https://github.com/supercollider/supercollider/blob/Version-3.9.3/server/scsynth/SC_Time.hpp#L39

For supernova I'm not sure tbh, seems to be all over the place

~/s/t/s/s/supernova:develop ◼ ❯ grep -R -i 'clock' .                          
./utilities/high_resolution_timer.hpp:        int status = clock_gettime(CLOCK_REALTIME, &start_time);
./utilities/high_resolution_timer.hpp:        int status = clock_gettime(CLOCK_REALTIME, &now);
./audio_backend/jack_backend.hpp:        engine_functor::log_("Jack: xrun detected - resyncing clock\n");
./audio_backend/jack_backend.hpp:            engine_functor::sync_clock();
./audio_backend/portaudio_backend.hpp:            engine_functor::sync_clock();
./audio_backend/portaudio_backend.hpp:            engine_functor::sync_clock();
./server/server_args.cpp:        ("use-system-clock,C", value<uint16_t>(&use_system_clock)->default_value(0), "type of clock sampleclock=0 systemclock=1")
./server/server_args.hpp:    uint16_t use_system_clock;
./server/main.cpp:    auto start_time = std::chrono::high_resolution_clock::now();
./server/main.cpp:    auto end_time = std::chrono::high_resolution_clock::now();
./server/server.cpp:    use_system_clock = (args.use_system_clock == 1);
./server/server.cpp:    if(instance->use_system_clock){
./server/server.cpp:        double nows = (uint64)(OSCTime(std::chrono::system_clock::now())) * kOSCtoSecs;
./server/nrt_synthesis.hpp:        auto start_time = steady_clock::now();
./server/nrt_synthesis.hpp:        auto end_time = steady_clock::now();
./server/server.hpp:    static inline void sync_clock(void);
./server/server.hpp:    bool use_system_clock;
./server/server.hpp:inline void realtime_engine_functor::sync_clock(void)
./server/server.hpp:    if(instance->use_system_clock){
./server/server.hpp:        double nows = (uint64)(OSCTime(std::chrono::system_clock::now())) * kOSCtoSecs;
./server/server.hpp:    if(instance->use_system_clock){
./server/server.hpp:        //time_tag nows = time_tag::from_ptime(boost::date_time::microsec_clock<boost::posix_time::ptime>::universal_time());
./server/server.hpp:        double nows = (uint64)(OSCTime(std::chrono::system_clock::now())) * kOSCtoSecs;
./dsp_thread_queue/dsp_thread_queue.hpp:            auto start = high_resolution_clock::now();
./dsp_thread_queue/dsp_thread_queue.hpp:            auto end  = high_resolution_clock::now();
./sc/sc_osc_handler.hpp:#include <boost/date_time/microsec_time_clock.hpp>
./sc/sc_osc_handler.hpp:        now = time_tag::from_ptime(boost::date_time::microsec_clock<boost::posix_time::ptime>::universal_time());
./sc/sc_plugin_interface.cpp:        auto now = std::chrono::high_resolution_clock::now();
catfact commented 6 years ago

it's only the sc server (scsynth or supernova) crashing, right?

this is the theory. but i'm not 100% convinced and i don't think it's wise to make assumptions and promises that we can fix this by messing with the scsynth/supernova lifecycle... i find weird slowdowns and high usage from both processes under websockets. (and timers in matron could also get screwy, since CLOCK_MONOTONIC is not as immune to NTP jumps as i had assumed.) but yes:

if so, won't a s.quit and s.boot do, instead of a complete restart of crone?

audio.restart() in lua will do this (by forcing a recompile of the interpreter, which also quits and relaunches the server process.) of course if sclang is totally unresponsive then both SC processes need to be forcibly stopped.

@simonvanderveldt hmm ok thanks. so looks to me like:

sync_clock() either calls std::chrono::system_clock::now() (with -C 1 argument) or just keeps its own internal timer (with -C 0, the default)

the fact that issues seem persistent with the "sample clock" mode makes me think this is not the real problem.

tehn commented 6 years ago

running top shows sclang at 100%

On Mon, Aug 13, 2018 at 1:37 PM ezra buchla notifications@github.com wrote:

it's only the sc server (scsynth or supernova) crashing, right?

this is the theory. but i'm not 100% convinced. i find weird slowdowns and high usage from both processes. but yes:

if so, won't a s.quit and s.boot do, instead of a complete restart of crone?

audio.restart() in lua will do this (by forcing a recompile of the interpreter, which also quits and relaunches the server process.)

@simonvanderveldt https://github.com/simonvanderveldt hmm ok thanks. so looks to me like:

  • supernova "sample clock" vs "system clock" argument determines how supernova syncs clock in the realtime callbacks, e.g. in jack_backend.hpp

    int perform(jack_nframes_t frames) { if (unlikely(!time_is_synced)) { engine_functor::sync_clock(); time_is_synced = true; }

sync_clock() either calls std::chrono::system_clock::now() (with -C 1 argument) or just keeps its own internal timer (with -C 0, the default)

the fact that issues seem persistent with the "sample clock" mode makes me think this is not the real problem.

in any case, this doesn't affect the way the OSC subsystem determines time tags. that's thia gnarliness:

ime_tag::from_ptime(boost::date_time::microsec_clock::universal_time());

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/monome/norns/issues/399#issuecomment-412601459, or mute the thread https://github.com/notifications/unsubscribe-auth/AAPEcD7oCBd0mZYKzP9MfTssPlZmQHmoks5uQblsgaJpZM4Ua3Gs .

catfact commented 6 years ago

right... i'm saying, that doesn't sound like a problem with supernova or scsynth. it sounds like either 1) something really borked with OSC timestamps (sclang/scynth require keepalive OSC traffic to keep functioning) or 2) something related to the old "virtual terminal" bug (https://github.com/supercollider/supercollider/issues/2144)

i just wanna be really clear about what i'm seeing:

this seems pretty obviously a problem with sclang getting stuck somewhere. presumably with OSC, but i'm not totally sure it isn't actually in a socket-reading loop, somehow (e.g. nanomsg getting confused about the state of the websocket.)

simonvanderveldt commented 6 years ago

if i change crone.sh and matron.sh to run matron and sclang without ws-wrapper, i don't get this. it would be super cool if anyone could try the exact same thing. i am going to keep trying a little more carefully, waiting longer with the power off to force a large jump in NTP.

@catfact you should be able to reproduce this using the steps I wrote down here https://github.com/monome/norns/issues/399#issuecomment-395967939 Then you don't have to waste time waiting between poweroff and poweron :)

I'll give it a try this week as well. The only thing you've done is remove ws-wrapper from the two scripts?

catfact commented 6 years ago

yes. though reading back it sounds like maybe you were able to reproduce on non-ARM system just with plain sclang (?)

in any case thanks for the reminder, i'll try that

catfact commented 6 years ago

update: ok with a big enough delta i still get the problem. damn

i guess i can implement the workaround described above. (when enabling wifi, first stop SC, then manually run timesync, then start SC.)

simonvanderveldt commented 6 years ago

yes. though reading back it sounds like maybe you were able to reproduce on non-ARM system just with plain sclang (?)

Correct, I was able to reproduce the issue on my desktop using just supercollider (though I don't remember if it was sclang or scsynth that went to 100% CPU)

scztt commented 6 years ago

If anyone is building scsynth for norns and wants to test - a simple fix on the scsynth side (for the 100% cpu problem at least, and possibly the crashes) would be to add a check near here to detect cases where nextTime is extraordinarily out-of-alignment with the expected next time (meaning a sequence of times like 11, 12, 13, 14, 56 ...). If that case occurs, simply report a warning and call mScheduler.Empty(); to clear the scheduler. You'll lose events, but only events that are scheduled at what is probably an incoherent time anyway.

A corresponding check could be made when adding events to mScheduler - simply throw them away and warn if they are drastically out of sync with the current time (e.g. minutes before or after - though after is tough, because the SC event system will schedule node free or \gate=0 events pretty far in advance) - but this seems like a less likely place to catch these problems.

tehn commented 6 years ago

just tested removing ws-wrapper from the equation. (had to use tmux, running screen to uart).

still experienced the time crush after wifi connect

simonvanderveldt commented 6 years ago

@scztt Thanks! We can try that I do expect we're using this one though https://github.com/supercollider/supercollider/blob/cca12ff02a774a9ea212e8883551d3565bb24a6f/server/scsynth/SC_Jack.cpp#L514

[edit] as @catfact said, if we're going to use supernova this probably won't be relevant.

catfact commented 6 years ago

... not if we're using supernova? seems totally different

@tehn agree, i got one too a couple hours ago.

scztt commented 6 years ago

@simonvanderveldt https://github.com/supercollider/supercollider/blob/d0c3e438ec9cefda19af08d01db0b536829458f5/server/supernova/sc/sc_osc_handler.cpp#L730 is the corresponding chunk for supernova.

This fix could possibly be mocked up in a very loose fashion using the /clearSched server message, which should do the same thing. If you could catch the clock change early, send a /clearSched, you might be able to keep incoherent time values out of the scheduler. This wouldn't be a reliable long-term fix, but it'll take five minutes to test and if it fixes then you'd have a pretty solid idea that the more rigorous schedule-clearing solution should work as well. I think you'd want to bundle /clearSched with a latency value of nil, which should cause it to be executed immediately.