LinkClock tests sometimes fail

Environment

SuperCollider version: 3.11.2
Operating system: macOS 10.14
Other details (Qt version, audio driver, etc.):

Steps to reproduce

Setup:

add testsuite/classlibrary to your sclang path
recompile library before running each test

Run tests:

test_newFromTempoClock_reschedulesOldClockQueue - some iterations error out:

(
fork {
~suite = TestLinkClock();
20.do({
    ~suite.test_newFromTempoClock_reschedulesOldClockQueue;
    0.1.wait;
});
}
)

Result:

...
PASS: a TestLinkClock: new - starting a LinkClock with newFromTempoClock should reschedule stream players
Is:
     a LinkClock 
Should be:
     a LinkClock
ERROR: clock is not running.
ERROR: Primitive '_TempoClock_Sched' failed.
Failed.
RECEIVER:
Instance of TempoClock {    (0x1192c1b98, gc=7C, fmt=00, flg=00, set=03)
  instance variables [7]
    queue : instance of Array (0x10ffe2400, size=7, set=8)
    ptr : nil
    beatsPerBar : Float 4.000000   00000000 40100000
    barsPerBeat : Float 0.250000   00000000 3FD00000
    baseBarBeat : Float 0.000000   00000000 00000000
    baseBar : Float 0.000000   00000000 00000000
    permanent : false
}
PATH: /Volumes/data/Dokumenty/2020-2021/supercollider workspace/LinkClock failures.scd

PROTECTED CALL STACK:
    Meta_MethodError:new    0x115871f40
        arg this = PrimitiveFailedError
        arg what = Failed.
        arg receiver = a TempoClock
    Meta_PrimitiveFailedError:new   0x115878500
        arg this = PrimitiveFailedError
        arg receiver = a TempoClock
    Object:primitiveFailed  0x112f4eac0
        arg this = a TempoClock
    SimpleNumber:schedBundleArrayOnClock    0x116e49640
        arg this = 0.016
        arg clock = a TempoClock
        arg bundleArray = [ [ 15, 1016, gate, 0 ] ]
        arg lag = 0.0
        arg server = localhost
        arg latency = nil
        var sendBundle = a Function
    a FunctionDef   0x117315bc0
        sourceCode = "#{
                    var tempo, server, eventTypes, parentType;

                    parentType = ~parentTypes[~type];
                    parentType !? { currentEnvironment.parent = parentType };

                    server = ~server = ~server ? Server.default;

                    ~finish.value(currentEnvironment);

                    tempo = ~tempo;
                    tempo !? { thisThread.clock.tempo = tempo };

                    if(currentEnvironment.isRest.not) {
                        eventTypes = ~eventTypes;
                        (eventTypes[~type] ?? { eventTypes[\\note] }).value(server)
                    };

                    ~callback.value(current...etc..."
        var tempo = nil
        var server = localhost
        var eventTypes = ( 'fadeBus': a Function, 'freeAllocWrite': a Function, 'tree': a Function, 'vst_set': a Function, 
  'on': a Function, 'load': a Function, 'freeBuffer': a Function, 'group': a Function, 'freeAllocRead': a Function, 
  'allocWrite': a Function, 'cue': a Function, 'grain': a Function, 'Synth': a Function, 'vst_midi': a Function, 
  'freeAllocWriteID': a Function, 'alloc': a Function, 'rest': a Function, 'sine2': a Function, 'sine1': a Function, 
  'midi': a Function, 'set': a Function, 'setProperties': a Func...etc...
        var parentType = nil
    a FunctionDef   0x1172d6f00
        sourceCode = "<an open Function>"
    Function:prTry  0x115e071c0
        arg this = a Function
        var result = nil
        var thread = a Routine
        var next = nil
        var wasInProtectedFunc = false

CALL STACK:
    MethodError:reportError
        arg this = <instance of PrimitiveFailedError>
    Nil:handleError
        arg this = nil
        arg error = <instance of PrimitiveFailedError>
    Thread:handleError
        arg this = <instance of Thread>
        arg error = <instance of PrimitiveFailedError>
    Thread:handleError
        arg this = <instance of Routine>
        arg error = <instance of PrimitiveFailedError>
    Object:throw
        arg this = <instance of PrimitiveFailedError>
    Function:protect
        arg this = <instance of Function>
        arg handler = <instance of Function>
        var result = <instance of PrimitiveFailedError>
    Environment:use
        arg this = <instance of Event>
        arg function = <instance of Function>
        var result = nil
        var saveEnvir = <instance of Environment>
    Event:play
        arg this = <instance of Event>
    Event:playAndDelta
        arg this = <instance of Event>
        arg cleanup = <instance of EventStreamCleanup>
        arg mute = false
    EventStreamPlayer:prNext
        arg this = <instance of EventStreamPlayer>
        arg inTime = 0.1
        var nextTime = nil
        var outEvent = <instance of Event>
    < FunctionDef in Method EventStreamPlayer:init >
        arg inTime = 0.1
    Routine:prStart
        arg this = <instance of Routine>
        arg inval = 0.0
^^ The preceding error dump is for ERROR: Primitive '_TempoClock_Sched' failed.
Failed.
RECEIVER: a TempoClock

PASS: a TestLinkClock: new - starting a LinkClock with newFromTempoClock should reschedule functions
Is:
     a LinkClock 
Should be:
     a LinkClock
...

test_newLinkClock_shouldNotChangeLinkTempo - some iterations fail:

(
fork {
~suite = TestLinkClock();
20.do({
    ~suite.test_newLinkClock_shouldNotChangeLinkTempo;
    0.1.wait;
});
}
)

Result:

PASS: a TestLinkClock: new - creating a new clock should not affect existing link session
Is:
     2.5 
Should equal (within range 0.001):
     2.5
PASS: a TestLinkClock: new - creating a new clock should not affect existing link session
Is:
     2.5 
Should equal (within range 0.001):
     2.5

FAIL: a TestLinkClock: new - creating a new clock should not affect existing link session
Is:
     1.0 
Should equal (within range 0.001):
     2.5

PASS: a TestLinkClock: new - creating a new clock should not affect existing link session
Is:
     2.5 
Should equal (within range 0.001):
     2.5
...

test_LinkClock_sync_meter_aligns_barlines - some iterations error out or fail:

(
fork {
~suite = TestLinkClock();
20.do({
    ~suite.test_LinkClock_sync_meter_aligns_barlines;
    0.1.wait;
});
}
)

Result:


FAIL: a TestLinkClock: new - Count of successful trials should = number of trials
Is:
     4 
Should be:
     5

ERROR: LinkClock peers disagree on barline positions; cannot sync barlines
CALL STACK:
    Exception:reportError
        arg this = <instance of Error>
    Nil:handleError
        arg this = nil
        arg error = <instance of Error>
    Thread:handleError
        arg this = <instance of Thread>
        arg error = <instance of Error>
    Thread:handleError
        arg this = <instance of Routine>
        arg error = <instance of Error>
    Object:throw
        arg this = <instance of Error>
    MeterSync:prGetMeter
        arg this = <instance of MeterSync>
        arg replies = [*2]
        arg round = 1.0
        var bpbs = <instance of Set>
        var baseBeats = <instance of Set>
        var denom = 1
        var newBeatsPerBar = 4
        var newBase = nil
    < FunctionDef in Method MeterSync:resyncMeter >  (no arguments or variables)
    Routine:prStart
        arg this = <instance of Routine>
        arg inval = 11.724260061
^^ The preceding error dump is for ERROR: LinkClock peers disagree on barline positions; cannot sync barlines

Expected vs. actual behavior

Some tests in TestLinkClock either fail occasionally or throw errors. On my system this can only be observed when running multiple iterations of each test.

@jamshark70 do you have any ideas about these failures?

do you have any ideas about these failures?

5 second question, 45 minutes' research :laughing:

test_newFromTempoClock_reschedulesOldClockQueue -- There's some odd problem with the Pbind in this test sending messages -- somehow one of the messages is still on TempoClock, after that clock has been stopped. I can't figure out why. In any case, there's no need for the stream player to be generating OSC messages at all, so: if I change the one line to streamplayer = Pbind(\type, \rest, \dur, 0.02).play(tempoClock);, then I get no failures.

test_newLinkClock_shouldNotChangeLinkTempo -- consider this:

(
var i = 0;

f = { |tempo = 1|
    var new;
    var count;
    i = i + 1;
    count = i;
    new = LinkClock(tempo);
    SimpleController(new)
    .put(\tempo, {
        new.tempo.debug("clock % tempo".format(count))
    });
    new
};
)

a = f.(2.5);  // no tempo post
b = f.(1);  // quickly posts tempo

a.stop; b.stop;

If I hit a =, wait at least a second (or 2), then hit b =, b's posted tempo is always the expected 2.5. But if I hit them fairly quickly, even half a second between them, then b posts tempo 1 fairly often.

The likely explanation is that Ableton Link doesn't guarantee sync within a certain amount of time. The usual use case is that you open a music app, and the app will create one Link peer, and this peer is expected to be running for several minutes or hours. So it isn't important to have absolutely instantaneous handshaking: in most cases, you will set up the clock, and then start playing at least two or three seconds later.

But here, we crash into the requirements of our test suite, where we don't want individual tests to take too long. (Also, I think in this scenario, stress testing is largely useless because creating a Link is not something that you ever really stress, within one app.)

My opinion is that it is impossible to test LinkClock in pretty much any capacity according to the usual timing recommendations. This test is designed for a 1.0-second cycle and even that is not long enough. If I change the two wait statements to 1.0.wait instead of 0.5, then the failures disappear. But code reviews in the past were rather stern about 2000-ms tests.

test_LinkClock_sync_meter_aligns_barlines -- The test is supposed to create one reference clock -- then, a test clock should sync to it -- then we check that it synced, and delete the test clock, and try another test clock. So at any moment, there should be only one reference clock and one test clock.

"ERROR: LinkClock peers disagree on barline positions" means that at the moment of trying to sync a test clock, there exist at least two reference clocks. (It's impossible to have a barline conflict with only one.) The test code is pretty clearly doing testClock.stop; before looping back to create the next one, so this should never happen. So -- maybe it takes a short time to dispose of the previous test clock? Or...? I dunno... Maybe try adding a short wait after stopping the test clock.

FWIW I can't reproduce the failure with this test, on multiple runs of your 20-cycle loop.

Thank you @jamshark70 for looking into this!

I'll try to implement more relaxed timing (with a note that the test(s) are unusually long for the reasons you've provided) to see if we can get these tests back in. I'd like to do that partially because I wanted to have something to test https://github.com/supercollider/supercollider/pull/5416 with :)

Update: I thought that maybe the LinkClock tests that fail when run inside the test suite are the ones that use the Semaphore object. ~~I'm speculating that there might be an interaction between routines and try {} statements of our test runner (or maybe some other aspect of that script?) and the Semaphore behavior.~~ Ultimately, I don't know what causes these issues. Here's an example output of one of the errors:

``` Test: test_settingTempo_maintainsSyncOver20To999BPM writing to file: /Users/runner/work/supercollider/supercollider/testsuite/scripts/run/gha_test_run.json ERROR: Message 'prNumPeersChanged' not understood. RECEIVER: Instance of Frame { (0x7fea29f6db88, gc=B0, fmt=01, flg=00, set=03) indexed slots [8] 0 : instance of Method ArrayedCollection:do (0x7fea34ad3580) 1 : Integer 0 2 : Frame (0x7fea29f6db88) of ArrayedCollection:do 3 : Frame (0x7fea29f6db88) of ArrayedCollection:do 4 : RawPointer 0x7fea34ad3969 5 : ""test_asSecs_stringSsSss_convertsToSeconds"" 6 : instance of Function (0x7fea2389cc28, size=2, set=2) 7 : Integer 43 } ARGS: Integer 0 PROTECTED CALL STACK: Meta_MethodError:new 0x7fea34006780 arg this = DoesNotUnderstandError arg what = nil arg receiver = Frame (0x7fea29f6db88) of ArrayedCollection:do Meta_DoesNotUnderstandError:new 0x7fea34008a40 arg this = DoesNotUnderstandError arg receiver = Frame (0x7fea29f6db88) of ArrayedCollection:do arg selector = prNumPeersChanged arg args = [ 0 ] Object:doesNotUnderstand 0x7fea333bf280 arg this = Frame (0x7fea29f6db88) of ArrayedCollection:do arg selector = prNumPeersChanged arg args = nil CALL STACK: DoesNotUnderstandError:reportError arg this = Nil:handleError arg this = nil arg error = Thread:handleError arg this = arg error = Object:throw arg this = Object:doesNotUnderstand arg this = Frame (0x7fea29f6db88) of ArrayedCollection:do arg selector = 'prNumPeersChanged' arg args = [*1] ^^ ERROR: Message 'prNumPeersChanged' not understood. RECEIVER: Frame (0x7fea29f6db88) of ArrayedCollection:do ERROR: Message 'prNumPeersChanged' not understood. RECEIVER: Instance of Frame { (0x7fea27d275b8, gc=B0, fmt=01, flg=00, set=03) indexed slots [7] 0 : instance of Method String:+ (0x7fea349fd3c0) 1 : Integer 0 2 : Frame (0x7fea27d275b8) of String:+ 3 : Frame (0x7fea27d275b8) of String:+ 4 : RawPointer 0x0 5 : ""suite":" 6 : ""TestString"" } ARGS: Integer 0 CALL STACK: DoesNotUnderstandError:reportError arg this = Nil:handleError arg this = nil arg error = Thread:handleError arg this = arg error = Object:throw arg this = Object:doesNotUnderstand arg this = Frame (0x7fea27d275b8) of String:+ arg selector = 'prNumPeersChanged' ```

BTW I have tried suggested fixes and while they did seem to work for running multiple tests in a row locally, they did not prevent the errors from occurring in the CI.

I tried to turn some of the tests back on in #5702 but ultimately failed, since the test suite kept freezing when running these tests, e.g. https://github.com/supercollider/supercollider/runs/4875282243?check_suite_focus=true

https://github.com/supercollider/supercollider/issues/232

~~Try within a loop is~~ Loops within try not reliable, and it's not going to be reliable until that bug is fixed. Unfortunately nobody knows how.

What is odd about this specific error is that prNumPeersChanged is called only from the backend -- never as a method invocation from the language.

    void OnNumPeersChanged(std::size_t numPeers) {
        gLangMutex.lock();
        g->canCallOS = false;
        ++g->sp;
        SetObject(g->sp, mTempoClockObj);
        ++g->sp;
        SetInt(g->sp, numPeers);
        runInterpreter(g, getsym("prNumPeersChanged"), 2);
        g->canCallOS = false;
        gLangMutex.unlock();
    }

We lock the interpreter, push the LinkClock object, push an integer, and call the method -- but the interpreter applies the message to a totally unrelated object. I can't explain that other than something is mucking around with the stack between the backend calling runInterpreter and the interpreter actually dispatching the message... but this should not happen because the interpreter is locked.

supercollider / supercollider