crone-softcut: finalize structure / features for 2.0

catfact commented 5 years ago

discussion so far has produced this task list. kinda putting in order of complexity (low->high).

(checking these as they are added to open PRs. doesn't mean they are fully tested.)

(current tracking PR: #705)

[x] disable post-resampling LPF
[ ] add additional playback SVF (and commands)
[ ] (maybe) move softclipper before resampler
[ ] add softclipper config commands
[x] add playback flag, and command
[ ] (maybe) add CPU usage measurement to softcut client
[x] add second buffer, command to select buffer per voice
[x] add read_stereo command
[x] add "sync/desync voice" function, and commands
[x] conditionally disable mixing of voice IO busses based on flag status
[ ] add de-clicking for head crossing (duck or switch+ramp)

this issue is for discussion of proposed feature additions / structural changes to the softcut effect client, in particular regarding their performance impact.

first, a high-level overview of the dsp graph for a single softcut voice looks something like this:

softcut-voice

profiling

i used prof to sample the entire crone process with 4 voices. each voice was setup in a simple echo configuration and running at a high speed (rate = 8.0). this is important because it essentially means that every process between the resample block and the actual buffer write is running at 8x the audio sample rate.

the flame chart for the crone process looks like this: crone-flame-chart

and a more detailed view of the call stack for the softcut client looks like this:

softcut-client-call-tree

some notable results:

(all percentages are all given in terms of the total crone process).

simply mixing all the busses in crone is pretty expensive: ~27% just on softcut inputs. this might seem weird, but it makes sense; there is a full NxN matrix with audio-rate level smoothing between all softcut I/O points. the inner loop is fairly well optimized; further optimizations would involve keeping a lot more state (buffers of amplitude level envelopes.) we could also totally disable processing of mix points that are zeroed, but this doesn't help the worst case.
the SVF is quite cheap: ~2%. there would be no problem at all adding a second SVF stage to each voice output. (why an audio-rate LPF needed at the input is a longish story; basically it has to be there and has to have the ability to modulate according to the write-phasor rate; otherwise you are essentially downsampling recorded material to SR=0 when sweeping the phasor speed through zero.)
the "brickwall" lowpass, on the other hand, is quite expensive: ~12%! it is not a particularly expensive algo (basic fixed 5th-order butterworth at half-nyquist or so); this high usage is because it is applied after effectively upsampling the input.
the soft clipper is also fairly heavy: ~8%. this is actually a fairly lightweight process: something like ~10 intructions / sample; but again it is effectively running with 8x oversampling.

so, i would eliminate the fixed lowpass. it's intention is to help with possible resampling artifacts; but, the resampler is using a hermite interpolator and is pretty clean; i don't think the additional stage makes much difference.

i would like to keep the soft clipper; i think it is important to gracefully limit signal level right before it hits the buffer, and exposing its gain/knee parameters could make for interesting musical effects.

input resampling before write: ~18%, making this the single largest computation - as expected. (the total load for the "poke" function is ~40%, but this includes the LPF and sofclip described above.)
"peek" function: ~4.5%. the actual hermite interpolation has a tiny footprint of ~0.5%; the rest is lookup and indexing. (could probably be improved a bit.)
crossfade mixing: ~2.5%.

playback versus record

as shown, computing the record head is massively more expensive than computing the play head - at least in this "worst case" scenario where the phasor speed is high. (cost for recording varies ~linearly with speed; cost for playback is ~fixed.) in this test (including the stupid LPF) the difference is a factor of 10x

so, there are a few related proposals for leveraging this fact to increase total voice count: 1) make some voices that cannot be used for recording 2) totally decouple record heads from play heads, with some mechanisms for locking them in sync 3) just have lots of voices, with the understanding that not all can be enabled for recording simultaneously. any voice can have record and/or playback disabled at any time. (there is already a record flag; adding a play flag would be simple.)

its important to keep in mind the cost of more points in the softcut matrix mixer, which is not inconsiderable.

solution (1) is probably the simplest. if there are N play+record voices, and M playback-only voices, then the matrix mixer can be reconstructed for N inputs and M+N outputs.

solution (2) is attractive, but involves a bigger refactor and the sync mechanism isn't totally clear in my mind. (but see below.)

for solution (3) we would also want to make the matrix mixer able to disable given rows/columns when corresponding record/play function is toggled. but that shouldn't be too big a deal.

stereo versus mono

a common request is for applications like mlr to operate on stereo files/signals. the way i would implement this is to have a subclass of softcut head that performs all read/write/mix as usual, but doesn't compute any of it's own position; instead it is always sync'd to another head and just points at a different buffer (or position in buffer, potentially) and uses a different I/O bus.

the relevant takeway w/r/t performance for this kind of feature, is that the positional logic is actually a pretty insignificant part of the computation cost.

solution (2) above would dovetail nicely with this feature since i think it could use a similar or identical sync mechanism.

in fact now that i think of it, solution (3) plus the sync feature would maybe cover all the use cases: any combination of mono/stereo voices, and any combination of record/play heads, within CPU constraints.

play head ducking

i'd like to add a feature whereby each play head "ducks" its level whenever its position approaches that of any record head. this works to eliminate a class of glitches (well, replacing them with smooth dropouts, but other approaches quickly become very complex.) however, the cost of computing this increases exponentially with the count of (active?) record heads. i'll need to try an implementation here to get a baseline performance hit, but it's something to bear in mind.

ok, phew.. please feel free to leave feedback here. in the meantime i will probably disable the post-resampling LPF and add a post-playback SVF, for a net savings of 16% or so on CPU load.

tehn commented 5 years ago

thank you for the excellent assessment (with rad graphix)

i think option #3 is very sensible. particularly with the conditional muting of rows in the matrix mixer with play_enable.

agreed re: softclip. seems very easy to digitally overload inputs with all of the potential mix inputs... and it'd be nice to have that sound nice

re: sync for stereo vs. mono. curious what you propose for syntax. some sort of fixed channel arrangement? ie sync 1-2, 3-4, etc.

re: play head ducking. this is a very attractive feature. i understand the cpu ramifications of multiple rec heads on.

aside #1: would it makes sense to simply limit the number of rec heads that can be enabled at once? ie, others would have to be disabled before additional could be enabled. there's something annoying about this, of course.

aside #2: it occurred to me that perhaps 8x rate is overkill and adds a more substantial worse-case CPU to all scenarios? there are workarounds if extreme-pitch-shifting is desired ie rec at 0.25x then play at 4x or course.

it seems difficult/impossible to design maximal flexibility without introducing constraints such as above or trusting the user script (and educating about over-cpu cases.) theoretically matron could monitor specifically crone's cpu usage as a guide?

you instinct to remove post-resample LPF and add post-playback SVF seem very good.

if you were to add mix matrix muting (which seems like biggest cpu savings?) how many voices would you propose having?

thank you again for the awesome work on this.

ranch-verdin commented 5 years ago

Have trouble visualising a musical utility to decoupled read/write head! For looping duties, it helps to hear what you're layering on top of (whatever speed it's playing at). If you're looking for unpredictable results, surely there's going to be some hack using new blank buffer to achieve similar/same effect? Given that it's such a technical headache, would (tentatively) suggest coupled write/read head as a fundamental limitation.

Would also suggest that recording to more than two places at once (because two inputs on the device) is probably a rare edgecase with workarounds to achieve same musical aims.

Anyway, amazing work on this - I am just getting back up and running with my toolkit to look at your tsq (planning to move it over into some faust lib). Slightly daunting how advanced this has got since lines code, imagine doing it all in fixed-point, bahaha!

catfact commented 5 years ago

it occurred to me that perhaps 8x rate is overkill

agree, 4x seems like a reasonable limitation?

it also occurs to me that it might be fine to move the softclipper to before the resampler. i had it afterwards initially because of the possibility of small overshoot from cubic interpolation. i'll try and discover if this is actually an issue; if it is, might be enough to set the softclipper output limit a little lower.

Have trouble visualising a musical utility to decoupled read/write head!

i can imagine some things, being able to punch in arbitrarily to a region that isn't being played.

but anyway, we already have decoupled rd/wr heads, in that each "voice" has both rd+wr, with arbitrary offest, and (crucially) all voices actually access the same buffer.

all told, i agree that solution (3) seems like the way to go: no change to the voice structure itself, but add play flag; skip input mixing for any voice with rec=false, skip output bus mixing for any voice with play=false.

additionally, i will add some kind of synchronization mechanism between voices. the simplest would be a command that says, "set voice A position to voice B position [plus offset] at the start of the next block." and then just assume that if their rates are set identically, position drift won't be an issue. i'm not sure this is sufficient but it's very easy to try.

and finally, i will add a second buffer, an option to assign each voice to either buffer, and read_stereo command that reads a stereo soundfile to both buffers.

i don't see any major headaches here, just several minor changes to slog through.

if you were to add mix matrix muting (which seems like biggest cpu savings?) how many voices would you propose having?

errr.. not sure. 6 or 8 doesn't seem unreasonable. agree that it would be good to have some kind of realtime usage report. i guess it would be sufficient to count cycles before and after client block-process method.

ranch-verdin commented 5 years ago

all voices actually access the same buffer

Yes, of course, ok my suggestion doesn't work (in fact, the line of reasoning leads to behaviour in grains module, pretty different)

Here's another way to solve the 'event horizon' problem better than ducking read head volume (hopefully the fact I'm thinking sample-by-sample doesn't invalidate the idea):

cache the 1st order discontinuity introduced at record head 'event horizon' after record head finishes writing buffer samples
as play head crosses the discontinuity, read the necessary correction off write head buffer object, then fade it out over the next millisecond or 2.

catfact commented 5 years ago

ha, yes something like that is definitely what i meant by "complex," but that's a compelling description.

yeah, position logic is all sample-by-sample so something like this could be not too hairy. the details could add up? but i like the idea of only storing the discontinuity (single value.) so in other words, IIUC:

each write head (wrN) stores the difference between the old value in the buffer, and the value it wrote, on each sample. call this value dN = [new value] - [old value].
each read head (rdM) must query the position of each wr. if rdM is crossing wrN, it adds -dN as an offset to its value, and begins fading-out the offset.

is that enough though? it seems like it would really need a longer sample of the old buffer contents - maybe that's what you meant by "correction"? then it gets more involved.

i'm not sure having a few ms of smooth dropout would be so bad - having an arbitrary offset and fadeout could very will produce perceptual silence anyway (by pushing into saturation / DC). and the ducking is so very simple in comparison.

but, i would be stoked if you wanted to roll a proof of concept for the click-repair behavior!

tehn commented 5 years ago

this is pretty much the switch-and-ramp solution which has been implemented in some versions of mlr cutting in the past, right? http://msp.ucsd.edu/techniques/v0.11/book-html/node63.html

it's been a long time, but i think it still introduces a sort of audible artifact, i don't remember exactly but i expect it's varying degrees of low-passed pops (which makes sense) which is certainly better than a full-on discon click.

short volume ducking seems ok to me, short of some way more cpu intensive solution (ie perpetually cross-fading/windowing the the rec head write, which seems crazy) though perhaps some dumb-overkill option could be toggle-able, as we've chosen this path of maybe-you-can-overload-the-cpu?

ranch-verdin commented 5 years ago

Yup it's switch and ramp, and i'm surprised the artefact would be a bad one! Proof of principle experiment (listening test) sounds like my idea of fun, but please don't let that stop progress on the volume ducking work around... Can add switch and ramp as a feature later if the experiment really seems promising

catfact commented 5 years ago

wow, all this time i had a totally different concept of "switch and ramp." i thought it referred to the crossfade of two separate processes (e.g. delay times or filter coefficient sets.) wow. nice to learn something.

@tehn i wonder if old attempts at implementation in MLR were really doing an accurate job. it seems like a tricky thing to get right in max, pre-gen~. but i can also imagine that the effect of this would be a little less predictable than a simple duck using a cosine envelope.

@ranch-verdin i'd be happy if you wanted to give it a shot. i know the current softcut implementation is kindof a lot to take in (there is at least 1 unnecessary layer of abstraction, for example) but i've been trying to keep it pretty clean; LMK if there are unclear parts. but i'm not going to get to any kind of de-clicking attempt until all these other boxes are checked and tested.

the other feature you've mentioned, that i'd like to see a test of, is the idea queuing position changes that come in during a crossfade (instead of ignoring them.) that would go here: https://github.com/catfact/norns/blob/softcut-improvements/crone/src/softcut/SoftCutHead.cpp#L120

(this is pointing to my current branch, which has open PR at #705)

tehn commented 5 years ago

it was years ago and my memory is likely not clear on this--- conceptually it makes sense that the discontinuity should be reduced to a low-passed pop, right? which is also a totally acceptable artifact, and it should be not always a similar sound when this rather-rare event happens (of course, you could make it happen a lot by having a playhead and rec head with similar loops playing in opposing directions, which sounds super good by the way)

@catfact yes i agree any max implementation would've been a little crappy

catfact commented 5 years ago

oh right, i should mention that i've updated the very top of this issue description with some task boxes to track progress

tehn commented 5 years ago

cpu polling was just a suggestion, might not be essential.

ranch-verdin commented 5 years ago

look, code!

https://gist.github.com/ranch-verdin/152f5bcb858c4b58b42d3ad61ae5f786

gcc main.c -ljack

It's a 3 second buffer with read head running double speed. Input is write-head input. first output is de-clicked read head output, second output is the correction (used for my debug), don't plug the correction to headphones, just output 1.

Well it's more of a thump than a pop, don't think it sounds too bad, will compare it to the ducked version tomorrow night maybe... time for sleep!

catfact commented 5 years ago

hey, nice! thanks for doing that!

for comparison, i made a version with ducking instead, here: https://gist.github.com/catfact/0d81ca31c5bf2d199a03ee2564fc9ff7

i did some tests with a sustained, low-harmonics oscillator sound; this is the easiest way for me to hear this kind of artifact.

the switchramped version kinda sounds like the faded portion is at the old rate... or something? (pitched-down, in this case.) weird, can't totally explain that.

in any case its a definite pop / amplitude spike, just softer than a harsh click, and this makes perfect sense - it's basically integrating the discontinuity (approximately.)

i gotta say the duck sounds a little better to me. seems like the implementations are similar in complexity.

please let me know if i'm getting something wrong!

here's the waveforms:

(switchramp) declick-switchramp

(duck) declick-duck

and here's the audio files:

declick-tests.zip

catfact commented 5 years ago

alas.. in trying to integrate de-clicking this i realize a pretty major obstacle which is that each voice is processed in a 128 sample block.

my instinct is that changing this to loop first over samples, then over voices, will cause a significant performance hit (thrashing cache, hindering compiler visibility of inner loop.) but this assumption is worth testing...

ranch-verdin commented 5 years ago

the switchramped version kinda sounds like the faded portion is at the old rate... or something? (pitched-down, in this case.) weird, can't totally explain that.

I do understand why this is - write head keeps updating the correction, even after read head passes it, effectively mixing part of the write head input signal into the read head. To solve that problem with my approach the correction should only computed as the heads pass. Tried that and yes, it still sounds pretty click-ey. Neat tip to use a buzz-ey oscillator, makes the artefact much more noticeable - was using sine waves before.

I am going to try one last twist on this, just out of morbid curiosity...

https://gist.github.com/ranch-verdin/5c640b3441aff14841687e8d3858b9fc

bahahahah! sounds pretty weird on the join when listening to the buzz-ey oscillator and almost an interesting sound on audio program but realistically, yes volume duck is best.

I guess you already realised to scale the width of the duck based on relative head velocities, so it shouldn't affect reading near the write head with heads running same speed. I only just twigged onto that aspect.

Will make myself useful and have a proper look at the cut queueing feature soon (gave it a first inspection earlier this evening)...

catfact commented 5 years ago

I guess you already realised to scale the width of the duck based on relative head velocities,

oh ha, no.... that didn't occur to me at all, great point. dangit...

ranch-verdin commented 5 years ago

A pretty 'correct' solution, which should permit to put read head very near write head at same speed, then change read head speed. Instantaneous head speed changes are now forbidden because they cause the duck width to jump.

In pseudo-code:

Slew_head_velocities_to_targets(5ms slew rate)

duck_width = abs(v_read - v_write); // the eased/slewed velocities

duck_width *= 5ms

If(inside duck region) { Duck(read_sample) }

catfact commented 5 years ago

added a (very basic) sync mechanism.

currently, this is a "one-shot" sync command - it simply sets the position of voice A to that of voice B at the beginning of the next audio buffer. the command takes a 3rd argument which is an offset in seconds.

it follows that you don't want both voices recording to the same buffer in this situation. (but i'll presently add a 2nd buffer for stereo recording applications.)

but there's something odd happening: with 1 voice recording (the leader, in my tests so far,) the follower ends up reading from within the resampling window, causing artifacts, unless a small offset is specified. (-10 samples.)

i don't quite get why this is, since both voices already have a read/write offset. hm. probably something stupid.

in any case, a more robust "sync'd" mode would involve refactoring the main softcut loop to be sample-first, rather than voice-first, as mentioned above. that's not hard to do but should keep an eye on performance impact.

oh, also should mention: the "one-shot" sync doesn't know/care about the relative rate of the voices. so of course they will drift if the rates are different - or in the midst of a slew.

now... its maybe worth noting here that the slew used for rates is the same as that used for levels - a simple 1-pole integrator. it doesn't actually do anything smart like clamp the current value to the target when they are close enough. (there is a gremlin-zapper routine in there, but it was causing weird slowdown and is currently disabled.) so... that could (conceivably) be a source of rounding error and subsequent drift. (yes, this should probably be another issue.)

a proper sync flag would entirely bypass position updating for the sync'd voice.

catfact commented 5 years ago

added second buffer. some commands are new, some have changed. haven't tested everything yet...

tested:

by default, voices use alternating buffers
command to select buffer for voice

not tested!

buffer/readhas been split into read_mono and read_stereo. previously, there was one channel argument (source channel.) now read_mono takes 2 channel args (src, dst) and read_stereo takes no channel args (it reads the first 2 channels of the source into the 2 buffers, or does nothing if the source is mono.)

tested:

buffer/clear commands likewise have channel arguments now. however, i didn't see the need to change existing commands - they simply affect both channels. i added new commands clear_channel and clear_region_channel that take an additional first argument.

ranch-verdin commented 5 years ago

the other feature you've mentioned, that i'd like to see a test of, is the idea queuing position changes that come in during a crossfade (instead of ignoring them.)

Would like to make a more concrete contribution, this looks less intertwined with performance issues than 'write head declicking'...

Pull request to @catfact's working branch:

https://github.com/catfact/norns/pull/2

currently untested (other than checking it compiles), would appreciate any suggestion how to test this thing. I guess you guys must have established some pretty solid testing methods, don't feel much like reinventing the wheel this afternoon...

catfact commented 5 years ago

finished and pushed the branch softcut-loop-deepfirst on my fork https://github.com/catfact/norns/tree/softcut-loop-deepfirst

this refactors voice processing so that samples are looped first, then voices.

there's a definite performance hit. have only tried on laptop so far; went from [29, 32]% to about [33, 36]. i'll compare on norns.

i will think about how to improve this.. like make sure the lambdas are decomposing to FPs in the way i expect.

but i don't think there's an easy fix for this given the amount of stuff that needs to be on stack to process one sample for one voice. the hit is not as bad as my worst fears.

i don't really see a way to get around having to do this loop structure inversion in order to add the listed features - namely read/write ducking, and solid (per-sample) sync between voices.

you guys must have established some pretty solid testing methods

oh, ha.. well not solid enough really, quite ad hoc.

what i do is basically:

build crone on laptop with CMake + CLion. run / debug.
use the supercollider scripts in crone/osc-tests to send OSC directly to the crone process
[repeat]
when it looks good, build with waf on norns.
use the same SC scripts, just sending to norns' IP address
if needed (rarely), use gdbserver to debug remotely.

the SC scripts definitely don't cover all configurations of softcut. not even remotely. would be smart to make a more systematic test procedure but of course its hard with limited development time.

i've rarely taken the time to do matron/lua glue and test with actual norns scripts. i really really need to take the time to make some instead of compulsively working on the DSP side. probably.

the fade queuing looks good! i'll give it a spin. i do wonder just if the behavior is gonna "feel weird' for long fades - getting a change in position up to [fade time] later than you asked for it.

ranch-verdin commented 5 years ago

Definitely 'feels weird' for 2 immediate fade requests and a long fade. If you are trying to start next fade immediately after last one finished, but hit the button slightly too early, my change is good.

catfact commented 5 years ago

still gonna come back to some of these but will open more focused issue(s) (maybe)

monome / norns