Size reduce attempt 2 - Githubissues

yatli commented 4 years ago

A continuation of https://github.com/jmamma/MIDICtrl20_MegaCommand/pull/108

Cherry-picked the most significant items:

remove default font
don't force inline unless we're in ISR, or doing ram bank switch
compact menu layout

Size goes down significantly, from 85% (216718) to 77% (197232) 19486 bytes saving!

yatli commented 4 years ago

https://1drv.ms/x/s!Aql6jSQH9uyZhP1RR_0UWsqnGg_gVw?e=cIIBuV fits nicely with the theory.

jmamma commented 4 years ago

Unfortunately we still use glcdfont.c. That's our default "large font"

It's selected using: oled_display.setFont();

i.e

./Adafruit_GFX.h:    setFont(const GFXfont *f = NULL),

void Adafruit_GFX::write(uint8_t c) {
 #endif
   if (!gfxFont) { // 'Classic' built-in font

yatli commented 4 years ago

ah ok. reverted.

now 198792 bytes (78%)

jmamma commented 4 years ago

ah ok. reverted.

yeah I got excited for a minute there...

Can you clarify

ALWAYS_INLINE vs FORCED_INLINE.

I see you've defined IS_ISR_ROUTINE in select files.

yatli commented 4 years ago

That's the interesting part.

ALWAYS_INLINE is only effective if IS_ISR_ROUTINE is defined, so for any function defined as such, it automatically builds two versions, inlined and non-inlined.

FORCED_INLINE disregards IS_ISR_ROUTINE and always inline.

yatli commented 4 years ago

This is to preserve the critical ISR inlining while freeing up space from the other routines, without writing another copy of the functions.

jmamma commented 4 years ago

This is to preserve the critical ISR inlining while freeing up space from the other routines, without writing another copy of the functions.

yep, awesome.

yatli commented 4 years ago

Could you do some regression testing. I'm not exactly sure the effect of un-inlining things in non critical paths. The most significant changes are shown in the spreadsheet in yellow so that may give you a rough idea.

jmamma commented 4 years ago

Could you do some regression testing. I'm not exactly sure the effect of un-inlining things in non critical paths. The most significant changes are shown in the spreadsheet in yellow so that may give you a rough idea.

Yeah, my anxiety levels already started rising .

jmamma commented 4 years ago

Originally I had intended to inline every call present in the Sequencer ISR path. Which is everything indirectly called by mcl_seq.seq();

So inline all the MD/ExtSeqTrack.seq() calls, including the MidiUart.note_on and CC calls which are called indirectly from MD.triggerTrack() , MD.setTrackParam() etc;

This reasoning behind this was because I was trying to cram the entire sequencer run time within a single ISR event. However, this didnt work in practice. The ISR takes too long to complete before another ISR is triggered, resulting in missed MIDI clock events. So we end up having to re-enabling interrupts anyway.

223   ALWAYS_INLINE() void callCallbacks(bool isMidiEvent = false) {
224     if (state != STARTED)
225       return;
226 
227     static bool inCallback = false;
228     if (inCallback) {
229       DEBUG_PRINTLN("clock collision");
230       return;
231     } else {
232       inCallback = true;
233     }
234 
235 #ifndef HOST_MIDIDUINO
236     sei(); <--- Interrupts re-enabled before sequencer callback is executed.
237 #endif
238 
239     on192Callbacks.call(div192th_counter);
240 
241     if (isMidiEvent) {
242       on96Callbacks.call(div96th_counter);
243 
244       if (mod6_counter == 0) {
245         on16Callbacks.call(div16th_counter);
246         on32Callbacks.call(div32th_counter);
247       }
248       if (mod6_counter == 3) {
249         on32Callbacks.call(div32th_counter);
250       }
251     }
252 
253     inCallback = false;
254   }

Midi/MidiUartParent.hh contains the low level MIDI calls that I wanted forced inlined.

@yatli Are you able to objdump mcl_seq.seq() to see how many subroutines are being executed and compare old code with new?

yatli commented 4 years ago

seq() size change:

Current result shown on the right. If totally un-inline everything the result is shown in the middle. So it's kind of a balance

jmamma commented 4 years ago

I guess we need to benchmark execution times to make an informed decision here.

yatli commented 4 years ago

My rough estimation is that anything bigger than 10 instructions won't be impacted too much if it's pulled out

jmamma commented 4 years ago

My rough estimation is that anything bigger than 10 instructions won't be impacted too much if it's pulled out

What's your reasoning here?

yatli commented 4 years ago

A call is 5 cycles, a ret is 5 cycles. Most instructions are 1 or 2 cycles, so calling into a 10 instruction routine results in 50% speed. As the number of instructions increase, the benefits of inline diminishes. The speed approximates 100%

jmamma commented 4 years ago

A call is 5 cycles, a ret is 5 cycles. Most instructions are 1 or 2 cycles, so calling into a 10 instruction routine results in 50% speed. As the number of instructions increase, the benefits of inline diminishes. The speed approximates 100%

understood.

jmamma commented 4 years ago

Tomorrow I will hook up the scope and profile execution times for seq(). Going to continue on piano_roll for now.

jmamma commented 4 years ago

Just started profiling the performance.

Test scenario.

+    SET_BIT(PORTB, PB5);
     on192Callbacks.call(div192th_counter);
-
+    CLEAR_BIT(PORTB, PB5);

Test branches dev vs size_reduce_2 using same pattern.

Measurement oscilloscope at 500us horizontal scale.

Both dev and size_reduce_2 appear to perform similarly.

1.1 x 500us shortest completion time (0.5ms)

4 * 500us worst case (2ms)

--

From the scope i'm unable to discern any notable difference between the two.

atmega2560 runs at 16MIPS / 16MHZ

therefore it can perform 16000 instructions/ms

@yatli Given that these time scales are orders of magnitude larger than our fastest timer clock++ and that ISR is enabled before on192Callbacks.call We can can actually use the megacommand timer to measure execution times.

jmamma commented 4 years ago

At 128BPM

B = 1 beat is ((1/(128 / 60)) = 0.46875 seconds long

Edit: calcs were wrong.

(B / 4) / 6 = 19.5ms per MIDI_CLOCK

9.75 ms per div192.

jmamma commented 4 years ago

I guess if we wanted to tighten up the sequencer we could pre-calculate the MIDI payload prior to the clock pulse. and then memcpy in to the midi tx buffer on clock (or swap pointer to buffer on clock..).

yatli commented 4 years ago

Does this mean the size reduction doesn't impact performance?

jmamma commented 4 years ago

scope is not revealing anything abnormal between the two.

we should double check with timer method as I was just using my :eyes: and relying on the refresh rate of the digital lcd.

@yatli can I leave that up to you to check? serial.print + excel spreadsheets :D . You'll need to store ISR execution times in RAM, and then print during main program loop (outside of ISR)

yatli commented 4 years ago

Sure! Benchmarks heh. All you need to write papers (

yatli commented 4 years ago

https://1drv.ms/x/s!Aql6jSQH9uyZhP1T8-w7P7Ru-RtZag?e=6Sa4eI

Not much of a difference.

yatli commented 4 years ago

The two compared branches:

https://github.com/jmamma/MIDICtrl20_MegaCommand/tree/benchmark

https://github.com/jmamma/MIDICtrl20_MegaCommand/tree/dev_benchmark

I find debugging more enjoyable on DiagnosticPage than serial port(

yatli commented 4 years ago

jmamma commented 4 years ago

microseconds ?

yatli commented 4 years ago

yes

yatli commented 4 years ago

more notes = more jittering.

blocking on uart?

jmamma commented 4 years ago

Jitter as in ISR completion jitter?

Yeah one possible cause could be the UART transmit interrupt triggering more often because of more data on the buffer. So you'd get a bit of thrashing between the Seq ISR and UART transmit ISR.

jmamma commented 4 years ago

Jitter as in ISR completion jitter?

Yeah one possible cause could be the UART transmit interrupt triggering more often because of more data on the buffer. So you'd get a bit of thrashing between the Seq ISR and UART transmit ISR.

yatli commented 4 years ago

Yeah. The benchmark measures the time of div192 callback. And the sampled time is drawn on screen periodically. I then hit play/stop multiple times to get the readings. The "jittering" happens like in 1 of 10 times.

yatli commented 4 years ago

I think this is good to merge now.

jmamma commented 4 years ago

Yes, I didn't notice any issues in testing.

Will merge now.

jmamma / MCL

Size reduce attempt 2 #125