i2csoft: count cycles for more accurate timing

This PR adds a new delay package that implements cycle-accurate timing, and changes i2csoft to use this package. This fixes https://github.com/tinygo-org/drivers/issues/557.

Here is the rest of the commit message with some interesting details:

This is implemented in inline assembly, using machine.CPUFrequency() to determine how long a single CPU cycle takes. As long as the delay function is called with a constant duration, it should be fully inlined and all values can be const-propagated, resulting in very tight inline assembly.
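
As a rough illustration of the arithmetic involved (this is not the code from the PR), here is a plain-Go sketch that runs off-target. The helper names `cyclesFor` and `loopsFor`, the 48 MHz clock, the 8 cycles per loop, and the round-up behavior are all assumptions made for the example; on a real board the frequency would come from machine.CPUFrequency().

```go
package main

import (
	"fmt"
	"time"
)

// cyclesFor converts a duration into a number of CPU cycles at the given
// clock frequency. Hypothetical helper, not part of the delay package.
func cyclesFor(d time.Duration, cpuHz uint64) uint64 {
	return uint64(d.Nanoseconds()) * cpuHz / 1_000_000_000
}

// loopsFor converts a cycle count into loop iterations, rounding up so the
// delay is never shorter than requested (an assumed, conservative choice).
func loopsFor(cycles, cyclesPerLoop uint64) uint64 {
	return (cycles + cyclesPerLoop - 1) / cyclesPerLoop
}

func main() {
	const cpuHz = 48_000_000 // assumed clock frequency for this example
	const cyclesPerLoop = 8  // assumed cost of one delay loop iteration

	cycles := cyclesFor(5*time.Microsecond, cpuHz)
	fmt.Println(cycles, loopsFor(cycles, cyclesPerLoop)) // prints "240 30"
}
```

When every input is a compile-time constant, as here, a compiler can fold the whole calculation down to a single loop count.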

For example, when I convert i2csoft to use this delay function, the entire delay function compiles to something like this:

```
8784:   movs    r6, #100
8786:   mov     r0, r6
8788:   nop
878a:   nop
878c:   nop
878e:   nop
8790:   nop
8792:   subs    r0, #1
8794:   bne     0x8788
```

That means that all the math to calculate the number of cycles is entirely optimized away (in this case, to 100 loops).
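
The loop above can be sanity-checked by counting cycles per iteration. The sketch below assumes ARM's published instruction timings: 1 cycle each for nop and subs, and 2 cycles for a taken conditional branch on Cortex-M0+ versus 3 on Cortex-M0. The helper `loopCycles` is hypothetical, and the count ignores the final, not-taken branch.

```go
package main

import "fmt"

// loopCycles returns the cost of one iteration of the delay loop:
// five nops, one subs, and one taken bne.
func loopCycles(nopCycles, subsCycles, takenBranchCycles int) int {
	return 5*nopCycles + subsCycles + takenBranchCycles
}

func main() {
	m0plus := loopCycles(1, 1, 2) // assumed Cortex-M0+ timings
	m0 := loopCycles(1, 1, 3)     // assumed Cortex-M0 timings

	fmt.Println(m0plus, m0)   // prints "8 9"
	fmt.Println(100 * m0plus) // prints "800": cycles for 100 loops on Cortex-M0+

	// Extra time per iteration on Cortex-M0 relative to Cortex-M0+.
	extra := float64(m0-m0plus) / float64(m0plus) * 100
	fmt.Printf("%.1f%%\n", extra) // prints "12.5%"
}
```

Under these assumptions, a loop budgeted at 8 cycles per iteration takes 9 cycles on a Cortex-M0, which accounts for an expected slowdown of about 12%.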

I ran the example on a few boards to see how well it works:

| board | 100ms wait | CPU core |
|---|---|---|
| microbit | 121.6ms | Cortex-M0 so it has 12% overhead |
| circuitplay-express | 100.1ms | Cortex-M0+ so it is cycle accurate |
| pico | 100.2ms | Cortex-M0+ |
| pyportal | 100.3ms | Cortex-M4 |
| circuitplay-bluefruit | 125.8ms | Cortex-M4 |
| esp8266 | 125.1ms | |

This shows that there is some loop overhead because of conservative estimates. Note that even though the overhead can reach about 25%, the actual overhead per delay.Sleep() call is very small. It should be good enough for software I2C at least, and can potentially be improved in the future.
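
For reference, the relative overhead of each measurement can be computed directly from the numbers above. This is a small plain-Go sketch; the helper `overheadPercent` is hypothetical and the values are copied from the measurements in this description.

```go
package main

import "fmt"

// overheadPercent returns how much longer the measured wait was than the
// requested wait, as a percentage of the requested wait.
func overheadPercent(measuredMs, requestedMs float64) float64 {
	return (measuredMs - requestedMs) / requestedMs * 100
}

func main() {
	// Measured durations for a requested 100ms wait, per board.
	results := []struct {
		board      string
		measuredMs float64
	}{
		{"microbit", 121.6},
		{"circuitplay-express", 100.1},
		{"pico", 100.2},
		{"pyportal", 100.3},
		{"circuitplay-bluefruit", 125.8},
		{"esp8266", 125.1},
	}
	for _, r := range results {
		fmt.Printf("%-22s %5.1f%%\n", r.board, overheadPercent(r.measuredMs, 100))
	}
}
```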

i2csoft: count cycles for more accurate timing #562