loboris / MicroPython_K210_LoBo

MicroPython implementation for Kendryte K210

Firmware: Seems to run slow #1

Status: Closed (robert-hh closed 2 years ago)

robert-hh commented 5 years ago

I tried to run a file called runtest.py (copied below), which was used a while ago in the MP forum for speed tests. The results are odd. For instance, the add test takes about 15 seconds at 400 MHz and 25 seconds at 240 MHz, compared to about 3 seconds on the Sipeed image, about 1 second on a Pyboard, and 2 seconds (non-SPIRAM) or 3 seconds (SPIRAM) on an ESP32. The numbers are not fully comparable, since the K210 uses 64-bit numbers instead of 32-bit, but 15 seconds is just strange.

import time
import machine

def pi(places=100):
    # 3 + 3*(1/24) + 3*(1/24)*(9/80) + 3*(1/24)*(9/80)*(25/168)
    # The numerators 1, 9, 25, ... are given by (2x + 1) ^ 2
    # The denominators 24, 80, 168 are given by (16x^2 -24x + 8)
    extra = 8
    one = 10 ** (places+extra)
    t, c, n, na, d, da = 3*one, 3*one, 1, 0, 0, 24

    while t > 1:
        n, na, d, da = n+na, na+8, d+da, da+32
        t = t * n // d
        c += t
    return c // (10 ** extra)

def pi_test(n=5000):
    t1=time.ticks_ms()
    t=pi(n)
    t2=time.ticks_ms()
    print('Pi test: ', time.ticks_diff(t2,t1)/1000, 's')

def add_test(n=1000000, a = 1234, b = 5678):
    t1=time.ticks_ms()
    sum = 0
    for i in range(n):
        sum = a + b
    t2=time.ticks_ms()
    print('Add test: ', time.ticks_diff(t2,t1)/1000, 's')

def mul_test(n=1000000, a = 1234, b = 5678):
    t1=time.ticks_ms()
    sum = 0
    for i in range(n):
        sum = a * b
    t2=time.ticks_ms()
    print('Mul test: ', time.ticks_diff(t2,t1)/1000, 's')

def div_test(n=1000000, a = 1234, b = 5678):
    t1=time.ticks_ms()
    sum = 0
    for i in range(n):
        sum = a / b
    t2=time.ticks_ms()
    print('Div test: ', time.ticks_diff(t2,t1)/1000, 's')

print('Speed test')
try:
    print('System freq: {:.1f} MHz'.format(machine.freq()[0]/1000000))
except TypeError:  # machine.freq() returns a plain int on some ports
    print('System freq: {:.1f} MHz'.format(machine.freq()/1000000))

add_test()
mul_test()
div_test()
pi_test()

loboris commented 5 years ago

This is not a matter of calculation speed but of loop execution speed. There is a VM hook enabled (MICROPY_VM_HOOK_LOOP) which handles keyboard interrupts during bytecode execution, but in a very inefficient way. Without it enabled the times are:

System freq: 398.7 MHz
Add test:  2.834 s
Mul test:  3.183 s
Div test:  3.072 s
Pi test:  4.435 s

System freq: 806.0 MHz
Add test:  1.401 s
Mul test:  1.573 s
Div test:  1.471 s
Pi test:  2.173 s

I'll try to implement this hook in a different, more efficient way. Some other optimizations can also be done to improve the performance...
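
To check how much of a test's runtime is loop overhead rather than arithmetic, an empty-loop baseline can be timed next to the add loop. The following is only a minimal sketch: the names `timed_loop` and `loop_overhead_report` are made up for illustration, and the fallback to `time.perf_counter()` is an assumption so that the same code also runs on desktop Python, where `time.ticks_ms()` does not exist.

```python
import time

# Assumption: use MicroPython's tick functions when present,
# otherwise fall back to CPython's perf_counter().
try:
    ticks_ms = time.ticks_ms
    ticks_diff = time.ticks_diff
except AttributeError:
    def ticks_ms():
        return int(time.perf_counter() * 1000)

    def ticks_diff(end, start):
        return end - start


def timed_loop(n, do_add, a=1234, b=5678):
    """Return elapsed milliseconds for n iterations, with or without an add."""
    total = 0
    t1 = ticks_ms()
    if do_add:
        for i in range(n):
            total = a + b
    else:
        for i in range(n):
            pass
    return ticks_diff(ticks_ms(), t1)


def loop_overhead_report(n=100000):
    """Print and return the timings of the empty loop and the add loop."""
    empty = timed_loop(n, False)
    add = timed_loop(n, True)
    print('Empty loop:', empty, 'ms')
    print('Add loop:  ', add, 'ms')
    return empty, add


loop_overhead_report()
```

If the two timings are close to each other, most of the time is spent in per-iteration work of the VM (which is where a loop hook would run) and not in the addition itself.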

robert-hh commented 5 years ago

OK. Minor differences do not matter. The fastest device in that test is still the PyBoard, mostly thanks to its traditional internal RAM. At 800 MHz the K210 comes close, though at much higher power consumption.

loboris commented 5 years ago

I've tested a different approach for the keyboard interrupt and it now works without impacting the performance. I'll test it some more and commit it tomorrow.

BTW, the floating point operations are faster than on the PyBoard, especially considering that the fp operations are double precision...
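
As a quick way to confirm which float precision a given firmware build uses, the loop below halves a value until adding it to 1.0 no longer changes the result, and counts the steps. This is only a sketch, and the function name `float_mantissa_bits` is made up for illustration. A build with IEEE 754 double-precision floats should report 53; single-precision builds report fewer bits.

```python
def float_mantissa_bits():
    """Count the halvings of eps for which 1.0 + eps still differs from 1.0."""
    bits = 0
    eps = 1.0
    while 1.0 + eps != 1.0:
        eps /= 2
        bits += 1
    return bits


print('Float significand bits:', float_mantissa_bits())
```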

robert-hh commented 5 years ago

I've noticed that. The Pi test, which makes heavy use of floats, runs much faster. And there are more goodies in having 64-bit ints and double-precision floats. About the change: do not hurry. I just got another task to accomplish from my daughter. That's the topmost priority.