pycom / pycom-micropython-sigfox

A fork of MicroPython with the ESP32 port customized to run on Pycom's IoT multi-network modules.
MIT License
199 stars 167 forks source link

How to debug Loosing 100ms every 29 seconds #404

Open dane-if3 opened 4 years ago

dane-if3 commented 4 years ago

WiPy 2.0, 1.20.2.rc3-0.8.0-vanilla-squirrel

I have a machine that runs a main loop. The loop usually takes less than 50ms and I sleep the difference to make it 50ms Every 29 seconds the loop takes around 100ms longer. I have removed threads from the code. I have disabled interrupts where possible. I have timed the main parts to find where this is happening. I have noticed that it doesnt happen in the same pice of code. it occurs in all areas with a probability that generally matches the average run time of the portion of code. Any idea what this is or how to debug it or is it something I have to live with?

Below is the loop code and output. The last item in the long loop time is ticks_ms the min, max and average interval betweein long loops is 28.236, 29.322, 29.065 in seconds the first number is the total loop time, Sensor, State, PID, Modbus are the times for the different code areas. You can see that the 100ms moves around from section to section. Because a lot was occurring in the sensors I timed the read of each and the remaining part reports the sensor read times. you can see the 100ms occurs in different sensors. (For completeness the force and displacement sensors, f & Si are SPI read once and the 2 temps and displacement sensor, T1 T2 S, are internal ADC read several times and averaged)

I am not sure where to go from here, I found it chasing a different problem with sockets not responding but this appears to be unrelated. If I cant fix/control it I need to slow the machine down to account for a 3x longer control loop. ie if I am applying a force I could crush the object before I can measure the sensor again.

Main loop: `try: while not E_Stop():

wdt.feed()

    ### Don't cycle quicker than the minimum loop time.
    t = utime.ticks_ms()
    looptime_period = utime.ticks_diff(t,looptime_last)
    looptime_last = t
    if looptime_period > 2*looptime_min:
        print('long loop {}. Sensor {}, State {}, PID {}, Modbus {}. Sensor times {} {}'.format(looptime_period, sensorUpdateTime,stateUpdateTime,pidUpdateTime,ModbusUpdateTime, test_machine.sensors.ts, t))
    utime.sleep_ms(max(1,looptime_min-looptime_period))

    irq_state=machine.disable_irq()
    ### Update the sensor data
    sensorUpdateTime=-utime.ticks_ms()
    with a_lock:  #print("a_lock is locked while this executes")
        test_machine.sensors.update() # will throw a hard limit error finally can sort it out at the momnet need to handle this proprerly
    sensorUpdateTime+=utime.ticks_ms()

    ### Update/Run the active state
    stateUpdateTime=-utime.ticks_ms()
    if not procedure_manager.runningPocedure.update_state(): break
    stateUpdateTime+=utime.ticks_ms()

    ### Update the controllers
    pidUpdateTime=-utime.ticks_ms()
    test_machine.heater.update()
    test_machine.actuator.update()
    pidUpdateTime+=utime.ticks_ms()
    machine.enable_irq(irq_state)

    ### Output Modbus data
    ModbusUpdateTime=-utime.ticks_ms()
    server.poll(5)
    ModbusUpdateTime+=utime.ticks_ms()

    if i % 500==0:
        print('Active Connections:',server.active_connections)
    i=i+1

Output with and with out Modbus connection but machine idle (reading sensors but PID loops inactive) INFO:None: Starting Modbus Server INFO:None: Where is the EStop button INFO:None: Entering Main Loop long loop 132. Sensor 122, State 1, PID 0, Modbus 7. Sensor times {'f': 115, 'T2': 2, 'Si': 2, 'T1': 1, 'S': 2} 47514 Active Connections: [] New connection Client address:('192.168.xxx.xxx', 55325) socket: Active Connections: [[, 70044]] long loop 147. Sensor 124, State 0, PID 1, Modbus 21. Sensor times {'f': 3, 'T2': 114, 'Si': 2, 'T1': 2, 'S': 2} 76337 Active Connections: [[, 90390]] long loop 144. Sensor 123, State 0, PID 1, Modbus 6. Sensor times {'f': 2, 'T2': 1, 'Si': 114, 'T1': 2, 'S': 1} 104624 Connection closed due to inactivity Active Connections: [] New connection Client address:('192.168.xxx.xxx', 55326) socket: Active Connections: [[, 128630]] long loop 151. Sensor 123, State 0, PID 1, Modbus 20. Sensor times {'f': 2, 'T2': 1, 'Si': 114, 'T1': 2, 'S': 2} 133314 Active Connections: [[, 148945]] Connection closed due to inactivity long loop 164. Sensor 122, State 1, PID 0, Modbus 6. Sensor times {'f': 2, 'T2': 2, 'Si': 2, 'T1': 1, 'S': 114} 161991 Active Connections: [] New connection Client address:('192.168.xxx.xxx', 55327) socket: Active Connections: [[, 185983]] long loop 182. Sensor 10, State 0, PID 0, Modbus 143. Sensor times {'f': 3, 'T2': 1, 'Si': 2, 'T1': 2, 'S': 1} 190457 Active Connections: [[, 206432]] long loop 178. Sensor 125, State 0, PID 1, Modbus 20. Sensor times {'f': 2, 'T2': 2, 'Si': 2, 'T1': 1, 'S': 117} 218693 Active Connections: [[, 226536]] long loop 159. Sensor 11, State 1, PID 0, Modbus 146. Sensor times {'f': 3, 'T2': 2, 'Si': 2, 'T1': 1, 'S': 2} 247033 Active Connections: [[, 247153]] Active Connections: [[, 267609]] long loop 178. Sensor 123, State 1, PID 0, Modbus 27. Sensor times {'f': 3, 'T2': 1, 'Si': 2, 'T1': 114, 'S': 2} 275340 Active Connections: [[, 288200]] long loop 148. Sensor 127, State 1, PID 0, Modbus 20. Sensor times {'f': 5, 'T2': 116, 'Si': 1, 'T1': 1, 'S': 2} 303591 Active Connections: [[, 308659]] Active Connections: [[, 329192]] long loop 178. Sensor 10, State 0, PID 113, Modbus 28. Sensor times {'f': 3, 'T2': 1, 'Si': 2, 'T1': 2, 'S': 1} 331911 Active Connections: [[, 349472]] [Errno 104] ECONNRESET long loop 137. Sensor 123, State 0, PID 1, Modbus 5. Sensor times {'f': 115, 'T2': 1, 'Si': 2, 'T1': 2, 'S': 1} 360276 Active Connections: [] Active Connections: [] long loop 133. Sensor 122, State 1, PID 0, Modbus 7. Sensor times {'f': 3, 'T2': 2, 'Si': 1, 'T1': 114, 'S': 2} 389571 Active Connections: [] Active Connections: [] long loop 152. Sensor 122, State 1, PID 0, Modbus 6. Sensor times {'f': 2, 'T2': 114, 'Si': 2, 'T1': 2, 'S': 1} 418893 Active Connections: [] long loop 149. Sensor 122, State 1, PID 0, Modbus 5. Sensor times {'f': 2, 'T2': 2, 'Si': 2, 'T1': 1, 'S': 2} 448177 Active Connections: [] Active Connections: [] long loop 151. Sensor 122, State 0, PID 1, Modbus 5. Sensor times {'f': 3, 'T2': 2, 'Si': 1, 'T1': 114, 'S': 2} 477496 Active Connections: [] Active Connections: [] long loop 148. Sensor 122, State 1, PID 0, Modbus 7. Sensor times {'f': 3, 'T2': 114, 'Si': 2, 'T1': 1, 'S': 2} 506785 Active Connections: [] Active Connections: [] long loop 154. Sensor 122, State 0, PID 1, Modbus 5. Sensor times {'f': 2, 'T2': 2, 'Si': 1, 'T1': 2, 'S': 113} 536068 Active Connections: [] long loop 133. Sensor 122, State 0, PID 0, Modbus 5. Sensor times {'f': 3, 'T2': 2, 'Si': 113, 'T1': 2, 'S': 1} 565379 Active Connections: [] Active Connections: [] long loop 149. Sensor 10, State 0, PID 112, Modbus 5. Sensor times {'f': 3, 'T2': 2, 'Si': 1, 'T1': 2, 'S': 1} 594669 Active Connections: [] Active Connections: [] long loop 153. Sensor 122, State 0, PID 1, Modbus 5. Sensor times {'f': 3, 'T2': 2, 'Si': 1, 'T1': 114, 'S': 2} 623961 Active Connections: [] Active Connections: [] long loop 138. Sensor 122, State 1, PID 0, Modbus 6. Sensor times {'f': 3, 'T2': 114, 'Si': 2, 'T1': 1, 'S': 1} 653257 Active Connections: [] Active Connections: [] long loop 137. Sensor 122, State 0, PID 1, Modbus 5. Sensor times {'f': 2, 'T2': 2, 'Si': 2, 'T1': 1, 'S': 114} 682482 Active Connections: [] long loop 150. Sensor 122, State 0, PID 1, Modbus 5. Sensor times {'f': 2, 'T2': 2, 'Si': 113, 'T1': 2, 'S': 2} 711773 Active Connections: [] Active Connections: [] long loop 143. Sensor 10, State 1, PID 112, Modbus 5. Sensor times {'f': 3, 'T2': 2, 'Si': 1, 'T1': 2, 'S': 2} 741034 Active Connections: [] Active Connections: [] long loop 136. Sensor 123, State 1, PID 0, Modbus 5. Sensor times {'f': 3, 'T2': 1, 'Si': 2, 'T1': 114, 'S': 2} 770328 Active Connections: [] Active Connections: [] long loop 145. Sensor 123, State 1, PID 0, Modbus 6. Sensor times {'f': 3, 'T2': 115, 'Si': 2, 'T1': 2, 'S': 1} 799613 Active Connections: [] long loop 132. Sensor 121, State 1, PID 0, Modbus 5. Sensor times {'f': 3, 'T2': 2, 'Si': 1, 'T1': 2, 'S': 113} 828854 Active Connections: [] Active Connections: [] long loop 134. Sensor 122, State 1, PID 0, Modbus 7. Sensor times {'f': 3, 'T2': 2, 'Si': 113, 'T1': 2, 'S': 2} 858153 Active Connections: [] Active Connections: [] long loop 141. Sensor 9, State 1, PID 112, Modbus 8. Sensor times {'f': 3, 'T2': 2, 'Si': 1, 'T1': 2, 'S': 1} 887419 Active Connections: [] Active Connections: [] long loop 141. Sensor 122, State 0, PID 1, Modbus 5. Sensor times {'f': 3, 'T2': 2, 'Si': 1, 'T1': 114, 'S': 2} 916702 Active Connections: [] long loop 140. Sensor 123, State 0, PID 1, Modbus 5. Sensor times {'f': 3, 'T2': 114, 'Si': 2, 'T1': 2, 'S': 1} 945993 Active Connections: [] Active Connections: [] long loop 148. Sensor 122, State 0, PID 1, Modbus 5. Sensor times {'f': 2, 'T2': 2, 'Si': 1, 'T1': 1, 'S': 114} 975257 Active Connections: [] Active Connections: [] long loop 156. Sensor 122, State 1, PID 0, Modbus 5. Sensor times {'f': 2, 'T2': 2, 'Si': 113, 'T1': 2, 'S': 2} 1004544 Active Connections: [] Active Connections: [] long loop 148. Sensor 10, State 0, PID 1, Modbus 118. Sensor times {'f': 2, 'T2': 2, 'Si': 1, 'T1': 1, 'S': 2} 1033816 Active Connections: [] Active Connections: [] long loop 145. Sensor 123, State 0, PID 1, Modbus 5. Sensor times {'f': 4, 'T2': 1, 'Si': 2, 'T1': 114, 'S': 2} 1063094 Active Connections: [] long loop 148. Sensor 124, State 0, PID 1, Modbus 5. Sensor times {'f': 3, 'T2': 115, 'Si': 2, 'T1': 2, 'S': 1} 1092368 Active Connections: [] Active Connections: [] long loop 144. Sensor 121, State 1, PID 0, Modbus 6. Sensor times {'f': 3, 'T2': 1, 'Si': 1, 'T1': 2, 'S': 113} 1121626 Active Connections: [] Active Connections: [] long loop 137. Sensor 125, State 0, PID 1, Modbus 5. Sensor times {'f': 5, 'T2': 1, 'Si': 115, 'T1': 2, 'S': 2} 1150944 Active Connections: [] Active Connections: [] long loop 141. Sensor 10, State 0, PID 0, Modbus 120. Sensor times {'f': 3, 'T2': 2, 'Si': 1, 'T1': 2, 'S': 1} 1180195 Active Connections: [] long loop 141. Sensor 122, State 0, PID 1, Modbus 5. Sensor times {'f': 3, 'T2': 2, 'Si': 1, 'T1': 114, 'S': 2} 1209482 Active Connections: [] Active Connections: [] long loop 138. Sensor 130, State 1, PID 0, Modbus 5. Sensor times {'f': 11, 'T2': 114, 'Si': 2, 'T1': 1, 'S': 1} 1238773 Active Connections: [] Active Connections: [] long loop 152. Sensor 122, State 0, PID 1, Modbus 5. Sensor times {'f': 2, 'T2': 2, 'Si': 2, 'T1': 1, 'S': 113} 1268020 Active Connections: [] Active Connections: [] long loop 139. Sensor 122, State 0, PID 1, Modbus 5. Sensor times {'f': 3, 'T2': 2, 'Si': 113, 'T1': 2, 'S': 1} 1297330 Active Connections: [] `

amotl commented 4 years ago

Dear @dane-if3,

while that might happen due to various reasons, I see that you tried hard to mitigate a bunch of them already. However, one specific thing comes to mind: The garbage collector.

You might want to experiment controlling the garbage collector [1] by disabling it using gc.disable() and invoke it manually within each cycle using gc.collect().

With kind regards, Andreas.

[1] https://docs.micropython.org/en/latest/library/gc.html

dane-if3 commented 4 years ago

Thanks Andreas, I will try it out on Monday and report back

dane-if3 commented 4 years ago

@amotl, You were correct it is the garbage collector. And it is slow so calling it manually more frequently makes the problem worse. Eg each gc.collect call in the code below takes 69ms with 32 bytes to clear. My other code takes 110ms to free 2.5M Checking to see if it should be called is not an option as gc.mem_free take 67ms a=-utime.ticks_ms() gc.mem_free() a+=utime.ticks_ms() print(a)

I thought del was supposed to deallocate the memory and also that the memory would be deallocated when the reference count is 0. It doesn't seem to work like that. It doesn't come back until garbage collection runs.

When I read a sensor, I read the raw value several times into a list then process the list to remove some noise. Each time I call read I create a list which I thought deallocated when the function returned. Am I right in saying that the memory will not come back untill gc.collect() runs and there is no other way to deallocate memory?

If I call read on 5 sensors every 50ms it takes 29 seconds to fill the 2.5MB of free memory I have available. I can have it back but it will cost 100+ms. The machine is applying force to samples. If the object is rigid having a loop of 3x normal screws up the control system and can result in an accidental overload. Options maybe to stop the actuator periodically and run gc.collect() but I need to guess when because gc.mem_free is also slow or reuse instance variables instead of ones local to the function or loops or as I only have a single thread maybe class level memory space that can be reused.

` def test(): raw = [] del raw

print(raw)

if name == "main": import gc, utime m1=0 m2=0 m3=0 while True: gc.collect() # clear any residual garbage m1 = gc.mem_free() # record free memory test() # call a method with local scope vairables m2 = gc.mem_free() # record free memory (should be the same as m1) gc.collect() # collect garbage (shouldn't be anything to do) m3 = gc.mem_free() # record free memory (should be the same as m1) print(m2-m1,m3-m1, (m1,m2,m3)) # change in free memory due to running method, free memory after garbage collection, (m1,m2,m3) utime.sleep_ms(1000) # comment out the gc.collect lines and watch the memory leak away `

amotl commented 4 years ago

Hi again,

you seem to be doing heavy things on MicroPython/ESP32. We learned the ESP32 is not the fastest piece on earth when it comes to RAM access performance. Seeing your demands on running a controller over an extended period of time under respective time constraints, you probably should not allocate/deallocate any memory at all within your control loop but go for static memory allocation.

I've also attached some curated links which you might find interesting to read through.

With kind regards, Andreas.

robert-hh commented 4 years ago

When doing some actions in a timed fashion, using delays in a loop seem not appropriate to me, because timing errors pile up. Better use a timer callback, which executes your function in a regular period. Even if the callbacks have some jitter in the start time, the basic clock interval is stable, and a short delay in one activation will not interfere with the following ones. In my tests, the jitter was <500µs for a device with SPIRAM and <200µs for a device w/o SPIRAM.

dane-if3 commented 4 years ago

@amotl Thank you for your time and patience. I had not read some of the information you have provided and I was unaware of the OpenMV project. I see I have a misunderstanding of how things work and I have several places I can improve efficiency. However several documents referenced the time it takes to run garbage collection:

On the Wipy 3 and FiPy the below takes 33ms, 40ms gc.collect() # clear any residual garbage t1=-utime.ticks_ms() gc.collect() # time a collect with no garbage to collect print(t1+utime.ticks_ms()) If there is more garbage it takes longer. When the memory is full it can be well over 100ms This is a lot longer than 1ms to several ms quoted in the reference docs. Maybe the info in the docs is old and the boards used had much less ram than the pycom bords I have and they didn't have SPIRAM. Perhaps gc doesn't scale well as ram grows? Shoud gc take this long or is there a bug?

@robert-hh I agree with you. I was using timers, threads and interrupts but I had issues with sockets I was (am still) trying to debug so I simplified the code to make it easier to figure out. I agree with what you are saying and your jitter measurements are encouraging. However I don't think using timers will help this particular problem. My current understanding is that because I am not using static memory for functions in my control loops I am allocating small amounts of new memory every time I call them. (I utilise a lot of small lists and iterators and list compressions) Despite them all going out of scope the memory is not deallocated until garbage collection is run. These functions are called a lot, so every 29 seconds the available memory runs out and before the next variable can be initialised garbage collection runs. This takes about 100ms and effectively occurs randomly somewhere mid function in the control loop. (This is the intended behaviour for gc that I didn't fully understand) The effect on loop timing is, the loop where the garbage collection runs will last 100ms longer than the others. Depending on where in the loop gc runs, it could be before or after the PIDs have been calculated and/or the PWM duty cycle has been set. It may work out that the loop ran for 150ms with the integral and derivative PID term calculated with a 50ms loop expected followed by a 50ms loop with PID terms calculated on a 150ms expected loop. (I have been using the last loop length as the dt for the PID rather than fixing it. I could halve the problem by fixing it to 50ms) Mostly this effect is not noticeable except when it coincides with applying a force by a linear transducer. if it happens near the limit or while the force is being maintained it can cause overshoot. This is possibly contributing to a test piece failing the test due to excessive deformation. As I have come to understand the problem better I can see there are many things I should be doing that would mitigate it. If I can resolve my other socket bug I may well return to timers. I would like to know if the gc.collect times I am seeing are reasonable or if not how I can help narrow down the problem

Again thank you both for your help and patience I am still new to micropython and learning a lot. Regards, Dane