pycom / pycom-micropython-sigfox

A fork of MicroPython with the ESP32 port customized to run on Pycom's IoT multi-network modules.
MIT License
198 stars 167 forks source link

Memory leak #52

Closed syndycat closed 7 years ago

syndycat commented 7 years ago

LOPY 1.7.9.b3. if you run the below code, you will see that after many loops it will raise "memory error". To reproduce in fastest way set SLEEP= 0.5. Same result is with gc.enable().

import gc
import utime
from machine import Timer

class Object:

    def __init__(self, count=0):
        print ("Object INIT ", self)
        self._timer = Timer.Alarm(self._timerCallback, 0.5, periodic=False)

    def __del__(self):
        print ("Object DEL ", self)

    def _timerCallback(self, timer):    
        print("Timer finish " ,  self)
        self._timer.cancel()
        del self._timer
        self._timer = None

print("Start memory %s" %(gc.mem_free()))
SLEEP = 1.5
gc.disable()
while True:
    obj = Object(100)

    utime.sleep(SLEEP)
    gc.collect()
    print("\t\tmemory %s" %(gc.mem_free()))
robert-hh commented 7 years ago

Hello @syndycat and @daniel. I followed the source up to the point where I found the reason for the memory drain. Whenever you register a callback., it will be added by mp_irq_handler_add() to a list of irq handler. There is no mp_irq_handler_remove() or the like, so the list only grows. You can verify that by just not setting a callback function in the creation of the timer. If it expires, it will deregister itself, so timer.cancel() is not needed. I started to look for a proper place to remove the irq handler, so let's see how far I get..

robert-hh commented 7 years ago

I have an initial working version stopping the memory drain. It has changes in three files esp32/mods/machtimer_alarm.c, esp32/util/mpirq.c and esp32/util/mpirq.h. It is essential to call timer.cancel() or timer.callback(None) to remove the callback handler.

import gc
import utime
from machine import Timer

class Object:

    def __init__(self, count=0):
        print ("Object INIT ", self)
        self._timer = Timer.Alarm(self._timerCallback, None, ms=500, periodic=False)

    def _timerCallback(self, timer):
        print("Timer finish " ,  self)
        self._timer.cancel()

gc.collect()
now = gc.mem_free()
print("Start memory %d" %(now))
SLEEP = 1.0
while True:
    obj = Object(100)

    utime.sleep(SLEEP)
    del obj
    gc.collect()
    then = gc.mem_free()
    print("\t\tmemory %d - %d" %(now, now - then))
    now = then

machtimer_alarm.zip

Update: I let it run for ~10 hours (36000 iterations), and the free memory just varied between 54016 and 53752 bytes.

robert-hh commented 7 years ago

I polished both a little bit, such that setting the callback a second time removes the old callback register too. That requires initialization of the object, both for pin and timer. I searched for other places, and found until now only similar code in the bluetooth and lora modules. I'm still not really convinced that this is needed, and if frequent reassigning of handlers is a popular use case. If yes, I'll file a pull request.

robert-hh commented 7 years ago

After thinking about it, it came to my mind, that this method causes another problem. There is just one list of IRQ handlers, which does not have duplicates. If the same handler is called twice for different purposes, and the first event removes the handler, then is is also lost for the other ones. if duplicates in the list are allowed, then instead a code setting the same callback over and over again would cause a memory drain. Conclusion: leave it as it is, and go for a Python approach of the same callback handler with different contexts. This is the sample code:

import gc
import utime
from machine import Timer

def timerCallback(obj):
    print("Timer callback", obj.value)
    obj._timer.cancel()

class Object:

    def __init__(self, count=0):
        print ("Object INIT ", self)
        self._timer = Timer.Alarm(timerCallback, None, ms=500, periodic=False, arg=self)

    def setValue(self, value):
        self.value = value

def run():
    gc.collect()
    now = gc.mem_free()
    print("Start memory %d" %(now))
    SLEEP = 1.0
    for j in range(200):
        obj = Object(100)
        obj.setValue(j)

        utime.sleep(SLEEP)
        gc.collect()
        then = gc.mem_free()
        print("\t\tmemory %d - %d" %(now, now - then))
        now = then
run()

The callback handler simply gets the class instance as argument. Just one handler is required, which still has access to a varying context. But: also this variant spoils memory, but that can be fixed with a very simple non-intrusive code fix. Here's the fix for mods/machtimer_alarm.c:

STATIC mp_obj_t alarm_delete(mp_obj_t self_in) {
    mp_obj_alarm_t *self = self_in;

    self->handler = NULL;
    self->handler_arg = NULL;
    uint32_t state = MICROPY_BEGIN_ATOMIC_SECTION();
    if (self->heap_index != -1) {
        remove_alarm(self->heap_index);
    }
    MICROPY_END_ATOMIC_SECTION(state);
    return mp_const_none;
}

And here the fix for mods/machpin.c:

STATIC void set_pin_callback_helper(mp_obj_t self_in, mp_obj_t handler, mp_obj_t handler_arg) {
    pin_obj_t *self = self_in;
    if (handler == mp_const_none) {
        self->handler = NULL;
        self->handler_arg = NULL;
        return;
    }

    self->handler = handler;

    if (handler_arg == mp_const_none) {
        handler_arg = self_in;
    }

    self->handler_arg = handler_arg;
}

In both cases just the link to handler and handler argument in the data structures is reset, allowing gc.collect() to dispose the previously linked objects.

robert-hh commented 7 years ago

Last not least, here's the python variant which the same as the above example without changing the firmware:

import gc
import utime
from machine import Timer

def timerCallback(self):
    print("Timer callback with context", self.value)

class Object:

    def __init__(self, count=0):
        print ("Object INIT ", self)
        self._timer = Timer.Alarm(timerCallback, None, ms=500, periodic=False, arg=self)

    def setValue(self, value):
        self.value = value

    def Stop(self):
        self._timer.callback(timerCallback, arg=0)
        self._timer.cancel()

def run():
    gc.collect()
    now = gc.mem_free()
    print("Start memory %d" %(now))
    SLEEP = 1.0
    for j in range(200):
        obj = Object(100)
        obj.setValue(j)

        utime.sleep(SLEEP)
        obj.Stop()
        #del obj
        gc.collect()
        then = gc.mem_free()
        print("\t\tmemory %d - %d" %(now, now - then))
        now = then
run()

The trick is the double call of _timer.callback() with a constant value (not None!) as arg followed by timer.cancel() in the method Stop(), which results in the same as the firmware change above. The first call frees the back-link to the object instance, the second call clears the link to the callback, making the timer object free to be discarded. And since the Object instance is provided as argument to the callback function, this has access to the Object instance internals. Not nice, but doable. With the change above the Python code looks a little bit more elegant.

g0hww commented 7 years ago

Hi. I'm having this issue with leaks. Amongst other code, I have used one-shot Timer.Alarm callbacks to turn the LED off after it has been turned on by one or more worker threads, other callbacks, or by one of the repeating Timer.Alarms.

I've not been sure about the cause of the leaks, but suspected the Timer.Alarms, as it has been a failure to allocate those which has been the first signs of impending doom. Interestingly, the running worker threads and repeating TimerAlarms have mostly continued to function for many hours after the oneshot Timer.Alarms could no longer be allocated.

I changed my main thread's idle loop, that simply garbage collected and slept, to hammer the following led_flash() method as fast as possible, and it can bring the lopy down to a point where it can't allocate 8192 bytes in 170 seconds. Without the mad LED flashing, it can stay up for an hour or two.

   def led_flash(self, value):
        with self.led_lock:
            if None != self.led_timer:
                self.led_timer.callback(self.led_cb, arg =0)
                self.led_timer.cancel()
                self.led_timer = None
            pycom.rgbled(value)
            self.led_timer = Timer.Alarm(self.led_cb, s=0.2,
                                         arg=False, periodic=False)

    def led_cb(self, state=False):
        with self.led_lock:
            if None != self.led_timer:
                self.led_timer.callback(self.led_cb, arg =0)
                self.led_timer.cancel()
                self.led_timer = None
            pycom.heartbeat(state)
        gc.collect()

The loop in main.py contains this:

            while True == node.is_running():
                node.led_flash(0xFF00FF)
                utime.sleep_ms(100)
                node.led_flash(0xFFCC99)
                utime.sleep_ms(500)
                gc.collect()

As you can see, I've attempted to follow the proposed solution described above, but it doesn't seem to be doing the trick for me. There may well be other leaks too, but it seems that the oneshot Timer.Alarm might be the biggy.

robert-hh commented 7 years ago

The canceling of the callback must not be done in the callback itself. That's why I added the stop method. And the callbacks must not reside in a class. About your code: Is is intentional that you call node.led_flash() a second time before the first timer expires?

danicampora commented 7 years ago

Thanks @robert-hh, we have this fixed and a new release along with the sources will be published tomorrow. We used a mix of @robert-hh's code and the approach we used on the cc3200 port. The leaks are now gone and you can have the callbacks inside a class. In short you'll be able to write the code as you wish without memory issues happening.

g0hww commented 7 years ago

@robert-hh Yes, that was intentional as, in accordance to Murphy's Law, my pair of LoPys more often than not choose to beacon at almost the same time, creating a rapid LED transition between red and green without it turning off in between, so cancelling the first callback was a good thing to test when hammering the LEDs to check for leaks.

@danicampora Cool beans :)

robert-hh commented 7 years ago

It looks like last night's update fixes the issues brought up by @syndycat . Maybe @g0hww can check with his code too.

g0hww commented 7 years ago

It looking good here :) Yesterday I temporarily replaced the use of the oneshot Timer.Alarm with a sleep and direct invocation of the callback, and monitored the freemem and allocmem for over 100k secs of uptime across both devices. The results were very stable.

Now I'm using this code:

    def led_flash(self, value):
        with self.led_lock:
            if self.led_timer != None:
                self.led_timer.cancel()
                self.led_timer = None
            pycom.rgbled(value)
            self.led_timer = Timer.Alarm(self.led_cb, s=0.1,
                                         arg=False, periodic=False)

    def led_cb(self, state=False):
        with self.led_lock:
            if self.led_timer != None:
                pycom.heartbeat(state)
                self.led_timer = None

With the head-ache inducing LED flashing torture test in my main loop, like so:

            while node.is_running():
                node.led_flash(0xFF00FF)
                utime.sleep_ms(50)
                node.led_flash(0xFFCC99)
                utime.sleep_ms(500)
                gc.collect()

and with over 12000 secs of uptime, I've seen no sign of any memory leakage, compared with the previous time to allocation failure of 170 secs when the bug was present. I'd say this is fixed. Thanks all :)

danicampora commented 7 years ago

Thanks guys, solved here: https://github.com/pycom/pycom-micropython-sigfox/commit/01cc316bff96fd42a8560f3de76c8fd11ce84c0c