jetperch / pyjoulescope

Joulescope driver and utilities
https://www.joulescope.com
Apache License 2.0

Slow memory leak #23

Closed: imbuedhope closed this issue 4 months ago

imbuedhope commented 3 years ago

I've been hunting a memory leak that came to my attention after running my code for over 24 hours, and I got stuck after isolating it to the joulescope module. I can consistently reproduce the behavior. I've been testing with 9 Joulescopes (plugged into a Raspberry Pi 4 with 4 GB of RAM); it may be a lot less apparent with fewer devices.

The code that I managed to reproduce the behavior with is as follows.

#!/usr/bin/env python3

import time
import joulescope

def _handler(*args, **kwargs):
    pass

if __name__ == "__main__":
    scopes = []
    sampling_frequency = 'auto'
    buffer_duration = 1

    try:
        while True:
            scopes, added, removed = joulescope.scan_for_changes(
                name='Joulescope',
                devices=scopes,
                config='auto',
            )

            for dev in added:
                # start logging data
                dev.parameter_set('sampling_frequency', sampling_frequency)
                dev.parameter_set('buffer_duration', buffer_duration)

                dev.statistics_callback_register(
                    _handler,
                    'sensor'
                )

                try:
                    dev.open()
                except Exception:
                    continue

                print(
                    f'Started logging Joulescope-{dev.device_serial_number}',
                )

            for dev in removed:
                print(
                    f'Joulescope-{dev.device_serial_number} disconnected',
                )

            for dev in scopes:
                try:
                    dev.status()
                except Exception:
                    continue

            time.sleep(0.1)
    except KeyboardInterrupt:
        print('Closing gracefully...')

        for scope in scopes:
            scope.close()

After starting the process, I tracked %MEM usage with ps at a few random intervals, with the following results.

(venv) $ ./isolate.py > /dev/null &
[1] 15889
(venv) $ date && ps -p 15889 -o %cpu,%mem,cmd
Thu Mar 25 05:15:17 UTC 2021
%CPU %MEM CMD
20.8  7.4 python3 ./isolate.py
(venv) $ ps -p 15889 -o %cpu,%mem,cmd
%CPU %MEM CMD
20.6  7.4 python3 ./isolate.py
(venv) $ ps -p 15889 -o %cpu,%mem,cmd
%CPU %MEM CMD
20.2  7.7 python3 ./isolate.py
(venv) $ date && ps -p 15889 -o %cpu,%mem,cmd
Thu Mar 25 05:18:56 UTC 2021
%CPU %MEM CMD
20.1  7.7 python3 ./isolate.py
(venv) $ date && ps -p 15889 -o %cpu,%mem,cmd
Thu Mar 25 05:21:30 UTC 2021
%CPU %MEM CMD
20.0  7.9 python3 ./isolate.py
(venv) $ date && ps -p 15889 -o %cpu,%mem,cmd
Thu Mar 25 05:22:49 UTC 2021
%CPU %MEM CMD
20.0  8.0 python3 ./isolate.py
(venv) $ kill -SIGTERM 15889
[1]+  Terminated              ./isolate.py > /dev/null

In the code I was originally trying to debug, the last output from the script was as follows; we had 10 Joulescopes connected when this happened. (This output is from journalctl, since the script was launched with systemd.)

Mar 20 02:35:23 ubuntu overnight_logger[35289]: device thread hung: control_transfer_in - FORCE CLOSE
Mar 20 02:35:59 ubuntu overnight_logger[35289]: response queue not empty
Mar 20 02:41:50 ubuntu overnight_logger[35289]: device thread hung: control_transfer_in - FORCE CLOSE
Mar 20 02:42:01 ubuntu overnight_logger[35289]: device thread hung: control_transfer_in - FORCE CLOSE
Mar 20 02:42:23 ubuntu overnight_logger[35289]: response queue not empty
Mar 20 02:42:27 ubuntu overnight_logger[35289]: response queue not empty
Mar 20 02:43:56 ubuntu overnight_logger[35289]: device thread hung: control_transfer_in - FORCE CLOSE
Mar 20 02:44:23 ubuntu overnight_logger[35289]: response queue not empty
Mar 20 02:44:32 ubuntu overnight_logger[35289]: device thread hung: control_transfer_in - FORCE CLOSE
Mar 20 02:44:53 ubuntu overnight_logger[35289]: device thread hung: control_transfer_in - FORCE CLOSE
Mar 20 02:45:01 ubuntu overnight_logger[35289]: response queue not empty
Mar 20 02:45:19 ubuntu overnight_logger[35289]: response queue not empty
Mar 20 02:46:20 ubuntu overnight_logger[35289]: device thread hung: control_transfer_in - FORCE CLOSE
Mar 20 02:46:43 ubuntu overnight_logger[35289]: response queue not empty
Mar 20 02:48:04 ubuntu overnight_logger[35289]: device thread hung: control_transfer_in - FORCE CLOSE
Mar 20 02:48:17 ubuntu overnight_logger[35289]: response queue not empty
Mar 20 02:51:38 ubuntu overnight_logger[35289]: device thread hung: control_transfer_in - FORCE CLOSE
Mar 20 02:51:40 ubuntu overnight_logger[35289]: response queue not empty
Mar 20 03:06:06 ubuntu overnight_logger[35289]: device thread hung: control_transfer_in - FORCE CLOSE
Mar 20 03:06:11 ubuntu overnight_logger[35289]: response queue not empty

I'm not entirely sure how to isolate the leak further or what the underlying cause is. Any help resolving this would be very appreciated.

mliberty1 commented 3 years ago

Thank you @imbuedHope for the code to reproduce this issue. The code looks correct for gathering the on-instrument statistics.

I modified this code to add a gc.collect() call just before time.sleep(0.1), to make sure that the garbage collector was behaving.

I first ran this code on Windows with a single Joulescope attached. I found that Win 10 x64 initially leveled off at 48,828 kB of RAM, then dropped and stabilized at 19,960 kB.

However, memory usage under Ubuntu x64 continued to grow. I used your approach of printing the date and process size, but using:

date && ps -p 151963 -o %cpu,size,cmd

Here are the results:

Time     Memory (kB)
9:10:45  170340
9:11:58  173828
9:13:23  177800
9:16:18  185812
9:17:41  189852
9:20:44  198176
9:24:22  208412
9:30:48  226748

[plot: memory size over time, from the table above]

This could be exponential, but we'll need more data to be sure. Let's use mprof.

pip3 install -U memory_profiler

And then:

mprof run isolate.py

To display the results:

mprof plot --slope mprofile_{your_timestamp}.dat

Which gives:

[plot: mprof memory profile of isolate.py]

That's a nice, linear leak of 44 kB/s on average. Over 24 hours (86,400 s), that amounts to about 3.8 GB.

So, I confirm that the statistics data collection has a problem on Linux. I will continue to investigate.

mliberty1 commented 3 years ago

I did more testing.

Here is the mprof result from Win10 x64: [plot: isolate_win10x64]

And here is the mprof result from macOS 11 x64: [plot: isolate_mac11x64]

I further isolated the memory leak. It is not occurring in statistics(), but rather in scan. Here is an updated example:

#!/usr/bin/env python3

import time
import joulescope
import gc

if __name__ == "__main__":
    scopes = []
    try:
        while True:
            scopes, added, removed = joulescope.scan_for_changes(
                name='Joulescope',
                devices=scopes,
                config='auto',
            )
            gc.collect()
            time.sleep(0.1)
    except KeyboardInterrupt:
        pass

When run on Ubuntu x64, here is what mprof captures: [plot: isolate2]

The same Python code does not leak on Windows or macOS, so the leak is likely not in Python itself. The driver uses the same libusb interface code (via ctypes) on macOS and Linux, so the leak is likely not in the Joulescope driver's usage of libusb either. Therefore, the most likely location for this issue is within the Ubuntu implementation of libusb. Ugh.

In the meantime, the easiest workaround is to scan less often, especially since nothing should be changing. If scanning once a minute is sufficient, you reduce the leaked memory by a factor of 600 (60 s / 0.1 s).
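A minimal sketch of that workaround, based on the reproduction script above (the 60-second interval, the constant names, and the trimmed device setup are illustrative, not a prescribed implementation):

#!/usr/bin/env python3
# Sketch: poll status() every 0.1 s, but only rescan for device changes
# about once a minute to limit the leaking scan allocations.

import time
import joulescope

SCAN_INTERVAL = 60.0   # seconds between scans (illustrative)
POLL_INTERVAL = 0.1    # seconds between status() polls

if __name__ == "__main__":
    scopes = []
    last_scan = None

    try:
        while True:
            now = time.monotonic()
            if last_scan is None or now - last_scan >= SCAN_INTERVAL:
                last_scan = now
                scopes, added, removed = joulescope.scan_for_changes(
                    name='Joulescope',
                    devices=scopes,
                    config='auto',
                )
                for dev in added:
                    try:
                        dev.open()  # device setup (parameters, callbacks) omitted
                    except Exception:
                        continue
                for dev in removed:
                    print(f'Joulescope-{dev.device_serial_number} disconnected')

            for dev in scopes:
                try:
                    dev.status()
                except Exception:
                    continue

            time.sleep(POLL_INTERVAL)
    except KeyboardInterrupt:
        for scope in scopes:
            scope.close()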

mliberty1 commented 3 years ago

A possible solution is to scan only when required. Check out joulescope.usb.DeviceNotify and the libusb backend implementation.
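For instance, something along these lines might work (a rough sketch only: the DeviceNotify constructor argument, callback signature, and close() call are assumptions; check the libusb backend source for the actual interface):

#!/usr/bin/env python3
# Sketch: rescan only when the OS reports a USB device change,
# using joulescope.usb.DeviceNotify as a trigger.
# The DeviceNotify details below are assumptions, not a verified API.

import threading
import time
import joulescope
from joulescope.usb import DeviceNotify

scan_needed = threading.Event()
scan_needed.set()  # force one initial scan at startup

def on_device_change(*args, **kwargs):
    # Assumed: invoked on USB insertion/removal; used here only as a trigger.
    scan_needed.set()

if __name__ == "__main__":
    scopes = []
    notify = DeviceNotify(on_device_change)  # assumed constructor signature
    try:
        while True:
            if scan_needed.is_set():
                scan_needed.clear()
                scopes, added, removed = joulescope.scan_for_changes(
                    name='Joulescope',
                    devices=scopes,
                    config='auto',
                )
                for dev in added:
                    try:
                        dev.open()  # device setup omitted for brevity
                    except Exception:
                        continue

            for dev in scopes:
                try:
                    dev.status()
                except Exception:
                    continue

            time.sleep(0.1)
    except KeyboardInterrupt:
        notify.close()  # assumed cleanup method
        for scope in scopes:
            scope.close()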

mliberty1 commented 4 months ago

The joulescope_driver backend (the pyjoulescope v1 backend) uses hotplug events only and does not respond to scan. Even if there is still a problem with the Ubuntu libusb implementation, the leak should be much, much smaller. Also, reducing the scan frequency was an effective enough workaround.