marimo-team / marimo

A reactive notebook for Python — run reproducible experiments, execute as a script, deploy as an app, and version with git.
https://marimo.io
Apache License 2.0

Memory leaks when rerunning cell that opens a file #2815

Closed · wasimsandhu closed this 2 weeks ago

wasimsandhu commented 2 weeks ago

Describe the bug

When I rerun a cell that opens a large file (around 500 MB in this case), memory usage increases with each rerun. The screenshots below were taken after successive reruns.

[Three screenshots showing memory usage climbing after each rerun]

Environment

{
  "marimo": "0.9.15",
  "OS": "Darwin",
  "OS Version": "23.4.0",
  "Processor": "i386",
  "Python Version": "3.12.4",
  "Binaries": {
    "Browser": "130.0.6723.117",
    "Node": "v20.18.0"
  },
  "Dependencies": {
    "click": "8.1.7",
    "docutils": "0.21.2",
    "itsdangerous": "2.2.0",
    "jedi": "0.19.1",
    "markdown": "3.7",
    "narwhals": "1.13.2",
    "packaging": "24.1",
    "psutil": "6.1.0",
    "pygments": "2.18.0",
    "pymdown-extensions": "10.12",
    "pyyaml": "6.0.2",
    "ruff": "0.4.4",
    "starlette": "0.41.2",
    "tomlkit": "0.13.2",
    "typing-extensions": "4.12.2",
    "uvicorn": "0.32.0",
    "websockets": "12.0"
  },
  "Optional Dependencies": {
    "pandas": "2.2.3"
  }
}

Code to reproduce

with open("file.json") as _file:
    _data = json.load(_file)
_df = pd.DataFrame(_data["results"])
mo.ui.table(_df)
akshayka commented 2 weeks ago

Whoa, interesting. If you change all the local variables to global variables (i.e., remove the underscore prefixes), do you still see the leak?
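
For concreteness, that's the same cell with the underscore prefixes dropped:

import json
import marimo as mo
import pandas as pd

with open("file.json") as file:
    data = json.load(file)
df = pd.DataFrame(data["results"])
mo.ui.table(df)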

We previously had a leak with locals, which I thought I fixed but maybe I missed an edge case.

akshayka commented 2 weeks ago

Another thing to try if you don't mind: Can you remove the last line (mo.ui.table(_df)) and check if that gets rid of the leak?

wasimsandhu commented 2 weeks ago

Tried both your suggestions (first changing to global variables, then removing the last line). Unfortunately, the memory leak is still reproducible.

akshayka commented 2 weeks ago

Will look into it. Weird request, but last try: Can you remove the line that reads data into a dataframe?
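
That would shrink the cell to just parsing the JSON, which isolates whether the leak comes from reading the file itself:

import json

with open("file.json") as _file:
    _data = json.load(_file)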

akshayka commented 2 weeks ago

I tried to reproduce it but couldn't: RAM fluctuated, but it didn't increase without bound. I would consider this a leak if, by repeatedly rerunning the cell, you can drive memory usage arbitrarily high (can you?).

The operating system doesn't always reclaim memory used by a Python process, even after Python's garbage collection runs. So after a large allocation is "freed" (the variable goes out of scope, or an explicit del), the Python process's memory usage may still be higher than it originally was. When I experimented just now with code similar to yours, this was the kind of behavior I saw.
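
Here's a minimal standalone sketch of this effect (not marimo-specific; psutil is already in your environment) that reads the process's resident set size around a large allocation. Exact numbers will vary by platform and allocator:

import gc
import psutil

proc = psutil.Process()  # the current Python process

def rss_mb():
    return proc.memory_info().rss / 1e6

print(f"baseline:       {rss_mb():.0f} MB")
data = ["x" * 1000 for _ in range(500_000)]  # roughly 500 MB of small strings
print(f"after alloc:    {rss_mb():.0f} MB")
del data
gc.collect()
print(f"after del + gc: {rss_mb():.0f} MB")  # often stays well above baseline

CPython's allocator can also hold freed memory in its own pools for reuse, so an elevated RSS by itself isn't proof of a leak; unbounded growth across reruns is.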

wasimsandhu commented 2 weeks ago

Yes, the memory usage continues to increase. After rerunning 30 times, memory usage climbs from 1 GB to 2.5 GB and does not fluctuate back down. I even added an explicit delete and a garbage collection pass, to no avail.
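
Roughly what the cell with explicit cleanup looked like (a sketch; the table line was already removed per your earlier suggestion):

import gc
import json

import pandas as pd

with open("file.json") as _file:
    _data = json.load(_file)
_df = pd.DataFrame(_data["results"])
del _data, _df  # explicit delete
gc.collect()    # explicit garbage collection; memory still not released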

This has made my work somewhat tedious: this file is one of many I load in the notebook, and I constantly have to restart the kernel because of the memory usage.

As for your second note, I suppose the best way to test that would be to try to reproduce similar behavior in a Jupyter notebook? I don't have time at the moment, but thanks for looking into this issue so promptly!

wasimsandhu commented 2 weeks ago

Actually you might be right. Maybe there is a ceiling to how high the memory can increase. When loading a much larger dataset (around 15 GB), I noticed that my memory usage spikes up pretty high after rerunning but then drops back down after a while.

Feel free to close if you can't reproduce.