online-ml / river

🌊 Online machine learning in Python
https://riverml.xyz
BSD 3-Clause "New" or "Revised" License

Pickle loaded model uses 10x the amount of RAM #1512

Closed jpfeil closed 6 months ago

jpfeil commented 6 months ago

Versions

river version: 0.21.0

Python version: 3.9.16

Describe your task

Loading a pretrained ARFClassifier

What kind of performance are you expecting?

I expected the memory footprint to match the model size. The loaded model reports the correct size, but loading it takes up all of my machine's RAM, so I'm looking for a way to free that memory after loading.

Steps/code to reproduce

import pickle
import psutil

with open("model.pkl", "rb") as f:
    m = pickle.load(f)

m._memory_usage
> '1.13 GB'

print("RAM Used (GB):", psutil.virtual_memory()[3] / 1e9)
> RAM Used (GB): 13.649047552


gbolmier commented 6 months ago

Hey @jpfeil, would you mind providing a minimal snippet reproducing the observed behaviour on a toy dataset? If possible, with a dataset from the datasets module.

jpfeil commented 6 months ago

Hi @gbolmier

I haven't tried it with the datasets data because I think you need a pretty large model to see this effect, so I made some synthetic data that should give you an idea of what is happening. Basically, dumping the model takes a lot of memory while the pickle is being created, but that memory is eventually released. Loading the model, however, spikes the memory and then never releases it, so I suspect a reference is keeping the pickle VM around longer than needed. But I'm not sure.

from sklearn.datasets import make_classification
from river.forest import ARFClassifier
from tqdm import tqdm
import pickle
import psutil

X, y = make_classification(n_samples=1000,
                           n_features=1000,
                           n_informative=800,
                           n_clusters_per_class=100)

model = ARFClassifier(n_models=300)

for i in tqdm(range(X.shape[0])):
    xi = {f"x{n}": v for n, v in enumerate(X[i])}
    yi = y[i]
    model.learn_one(xi, yi)

print(model._memory_usage)

with open("test.pkl", "wb") as f:
    pickle.dump(model, f)
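For reference, the on-disk size of the pickle can be checked before starting the new session, to compare against _memory_usage (standard library only):

import os

print("on-disk pickle size (MB):", os.path.getsize("test.pkl") / 1e6)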

Now start a new Python session:

import pickle
import psutil

initial_memory = psutil.virtual_memory().used
with open("test.pkl", "rb") as f:
    rmodel = pickle.load(f)

rmodel._memory_usage
> '739.71 MB'

final_memory = psutil.virtual_memory().used
print("RAM Used (GB):", (final_memory - initial_memory) / 1e9)
> RAM Used (GB): 5.968044032
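As an aside, since the goal is to free the memory after loading: on glibc-based Linux, forcing a garbage-collection pass and then asking the allocator to hand freed heap pages back to the OS can sometimes shrink the resident memory. A sketch (malloc_trim is glibc-specific, and whether it helps depends on heap fragmentation):

import ctypes
import gc

gc.collect()  # release any unreachable objects first

try:
    ctypes.CDLL("libc.so.6").malloc_trim(0)  # glibc only: return freed arenas to the OS
except OSError:
    pass  # not a glibc platform (e.g. macOS, musl)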

gbolmier commented 6 months ago

From what I could find online, I think this is actually due to how Python and the system manage memory, rather than anything specific to river.

See the example below (note that psutil.virtual_memory is system-wide, so I also added the current Python process's Resident Set Size (RSS)):

import pickle
import psutil
import sys
import random

from river.utils.pretty import humanize_bytes

def print_vmem_and_rss():
    vmem = humanize_bytes(psutil.virtual_memory().used)
    rss = humanize_bytes(psutil.Process().memory_info().rss)
    print(f"{vmem = } | {rss = }", end="\n\n")

print_vmem_and_rss()

print("1) Create a list of 100_000_000 random floats")
my_list = [random.random() for _ in range(100_000_000)]
print(f"{humanize_bytes(sys.getsizeof(my_list)) = }")
print_vmem_and_rss()

print("2) Dump `my_list` to disk")
pickle.dump(my_list, open("my_list.pickle", "wb"))
print_vmem_and_rss()

print("3) Load `my_list` from disk into `my_list2`")
my_list2 = pickle.load(open("my_list.pickle", "rb"))
print(f"{humanize_bytes(sys.getsizeof(my_list2)) = }")
print_vmem_and_rss()

which outputs:

vmem = '17.59 GB' | rss = '38.3 MB'

1) Create a list of 100_000_000 random floats
humanize_bytes(sys.getsizeof(my_list)) = '796.44 MB'
vmem = '20.85 GB' | rss = '3.09 GB'

2) Dump `my_list` to disk
vmem = '19.55 GB' | rss = '3.78 GB'

3) Load `my_list` from disk into `my_list2`
humanize_bytes(sys.getsizeof(my_list2)) = '785.06 MB'
vmem = '22.29 GB' | rss = '6.85 GB'

Despite sys.getsizeof reporting the list at under 800 MB, memory usage increases by ≅ 3 GB on my machine when creating the list or loading it from disk. Part of the gap is that sys.getsizeof only counts the list's array of pointers, not the float objects it points to, which at 24 bytes apiece add roughly 2.4 GB for 100_000_000 floats. Not sure whether other formats would perform better in Python. Curious if someone wants to run such experiments and report here.
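For illustration, a deep size can be approximated by adding the per-element sizes to the shallow sys.getsizeof of the container (a rough sketch; it assumes no elements are shared):

import random
import sys

my_list = [random.random() for _ in range(1_000_000)]  # smaller run, same idea

shallow = sys.getsizeof(my_list)  # the pointer array only
deep = shallow + sum(sys.getsizeof(x) for x in my_list)  # plus the float objects
print(f"shallow: {shallow / 1e6:.1f} MB | deep: {deep / 1e6:.1f} MB")
# 64-bit CPython: roughly 8 MB shallow vs ~32 MB deep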

jpfeil commented 6 months ago

Thanks, @gbolmier! Yeah, I wonder if a simple type change could improve memory efficiency. I'm not familiar with the implementation code, but perhaps using an array instead of a list where possible would lead to a significant improvement.
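For what it's worth, a quick sketch of the potential saving: the standard-library array module stores doubles unboxed at 8 bytes each, whereas a list of Python floats pays for an 8-byte pointer plus a 24-byte float object per element (approximate figures, 64-bit CPython):

import sys
from array import array

n = 1_000_000
as_list = [float(i) for i in range(n)]
as_array = array("d", as_list)

deep_list = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
print(f"list (incl. float objects): {deep_list / 1e6:.1f} MB")  # ~32 MB
print(f"array('d'):                 {sys.getsizeof(as_array) / 1e6:.1f} MB")  # ~8 MB

Whether river's tree internals could use arrays is a separate question, of course.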

jpfeil commented 6 months ago

Also, I don't see the memory difference when I train the model, only when I pickle it. If it were just the list usage, I should see the same memory usage during training, right? Training should look like case 1, but I don't see that with the river model; I see the expected memory usage.

jpfeil commented 6 months ago

I finally figured it out! The pickle VM sticks around to support memoization. Apparently, this is needed for recursive (self-referential) data structures. So, as long as river doesn't use self-references, you can set "fast" mode, which skips memoization, and the memory doesn't blow up.

with open("test-fast.pkl", "wb") as f:
    p = pickle.Pickler(f)
    p.fast = True
    p.dump(model)

I haven't tested whether this affects predictive performance, but this solves the memory issue.
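One caveat worth noting: in fast mode the pickler cannot handle self-referential objects. As far as I can tell, CPython's pickler raises a ValueError rather than recursing forever; a small demo:

import io
import pickle

cyclic = []
cyclic.append(cyclic)  # a list that contains itself

p = pickle.Pickler(io.BytesIO())
p.fast = True
try:
    p.dump(cyclic)
except ValueError as e:
    print(e)  # "fast mode: can't pickle cyclic objects ..."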

gbolmier commented 6 months ago

Glad to hear! Neat trick indeed 😁 (please be aware that Pickler.fast is deprecated, though)
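Since Pickler.fast is deprecated, a possibly more future-proof alternative is to pickle normally and then pass the result through pickletools.optimize, which drops PUT (memo) opcodes that are never referenced; for tree-structured models the unpickler's memo should then stay small. Untested on river models, but it's all standard library:

import pickle
import pickletools

raw = pickle.dumps(model)          # `model` as in the snippets above
slim = pickletools.optimize(raw)   # strip unused memo opcodes
with open("test-optimized.pkl", "wb") as f:
    f.write(slim)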

Also, I don't see the memory difference when I train the model, only when I pickle it. If it was just the list usage, then I should see the same memory usage when training, right?

This is because you're looking at the wrong memory-usage metric: psutil.virtual_memory is system-wide, whereas resident memory (RSS) measures the Python process's own RAM usage. See the increase with this code:

import pickle
import psutil

from river.forest import ARFClassifier
from river.utils.pretty import humanize_bytes
from sklearn.datasets import make_classification

def print_vmem_and_rss():
    vmem = humanize_bytes(psutil.virtual_memory().used)
    rss = humanize_bytes(psutil.Process().memory_info().rss)
    print(f"{vmem = } | {rss = }", end="\n\n")

print_vmem_and_rss()

print("1) Create a dataset of 1_000 samples by 1_000 features")
X, y = make_classification(
    n_samples=1000,
    n_features=1000,
    n_informative=800,
    n_clusters_per_class=100,
)
print_vmem_and_rss()

print("2) Instantiate an ARF classifier")
model = ARFClassifier(n_models=300)
print(f"{model._memory_usage = }")
print_vmem_and_rss()

print("3) Train the ARF classifier on the created dataset")
for i in range(X.shape[0]):
    xi = {f"x{n}": v for n, v in enumerate(X[i])}
    yi = y[i]
    model.learn_one(xi, yi)
print(f"{model._memory_usage = }")
print_vmem_and_rss()

print("4) Dump `model` to disk")
with open("test.pkl", "wb") as f:
    p = pickle.Pickler(f)
    p.fast = True
    p.dump(model)
print_vmem_and_rss()

print("5) Load `model` from disk into `model2`")
model2 = pickle.load(open("test.pkl", "rb"))
print(f"{model2._memory_usage = }")
print_vmem_and_rss()

which outputs:

vmem = '22.05 GB' | rss = '130.98 MB'

1) Create a dataset of 1_000 samples by 1_000 features
vmem = '22.18 GB' | rss = '269.94 MB'

2) Instantiate an ARF classifier
model._memory_usage = '0.99 MB'
vmem = '22.18 GB' | rss = '271.39 MB'

3) Train the ARF classifier on the created dataset
model._memory_usage = '1.06 GB'
vmem = '23.59 GB' | rss = '1.77 GB'

4) Dump `model` to disk
vmem = '23.01 GB' | rss = '1.77 GB'

5) Load `model` from disk into `model2`
model2._memory_usage = '1.12 GB'
vmem = '24.03 GB' | rss = '2.78 GB'

Note that after training, the process's RAM usage increased by ≅ 500 MB more than the model size.
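To see how much of the load-time spike comes from Python-level allocations (as opposed to allocator or OS behaviour), tracemalloc can report the current and peak traced memory around pickle.load; a small sketch:

import pickle
import tracemalloc

tracemalloc.start()
with open("test.pkl", "rb") as f:
    model2 = pickle.load(f)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current: {current / 1e9:.2f} GB | peak: {peak / 1e9:.2f} GB")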