pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.41k stars 17.83k forks source link

Read_csv leaks memory when used in multiple threads #19941

Open mrocklin opened 6 years ago

mrocklin commented 6 years ago

When using read_csv in threads it appears that the Python process leaks a little memory.

This is coming from this dask-focused stack overflow question: https://stackoverflow.com/questions/48954080/why-is-dask-read-csv-from-s3-keeping-so-much-memory

I've reduced it to a problem with pandas.read_csv and a concurrent.futures.ThreadPoolExecutor

Code Sample, a copy-pastable example if possible

# imports
import pandas as pd
import numpy as np
import time
import psutil
from concurrent.futures import ThreadPoolExecutor

# prep
process = psutil.Process()
e = ThreadPoolExecutor(8)
# prepare csv file, only need to run once
pd.DataFrame(np.random.random((100000, 50))).to_csv('large_random.csv')
# baseline computation making pandas dataframes with threasds.  This works fine

def f(_):
    return pd.DataFrame(np.random.random((1000000, 50)))

print('before:', process.memory_info().rss // 1e6, 'MB')
list(e.map(f, range(8)))
time.sleep(1)  # let things settle
print('after:', process.memory_info().rss // 1e6, 'MB')
before: 57.0 MB
after: 56.0 MB
# example with read_csv, this leaks memory
print('before:', process.memory_info().rss // 1e6, 'MB')
list(e.map(pd.read_csv, ['large_random.csv'] * 8))
time.sleep(1)  # let things settle
print('after:', process.memory_info().rss // 1e6, 'MB')
before: 58.0 MB
after: 323.0 MB

Output of pd.show_versions()

In [2]: pd.show_versions() INSTALLED VERSIONS ------------------ commit: None python: 3.6.2.final.0 python-bits: 64 OS: Linux OS-release: 4.13.0-26-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 pandas: 0.22.0 pytest: 3.3.2 pip: 9.0.1 setuptools: 38.4.0 Cython: 0.28a0 numpy: 1.14.1 scipy: 0.19.0 pyarrow: 0.8.0 xarray: 0.8.2-264-g0b2424a IPython: 5.1.0 sphinx: 1.6.5 patsy: 0.4.1 dateutil: 2.6.1 pytz: 2017.3 blosc: 1.5.1 bottleneck: None tables: 3.3.0 numexpr: 2.6.2 feather: None matplotlib: 2.0.0 openpyxl: 2.4.1 xlrd: 1.0.0 xlwt: 1.2.0 xlsxwriter: 0.9.6 lxml: 3.7.2 bs4: 4.5.3 html5lib: None sqlalchemy: 1.2.1 pymysql: None psycopg2: 2.7.1 (dt dec pq3 ext lo64) jinja2: 2.10 s3fs: 0.0.9 fastparquet: 0.1.4 pandas_gbq: None pandas_datareader: None
chris-b1 commented 6 years ago

I was able to repro the original SO question, but not your example, for what that's worth - only difference seems to be I'm on win64.

print('before:', process.memory_info().rss // 1e6, 'MB')
list(e.map(pd.read_csv, ['large_random.csv'] * 8))
time.sleep(1)  # let things settle
print('after:', process.memory_info().rss // 1e6, 'MB')
before: 125.0 MB
after: 129.0 MB

pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.1.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None pandas: 0.22.0 pytest: 3.2.1 pip: 9.0.1 setuptools: 38.5.1 Cython: 0.25.2 numpy: 1.14.0 scipy: 1.0.0 pyarrow: 0.7.1 xarray: 0.9.6 IPython: 6.1.0 sphinx: 1.6.3 patsy: 0.5.0 dateutil: 2.6.1 pytz: 2017.3 blosc: None bottleneck: 1.2.1 tables: 3.4.2 numexpr: 2.6.2 feather: None matplotlib: 2.1.2 openpyxl: 2.4.10 xlrd: 1.0.0 xlwt: None xlsxwriter: 0.9.6 lxml: 3.8.0 bs4: 4.6.0 html5lib: 0.999 sqlalchemy: 1.1.11 pymysql: None psycopg2: 2.7.1 (dt dec pq3 ext lo64) jinja2: 2.9.6 s3fs: 0.1.1 fastparquet: 0.1.0 pandas_gbq: None pandas_datareader: 0.5.0
jeremycg commented 6 years ago

Thanks a million for tracking this down (I was the asker of the original SO question)!

I can repeat this on my setup: before: 67.0 MB after: 66.0 MB before: 66.0 MB after: 297.0 MB

But, it looks like this is not the problem in the original question - using this modification on my real data, it fixes the problem with dask:

initial memory: 68.98046875 data in memory: 11390.87109375 data frame usage: 11079.813480377197 After function call: 11649.90234375

I'll make a reproducible example, and file this against dask

pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.5.3.final.0 python-bits: 64 OS: Linux OS-release: 4.9.77-31.58.amzn1.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 pandas: 0.20.1 pytest: None pip: 9.0.1 setuptools: 27.2.0 Cython: 0.26 numpy: 1.12.1 scipy: 0.19.0 xarray: None IPython: 6.1.0 sphinx: None patsy: None dateutil: 2.6.0 pytz: 2017.2 blosc: None bottleneck: None tables: 3.3.0 numexpr: 2.6.2 feather: 0.4.0 matplotlib: None openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: 0.9999999 sqlalchemy: 1.1.9 pymysql: None psycopg2: 2.7.1 (dt dec pq3 ext lo64) jinja2: 2.9.6 s3fs: 0.1.2 pandas_gbq: None pandas_datareader: None
birdsarah commented 6 years ago

I am using read_parquet (via dask) and also have things lurking around in memory.

mrocklin commented 6 years ago

@birdsarah it would be useful to isolate this problem to either dask or pandas by running your computation under both the single-threaded scheduler and the multi-threaded scheduler

dask.set_options(get=dask.local.get_sync)
dask.set_options(get=dask.threaded.get)

And then measure the amount of memory that your process is taking up

import psutil
psutil.Process().memory_info().rss

If it takes up a fair amount of memory when using the threaded scheduler but not when using the single threaded scheduler then I think it would be likely that we could isolate this to Pandas memory management.

mrocklin commented 6 years ago

Assuming you have time of course, which I realize may be a large assumption.

birdsarah commented 6 years ago

Think this worked. Here's a gist. Threaded seems to take up more memory. https://gist.github.com/birdsarah/ea0b4978f25f0bb1e2389cd04b4bf287

kuraga commented 6 years ago

I don't know if it's the same but:

# 400 MB RAM usage (in htop, all the system)
import pandas as pd
import gc
df = pd.read_csv('df.csv')
# 2 GB
del df
gc.collect()
# 1.15 GB
mrocklin commented 6 years ago

Thanks @kuraga . This example would be more useful if people here could reproduce it easily, ideally without downloading a particular file. Are you able to create a self-contained example, similar to the example given in the original post that demonstrates this issue?

kuraga commented 6 years ago

@mrocklin Hm... Seems like I've found a magic line...

with open('df.csv', 'wt') as f:
    f.write('item_id,user_id,region,city,parent_category_name,category_name,param_1,param_2,param_3,title,description,price,item_seq_number,activation_date,user_type,image,image_top_1,deal_probability')
    for n in range(4000):
        f.write("""ba83aefab5dc,91e2f88dd6e3,Ростовская область,Ростов-на-Дону,Бытовая электроника,Аудио и видео,"Видео, DVD и Blu-ray плееры",,,Philips bluray,"В хорошем состоянии, домашний кинотеатр с blu ray, USB. Если настроить, то работает смарт тв /
Торг",4000.0,9,2017-03-20,Private,b7f250ee3f39e1fedd77c141f273703f4a9be59db4b48a8713f112c67e29bb42,3032.0,0.43177""")
import pandas as pd
df = pd.read_csv('df.csv')

import gc
del df
gc.collect()

And reading is slow...

INSTALLED VERSIONS ------------------ commit: None python: 3.6.3.final.0 python-bits: 64 OS: Linux OS-release: 4.14.19-calculate machine: x86_64 processor: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz byteorder: little LC_ALL: None LANG: ru_RU.utf8 LOCALE: ru_RU.UTF-8 pandas: 0.23.0 pytest: None pip: 10.0.1 setuptools: 39.1.0 Cython: None numpy: 1.14.3 scipy: 1.1.0 pyarrow: None xarray: None IPython: 6.4.0 sphinx: None patsy: 0.5.0 dateutil: 2.7.3 pytz: 2018.4 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: 2.2.2 openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: 0.9999999 sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None
little-eyes commented 6 years ago

same problem running in a docker container to load 14gb data, however, it exceeds my 64gb memory limit very quickly..

vanerpool commented 6 years ago

also have same problem as @little-eyes Docker + 12GB data

# 80 MB RAM usage
import pandas as pd
import gc
df = pd.read_csv('df.csv')
# 12.6 GB
del df
gc.collect()
# 6.1 GB

pandas: 0.23.1, docker: 17.12.1-ce

birdsarah commented 6 years ago

@mrocklin, I was playing with this to see if I could track anything further down.

I noticed that if I run without multithreading, I still appear to get a memory leak:

    process = psutil.Process()
    print('before:', process.memory_info().rss // 1e6, 'MB')
    for i in range(8):
        pd.read_csv(test_data, engine='python')
    time.sleep(2)
    print('after:', process.memory_info().rss // 1e6, 'MB')

(test_data is the csv written to disk by your original code)

Result 1 - engine='python':

before: 71.0 MB
after: 113.0 MB

Result 2 - engine='c':

before: 72.0 MB
after: 119.0 MB

This is on Linux (Fedora)

$ conda list pandas

# Name                    Version                   Build  Channel
pandas                    0.23.2           py36h04863e7_0 

Edit: This may be nothing. If I wait longer and garbage collect it seems to clear up.

birdsarah commented 6 years ago

Relevant discussion: https://github.com/dask/dask/issues/3530

setting MALLOC_MMAP_THRESHOLD_=16384 results in a significant improvement using the original code that @mrocklin posted.