[Open] mrocklin opened this issue 6 years ago
When using `read_csv` in threads it appears that the Python process leaks a little memory. This is coming from this dask-focused Stack Overflow question: https://stackoverflow.com/questions/48954080/why-is-dask-read-csv-from-s3-keeping-so-much-memory

I've reduced it to a problem with `pandas.read_csv` and a `concurrent.futures.ThreadPoolExecutor`.
I was able to repro the original SO question, but not your example, for what that's worth. The only difference seems to be that I'm on win64:
```python
import time
import psutil
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

process = psutil.Process()
e = ThreadPoolExecutor()

print('before:', process.memory_info().rss // 1e6, 'MB')
list(e.map(pd.read_csv, ['large_random.csv'] * 8))  # file from the original example
time.sleep(1)  # let things settle
print('after:', process.memory_info().rss // 1e6, 'MB')
```

```
before: 125.0 MB
after: 129.0 MB
```
Thanks a million for tracking this down (I was the asker of the original SO question)!
I can repeat this on my setup:

```
before: 67.0 MB  after: 66.0 MB
before: 66.0 MB  after: 297.0 MB
```
But it looks like this is not the problem in the original question: using this modification on my real data fixes the problem with dask:
```
initial memory: 68.98046875
data in memory: 11390.87109375
data frame usage: 11079.813480377197
After function call: 11649.90234375
```
I'll make a reproducible example and file this against dask.
I am using `read_parquet` (via dask) and also have things lurking around in memory.
@birdsarah it would be useful to isolate this problem to either dask or pandas by running your computation under both the single-threaded scheduler and the threaded scheduler:

```python
import dask
import dask.local
import dask.threaded

dask.set_options(get=dask.local.get_sync)  # single-threaded scheduler
dask.set_options(get=dask.threaded.get)    # threaded scheduler
```
And then measure the amount of memory that your process is taking up:

```python
import psutil
psutil.Process().memory_info().rss  # resident set size, in bytes
```
If it takes up a fair amount of memory when using the threaded scheduler but not when using the single-threaded scheduler, then it's likely we could isolate this to pandas memory management.
Assuming you have time of course, which I realize may be a large assumption.
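Putting those two pieces together, a minimal sketch of the comparison, assuming the `dask.set_options` API of this era and with `ddf` as a hypothetical stand-in for the real dask computation:

```python
import dask
import dask.local
import dask.threaded
import psutil

def measure(get):
    # Run the same computation under the given scheduler and report
    # this process's resident set size afterwards.
    dask.set_options(get=get)
    ddf.compute()  # `ddf` is a placeholder for the real dask collection
    print(get.__module__, psutil.Process().memory_info().rss // 1e6, 'MB')

measure(dask.local.get_sync)  # single-threaded scheduler
measure(dask.threaded.get)    # threaded scheduler
```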
I think this worked; here's a gist. Threaded seems to take up more memory: https://gist.github.com/birdsarah/ea0b4978f25f0bb1e2389cd04b4bf287
I don't know if it's the same, but:

```python
# 400 MB RAM usage (per htop, for the whole system)
import pandas as pd
import gc

df = pd.read_csv('df.csv')
# 2 GB
del df
gc.collect()
# 1.15 GB
```
Thanks @kuraga. This example would be more useful if people here could reproduce it easily, ideally without downloading a particular file. Are you able to create a self-contained example, similar to the one given in the original post, that demonstrates this issue?
@mrocklin Hm... Seems like I've found a magic line...

```python
with open('df.csv', 'wt') as f:
    f.write('item_id,user_id,region,city,parent_category_name,category_name,'
            'param_1,param_2,param_3,title,description,price,item_seq_number,'
            'activation_date,user_type,image,image_top_1,deal_probability\n')
    for n in range(4000):
        # Note the embedded newline inside the quoted description field
        f.write("""ba83aefab5dc,91e2f88dd6e3,Ростовская область,Ростов-на-Дону,Бытовая электроника,Аудио и видео,"Видео, DVD и Blu-ray плееры",,,Philips bluray,"В хорошем состоянии, домашний кинотеатр с blu ray, USB. Если настроить, то работает смарт тв /
Торг",4000.0,9,2017-03-20,Private,b7f250ee3f39e1fedd77c141f273703f4a9be59db4b48a8713f112c67e29bb42,3032.0,0.43177
""")

import pandas as pd
import gc

df = pd.read_csv('df.csv')
del df
gc.collect()
```
And reading is slow...
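To quantify what that snippet retains, here is a minimal sketch, assuming `psutil` is installed and `df.csv` was written as above, that measures RSS before the read, after the read, and after `del df` plus `gc.collect()`:

```python
import gc
import psutil
import pandas as pd

proc = psutil.Process()
print('before:', proc.memory_info().rss // 1e6, 'MB')
df = pd.read_csv('df.csv')
print('loaded:', proc.memory_info().rss // 1e6, 'MB')
del df
gc.collect()
print('after :', proc.memory_info().rss // 1e6, 'MB')
```

If the "after" number stays well above the "before" number, the memory is being retained somewhere below Python, which matches the allocator behaviour discussed later in this thread.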
Same problem running in a Docker container loading 14 GB of data; it exceeds my 64 GB memory limit very quickly.
Also have the same problem as @little-eyes: Docker + 12 GB of data.
```python
# 80 MB RAM usage
import pandas as pd
import gc

df = pd.read_csv('df.csv')
# 12.6 GB
del df
gc.collect()
# 6.1 GB
```
pandas: 0.23.1, docker: 17.12.1-ce
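The pattern above (RSS dropping only partway after `del` plus `gc.collect()`) is consistent with memory being held by the C allocator rather than by Python objects. A hedged diagnostic, assuming Linux with glibc: ask the allocator to return free heap memory to the OS via `malloc_trim` and see whether RSS drops further.

```python
import ctypes
import psutil

libc = ctypes.CDLL('libc.so.6')  # glibc only; not available on e.g. musl or macOS
print('before trim:', psutil.Process().memory_info().rss // 1e6, 'MB')
libc.malloc_trim(0)  # release free memory at the top of the heap back to the OS
print('after trim: ', psutil.Process().memory_info().rss // 1e6, 'MB')
```

If RSS drops after the trim, the memory was sitting free inside the allocator rather than being leaked by pandas.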
@mrocklin, I was playing with this to see if I could track anything down further.
I noticed that if I run without multithreading, I still appear to get a memory leak:
```python
import time
import psutil
import pandas as pd

process = psutil.Process()
print('before:', process.memory_info().rss // 1e6, 'MB')
for i in range(8):
    pd.read_csv(test_data, engine='python')
time.sleep(2)
print('after:', process.memory_info().rss // 1e6, 'MB')
```
(`test_data` is the csv written to disk by your original code.)
Result 1, `engine='python'`:

```
before: 71.0 MB
after: 113.0 MB
```

Result 2, `engine='c'`:

```
before: 72.0 MB
after: 119.0 MB
```
This is on Linux (Fedora).

```
$ conda list pandas
# Name                    Version                   Build    Channel
pandas                    0.23.2           py36h04863e7_0
```
Edit: This may be nothing. If I wait longer and garbage collect it seems to clear up.
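A minimal sketch of that check, forcing a full collection instead of only waiting, reusing `test_data` from the original code:

```python
import gc
import time
import psutil
import pandas as pd

process = psutil.Process()
print('before:', process.memory_info().rss // 1e6, 'MB')
for i in range(8):
    pd.read_csv(test_data, engine='python')  # test_data as in the original code
gc.collect()  # force a collection rather than waiting for one
time.sleep(2)
print('after:', process.memory_info().rss // 1e6, 'MB')
```

If "after" returns close to "before" here, the growth was uncollected garbage rather than a true leak.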
Relevant discussion: https://github.com/dask/dask/issues/3530
Setting `MALLOC_MMAP_THRESHOLD_=16384` results in a significant improvement with the original code that @mrocklin posted.
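For anyone trying this: glibc reads `MALLOC_MMAP_THRESHOLD_` from the environment at startup, so it must be set before the interpreter launches (e.g. `MALLOC_MMAP_THRESHOLD_=16384 python script.py`, where `script.py` is a placeholder). A hedged sketch of the runtime equivalent, using glibc's `mallopt` via ctypes (Linux/glibc only; the parameter constant comes from glibc's `malloc.h`):

```python
import ctypes

libc = ctypes.CDLL('libc.so.6')
M_MMAP_THRESHOLD = -3  # mallopt parameter number, from glibc's malloc.h
# Allocations at or above 16 KiB are then served by mmap and unmapped on free,
# instead of growing the arena heap, where freed blocks can be retained.
libc.mallopt(M_MMAP_THRESHOLD, 16384)
```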