pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Possible memory leak bug with `to_sql` with chunksize? #26569

Closed Santosh-Gupta closed 5 years ago

Santosh-Gupta commented 5 years ago

Code Sample, a copy-pastable example if possible

from sqlalchemy import create_engine

# dataframe is the ~6-million-row DataFrame described below
engine = create_engine("sqlite:///databasefile.db")
dataframe.to_sql("CS_table", engine, chunksize=100)

Problem description

I created a large DataFrame in pandas, about 6 million rows of text data. I wanted to save it as a SQL database file, but when I try to save it, I get an out-of-memory RAM error. I even reduced the chunksize to 100 and it still crashes.

However, if I take a smaller version of that dataframe with 100,000 rows and save it to a database with no chunksize specified, I have no issues saving that dataframe.

My understanding was that since it is only processing 100 rows at a time, the RAM usage should reflect a save of only 100 rows. Is there something else happening behind the scenes? Perhaps multi-threading?

Before I run this code, I am using 4.8 GB of the 12.8 GB of RAM available in Google Colab. Running the above code eats up all the RAM until the environment crashes.

I would like to be able to save my pandas dataframe to a SQL file without my environment crashing. The environment I am in is Google Colab. The pandas dataframe is 2 columns, ~6 million rows. Each cell contains about this much text:

"The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data."

I posted a Stack Overflow question here:

https://stackoverflow.com/questions/56369565/large-6-million-rows-pandas-df-causes-memory-error-with-to-sql-when-chunksi

And one of the users said

From stepping through the code I think it's this line, which creates a bunch of DataFrames:

https://github.com/pandas-dev/pandas/blob/a91da0c94e541217865cdf52b9f6ea694f0493d3/pandas/io/sql.py#L676

chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])

Which looks like it's probably a bug. Specifically, this happens prior to the database insertion, during preparation.
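
For context, the logic around that line (the insert method of SQLTable in pandas/io/sql.py) looks roughly like the simplified sketch below; the names are paraphrased from that file, and the transaction handling and actual insert call are abstracted away here.

# Simplified paraphrase of the chunked insert loop in pandas/io/sql.py;
# not the exact implementation.
def insert_in_chunks(data_list, nrows, chunksize, insert_rows):
    """data_list holds one NumPy array per column; insert_rows performs the DB insert."""
    chunks = (nrows // chunksize) + 1
    for i in range(chunks):
        start_i = i * chunksize
        end_i = min((i + 1) * chunksize, nrows)
        if start_i >= end_i:
            break
        # The quoted line: slice every column for this chunk and zip the slices
        # into row tuples for an executemany-style insert.
        chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
        insert_rows(list(chunk_iter))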

Expected Output

The code should be executed without a crash.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.79+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 3.6.4
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.3.0
pyarrow: 0.13.0
xarray: 0.11.3
IPython: 5.5.0
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.5.3
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: 0.4.0
matplotlib: 3.0.3
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.3.3
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10.1
s3fs: 0.2.1
fastparquet: None
pandas_gbq: 0.4.1
pandas_datareader: 0.7.0
gcsfs: None

WillAyd commented 5 years ago

Do you have any way of profiling?

Santosh-Gupta commented 5 years ago

I am not very well versed in profiling methods but I am more than happy to figure it out. Do you have any recommended methods? If it makes a difference, I am using Google Colab.

Also, I would be happy to provide the notebook I am using. It is self-contained, and downloads the data directly from my Google Drive, so all you have to do is press run and then go to the final cell to check the SQL code.

Santosh-Gupta commented 5 years ago

Here is the notebook I am using, with the to_sql code at the bottom:

https://drive.google.com/open?id=1j7JnDlmIP5QE9lwLuQo-z3jABElaesZN

The code needs my data from my Google Drive folder. I could not find a way to let a new user directly download all the contents of my folder, so I am just going to share the folder with this link:

https://drive.google.com/open?id=1m6JfoIEIcX74CFSIQArZmSd0A8d0IRG8

Import this into your Drive (it should not take up any room in your Drive since I am hosting it).

Then run all the cells in the notebook. Warning: this will ask for your Google Drive credentials in the first block of code.

It's possible for me to make a minimal example, but it's tricky because I need to make the dataframe big enough to fail with to_sql, yet small enough to still fit inside Colab.

Let me know if there's anything else I can do to make looking into this issue easier.

WillAyd commented 5 years ago

Here's an example of profiling if you can isolate the issue in your code:

https://github.com/pandas-dev/pandas/issues/25893#issuecomment-477208844

Santosh-Gupta commented 5 years ago

Not quite sure how to use that

I awkwardly tried

import os
import psutil

process = psutil.Process(os.getpid())
i = 0

def sink():
    global i
    i += 1
    if i % 100 == 0:
        mem = process.memory_info().rss / 1e6
        print("mem %fMB" % mem)

while True:
    dfAllT.to_sql("CS_table2", engine, chunksize=100)
    sink()

and

import os
import psutil

process = psutil.Process(os.getpid())
i = 0

def sink():
    global i
    i += 1
    if i % 100 == 0:
        mem = process.memory_info().rss / 1e6
        print("mem %fMB" % mem)

while True:
    # note: to_sql returns None, so the chained .rolling(4000).max() would fail
    # even if the call completed
    dfAllT.to_sql("CS_table2", engine, chunksize=100).rolling(4000).max()
    sink()

But neither produced any output; my environment crashed without printing anything.
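
(The snippets above only call sink() after a full to_sql pass over the whole dataframe, so the environment likely crashes before the first print.) A minimal sketch of watching memory per chunk instead, assuming psutil is installed and dfAllT / engine are defined as above:

# Write the dataframe in explicit 100-row slices so RSS can be printed
# between inserts (sketch; assumes dfAllT and engine from the snippets above).
import os
import psutil

process = psutil.Process(os.getpid())
chunksize = 100

for start in range(0, len(dfAllT), chunksize):
    dfAllT.iloc[start:start + chunksize].to_sql(
        "CS_table2", engine, if_exists="append", index=False)
    if (start // chunksize) % 100 == 0:
        print("rows %d: rss %.1f MB"
              % (start, process.memory_info().rss / 1e6))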

TomAugspurger commented 5 years ago

https://pypi.org/project/memory-profiler/ for memory profiling.

http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports for making minimal examples.
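
A minimal sketch of using memory-profiler here, assuming the dfAllT dataframe and engine from the snippets above: wrap the to_sql call and sample the process memory while it runs.

# Sketch of memory-profiler usage for a single to_sql call; assumes the dfAllT
# dataframe and engine from the snippets above.
from memory_profiler import memory_usage

def write():
    dfAllT.to_sql("CS_table2", engine, chunksize=100, if_exists="replace")

# Sample the process memory every 0.5 s while write() runs; values are in MiB.
samples = memory_usage((write, (), {}), interval=0.5)
print("peak memory: %.1f MiB" % max(samples))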


Santosh-Gupta commented 5 years ago

I believe the issue is not with pandas but with SQLAlchemy. I used something like `for i in range(100): df.iloc[i * 100000:(i + 1) * 100000].to_sql(...)` as an alternative way of chunking, but the RAM still built up until it crashed. So perhaps there is some sort of memory leak in SQLAlchemy.

Here's a Stack Overflow link for anyone interested in continuing to monitor this issue:

https://stackoverflow.com/questions/56369565/large-6-million-rows-pandas-df-causes-memory-error-with-to-sql-when-chunksi

Here is a minimal example for anyone wanting to replicate the issue:

import string

import numpy as np
import pandas as pd
import sqlalchemy
from random import randint

def make_random_str_array(size=10, num_rows=100, chars=string.ascii_uppercase + string.digits):
    # Build num_rows random fixed-width strings by viewing a flat array of
    # single characters as a '|U<size>' array.
    return (np.random.choice(list(chars), num_rows*size)
            .view('|U{}'.format(size)))

def alt(size, num_rows):
    # Two-column DataFrame of random strings, each roughly `size` characters long.
    data = make_random_str_array(size, num_rows=2*num_rows).reshape(-1, 2)
    return pd.DataFrame(data)

# Build a large DataFrame (~3.3 million rows) of long random strings.
dfAll = alt(randint(1000, 2000), 10000)

for i in range(330):
    print('step ', i)
    df = alt(randint(1000, 2000), 10000)
    dfAll = pd.concat([df, dfAll])

engine = sqlalchemy.create_engine('sqlite:///testtt.db')

# Insert the DataFrame in 10,000-row slices, cycling through it repeatedly.
for i in range(500):
    print('step', i)
    dfAll.iloc[(i % 330) * 10000:((i % 330) + 1) * 10000].to_sql(
        'test_table22', engine, index=False, if_exists='append')

Santosh-Gupta commented 5 years ago

I raised this with SQLAlchemy and they said it's unlikely the leak is happening there, so the issue might be in the SQLite driver or in pandas.

It seems that, under certain conditions, there is a memory leak with repeated insertions into SQLite via SQLAlchemy.

I had a hard time replicating, in a minimal example, the memory leak that occurred when converting my data, but this gets pretty close. With my data, it would result in a memory crash no matter what I did; in a self-contained minimal example, I had to play around with the parameters to trigger the leak.

The minimal example above, when run in a Google Colab CPU environment, leaks memory, although not severely enough to cause a crash.

The database itself isn't causing the memory leak: I can restart my environment, the previously inserted data is still there, and connecting to that database doesn't cause an increase in memory. The issue seems to occur, under certain conditions, with repeated insertions via a to_sql loop, or with a single to_sql call with chunksize specified.
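
A sketch of that check, assuming psutil and the testtt.db / test_table22 produced by the minimal example above: after a fresh restart, reopen the existing database and confirm that connecting and running a count query do not inflate the process memory by themselves.

# Sketch of the check described above (assumes psutil and the testtt.db
# file produced by the minimal example): reconnecting to the existing
# database should not noticeably increase RSS.
import os
import psutil
import sqlalchemy

process = psutil.Process(os.getpid())
print("before connect: %.1f MB" % (process.memory_info().rss / 1e6))

engine = sqlalchemy.create_engine("sqlite:///testtt.db")
with engine.connect() as conn:
    n_rows = conn.execute("SELECT COUNT(*) FROM test_table22").scalar()

print("rows already stored:", n_rows)
print("after connect and count: %.1f MB" % (process.memory_info().rss / 1e6))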

Santosh-Gupta commented 5 years ago

Turns out it's a Python issue

https://stackoverflow.com/questions/56418825/repeated-insertions-into-sqlite-database-via-sqlalchemy-causing-memory-leak/56511487?noredirect=1#comment99616175_56511487