Closed: Santosh-Gupta closed this issue 5 years ago.
Do you have any way of profiling?
I am not very well versed in profiling methods but I am more than happy to figure it out. Do you have any recommended methods? If it makes a difference, I am using Google Colab.
Also, I would be happy to provide the notebook I am using. It is self-contained, and downloads the data directly from my Google Drive, so all you have to do is press run and then go to the final cell to check the SQL code.
Here is the notebook I am using, with the to_sql code at the bottom.
https://drive.google.com/open?id=1j7JnDlmIP5QE9lwLuQo-z3jABElaesZN
The code needs my data from my Google Drive folder. I could not find a way to let a new user directly download all the contents of my folder, so I am just going to share the folder with this link
https://drive.google.com/open?id=1m6JfoIEIcX74CFSIQArZmSd0A8d0IRG8
Import this into your drive (it should not take up any room in your drive, since I am hosting it).
Then run all the cells in the notebook. Warning: the first block of code will ask for your Google Drive credentials.
It's possible for me to make a minimal example, but it's tricky: the dataframe needs to be big enough to fail with to_sql, yet small enough to still fit inside Colab.
Let me know if there's anything else I can do to make looking into this issue easier.
Here's an example of profiling if you can isolate the issue in your code:
https://github.com/pandas-dev/pandas/issues/25893#issuecomment-477208844
I'm not quite sure how to use that, so I awkwardly tried:
import os
import psutil

process = psutil.Process(os.getpid())
i = 0

def sink():
    global i
    i += 1
    if i % 100 == 0:
        mem = process.memory_info().rss / 1e6
        print("mem %fMB" % mem)

while True:
    dfAllT.to_sql("CS_table2", engine, chunksize=100)
    sink()
and
process = psutil.Process(os.getpid())
i = 0

def sink():
    global i
    i += 1
    if i % 100 == 0:
        mem = process.memory_info().rss / 1e6
        print("mem %fMB" % mem)

while True:
    dfAllT.to_sql("CS_table2", engine, chunksize=100).rolling(4000).max()
    sink()
But neither produced any output; my environment crashed without printing anything.
https://pypi.org/project/memory-profiler/ for memory profiling.
http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports for making minimal examples.
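If installing memory-profiler or psutil isn't an option, a rough per-chunk probe can be built from the standard library alone. The sketch below is my own illustration, not code from this thread: sqlite3 stands in for the SQLAlchemy engine, and tracemalloc stands in for an RSS probe (tracemalloc only sees Python-level allocations, not memory held by C extensions, so treat it as a lower bound):

```python
import sqlite3
import tracemalloc

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b TEXT)")

tracemalloc.start()
samples = []
for step in range(1, 11):
    # Build and insert one chunk of rows, then sample traced memory.
    rows = [(str(i), "x" * 100) for i in range(1000)]
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    current, _peak = tracemalloc.get_traced_memory()
    samples.append(current)
    print("step %d: %.2f MB currently traced" % (step, current / 1e6))
tracemalloc.stop()
conn.close()
```

A steadily growing series of samples across steps, with the per-step workload held constant, is what would point at a leak.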
I believe the issue is not with pandas but with SQLAlchemy. I used something like

for i in range(100):
    df.iloc[i * 100000:(i + 1) * 100000].to_sql(...)

as an alternative method for chunking, but the RAM still built up and crashed. So perhaps there is some sort of memory leak in SQLAlchemy.
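As a stopgap workaround (my own sketch, not something suggested in this thread), the write can bypass pandas/SQLAlchemy entirely: feed row tuples from a generator into the standard-library sqlite3 driver in fixed-size batches, so at most one batch is ever materialized in memory. The table name and two-text-column schema below are placeholders:

```python
import sqlite3

def insert_in_chunks(rows, db_path, table, chunk_size=10000):
    """Insert an iterable of 2-tuples into `table` batch by batch,
    keeping at most chunk_size rows in memory at once."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS %s (col0 TEXT, col1 TEXT)" % table)
    buf, total = [], 0
    for row in rows:
        buf.append(row)
        if len(buf) >= chunk_size:
            conn.executemany("INSERT INTO %s VALUES (?, ?)" % table, buf)
            conn.commit()
            total += len(buf)
            buf.clear()
    if buf:  # flush the final partial batch
        conn.executemany("INSERT INTO %s VALUES (?, ?)" % table, buf)
        conn.commit()
        total += len(buf)
    count = conn.execute("SELECT COUNT(*) FROM %s" % table).fetchone()[0]
    conn.close()
    return total, count

# Rows come from a generator, so the full dataset never sits in RAM at once.
inserted, in_db = insert_in_chunks(
    ((str(i), str(i * 2)) for i in range(25000)), ":memory:", "test_table")
print(inserted, in_db)  # 25000 25000
```

Peak memory here is proportional to chunk_size, not to the total number of rows, which is the behavior the chunksize argument to to_sql was expected to provide.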
Here's a Stack Overflow link for anyone interested in continuing to monitor this issue.
Here is a minimal example for anyone wanting to replicate the issue:
import string
import random
from random import randint

import numpy as np
import pandas as pd

def make_random_str_array(size=10, num_rows=100, chars=string.ascii_uppercase + string.digits):
    return (np.random.choice(list(chars), num_rows*size)
            .view('|U{}'.format(size)))

def alt(size, num_rows):
    data = make_random_str_array(size, num_rows=2*num_rows).reshape(-1, 2)
    dfAll = pd.DataFrame(data)
    return dfAll

dfAll = alt(randint(1000, 2000), 10000)
for i in range(330):
    print('step ', i)
    data = alt(randint(1000, 2000), 10000)
    df = pd.DataFrame(data)
    dfAll = pd.concat([df, dfAll])

import sqlalchemy
from sqlalchemy import create_engine

engine = sqlalchemy.create_engine('sqlite:///testtt.db')
for i in range(500):
    print('step', i)
    dfAll.iloc[(i % 330)*10000:((i % 330) + 1)*10000].to_sql('test_table22', engine, index=False, if_exists='append')
I mentioned this on the SQLAlchemy side, and they said it's unlikely the problem is happening there, so the issue might be in the SQLite driver or in pandas.
It seems that, under certain conditions, there is a memory leak with repeated insertions to SQLite via SQLAlchemy.
I had a hard time replicating, in a minimal example, the memory leak that occurred when converting my data, but this gets pretty close. With my data, it resulted in a memory crash no matter what I did; for a self-contained minimal example, I had to play around with the parameters to cause the leak.
The code above, when run in Google Colab's CPU environment, has a leak, although it's not severe enough to cause a crash.
The database itself isn't causing the memory leak: I can restart my environment, the previously inserted data is still there, and connecting to that database doesn't cause an increase in memory. The issue seems to arise, under certain conditions, from repeated insertions via looping to_sql, or from one to_sql call with chunksize specified.
Code Sample, a copy-pastable example if possible
Problem description
I created a large dataframe in Pandas, about 6 million rows of text data. I wanted to save this as a SQL database file, but when I try to save it, I get an out-of-memory RAM error. I even reduced the chunk size to 100 and it still crashes.
However, if I take a smaller version of that dataframe, with 100,000 rows, and save it to a database with no chunksize specified, I have no issues saving that dataframe.
My understanding was that since it's only processing 100 rows at a time, the RAM usage should reflect a save of 100 rows. Is there something else happening behind the scenes? Perhaps multi-threading?
Before I run this code, I am using 4.8 GB of the 12.8 GB RAM available in Google Colab. Running the above code eats up all the RAM until the environment crashes.
I would like to be able to save my pandas dataframe to a SQL file without my environment crashing. The environment I am in is Google Colab. The pandas dataframe is 2 columns, ~6 million rows. Each cell contains about this much text:
I posted a SO issue here
https://stackoverflow.com/questions/56369565/large-6-million-rows-pandas-df-causes-memory-error-with-to-sql-when-chunksi
And one of the users said:
From stepping through the code, I think it's this line, which creates a bunch of DataFrames:
https://github.com/pandas-dev/pandas/blob/a91da0c94e541217865cdf52b9f6ea694f0493d3/pandas/io/sql.py#L676
chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
Which looks like it's probably a bug. Specifically this happens prior to database insertion, in preparation.
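To make the behavior of that line concrete, here's a toy, standard-library-only illustration (mine, not pandas code) of what `zip(*[arr[start_i:end_i] for arr in data_list])` does: it slices each column for the chunk and then yields one tuple per row, so every chunk materializes a fresh set of intermediate row objects before anything reaches the database:

```python
# Columns stored side by side, mimicking pandas' per-column data_list.
data_list = [list(range(10)),              # column 0
             [str(i) for i in range(10)]]  # column 1

start_i, end_i = 2, 5  # one chunk's row range
chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])

# Materializing the iterator shows one tuple per row in the chunk.
rows = list(chunk_iter)
print(rows)  # [(2, '2'), (3, '3'), (4, '4')]
```

If those per-chunk intermediates (or references to them) were not released between chunks, memory would grow with the number of chunks even though each chunk is small, which matches the symptom reported above.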
Expected Output
The code should execute without a crash.
Output of pd.show_versions()