palewire / django-calaccess-campaign-browser

A Django app to refine, review and republish campaign finance data drawn from the California Secretary of State’s CAL-ACCESS database
http://django-calaccess-campaign-browser.californiacivicdata.org
MIT License

`build_campaign_finance` leaks memory #8

Closed aboutaaron closed 10 years ago

aboutaaron commented 10 years ago

Problem

According to @armendariz, the management command takes forever to run and terminates when it hits about 4 million records. So we need to find a way to remove the bottleneck.

Hypothesis

My initial thought is that the management command runs several functions in succession that shove a ton of data into MySQL via the Django ORM. As a result, these processes are costly and end up eating a ton of memory, since MySQL, Django and Python are all firing on all cylinders at once.

Solutions

Python is supposed to be garbage collected automatically, but perhaps memory isn't being released as promptly as we'd like. One hacky way to get around this is to use the gc module to force Python to run a collection once the function's work is done:

import gc

def load_data():
    # <truncated code ...>
    gc.collect()  # force a full collection pass once the load is done

More on Stack Overflow: "How can I explicitly free memory in Python?"

This may be incorrect, but this was my initial thought.
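For what it's worth, here is a minimal sketch of what gc.collect() actually does in CPython. Most objects are freed by reference counting as soon as nothing points at them, and the collector only matters for reference cycles. The `Node` class and both function names below are hypothetical and exist only for the demonstration. They are not part of this app.

```python
import gc


class Node:
    """Hypothetical stand-in for an object that ends up in a reference cycle."""

    def __init__(self):
        self.other = None


def make_cycles(count):
    """Create `count` pairs of objects that point at each other, then drop them."""
    for _ in range(count):
        a, b = Node(), Node()
        # Each object keeps the other alive, so reference counting
        # alone cannot free them once the local names go away.
        a.other, b.other = b, a


def collect_after_cycles(count=1000):
    """Return how many unreachable objects gc.collect() finds after making cycles."""
    gc.collect()   # start from a clean slate
    gc.disable()   # keep the automatic collector out of the experiment
    try:
        make_cycles(count)
        # gc.collect() returns the number of unreachable objects it found.
        return gc.collect()
    finally:
        gc.enable()
```

If the loader doesn't create many cycles, a gc.collect() at the end of the function probably won't change much, which would point the search elsewhere.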

aboutaaron commented 10 years ago

Also worth checking: http://blog.gingerlime.com/2011/django-memory-leaks-part-ii/

aboutaaron commented 10 years ago

Also worth mentioning: the commands finished for me without any problems.

armendariz commented 10 years ago

I'm using gc.collect(), I'm calling django.db.reset_queries(), and now I'm iterating over querysets using this snippet: https://djangosnippets.org/snippets/1949/

Not sure why my process gets killed. I just started running it again. Fingers crossed. Glad it worked for you though, Aaron.
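A rough sketch of the chunked-iteration idea, for anyone following along. This is not a copy of the linked snippet, just the general pattern under some assumptions: rows have an integer `pk` attribute, and `fetch_chunk` is a hypothetical callable standing in for the ORM query. The `Row` tuple is only there so the example runs without a database.

```python
import gc
from collections import namedtuple

# Hypothetical stand-in for a model instance; real rows would come from the ORM.
Row = namedtuple("Row", "pk")


def iterate_in_chunks(fetch_chunk, chunk_size=1000):
    """Yield rows one at a time while holding only one chunk in memory.

    `fetch_chunk(last_pk, chunk_size)` is assumed to return a list of up to
    `chunk_size` rows with pk > last_pk, ordered by pk. With the Django ORM
    it might look like:
        lambda last_pk, n: list(qs.filter(pk__gt=last_pk).order_by("pk")[:n])
    """
    last_pk = 0  # assumes positive integer primary keys
    while True:
        chunk = fetch_chunk(last_pk, chunk_size)
        if not chunk:
            break
        for row in chunk:
            yield row
        last_pk = chunk[-1].pk
        # Between chunks is a natural place to release memory. In a Django
        # command you could also call django.db.reset_queries() here, which
        # clears the query log Django keeps when DEBUG is True.
        gc.collect()
```

Usage with fake data:

```python
rows = [Row(pk) for pk in range(1, 26)]

def fetch(last_pk, n):
    return [r for r in rows if r.pk > last_pk][:n]

for row in iterate_in_chunks(fetch, chunk_size=10):
    pass  # process one row at a time
```

The point of filtering on `pk` rather than slicing with an offset is that each query starts where the last one ended, so no single query has to materialize the whole table.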

aboutaaron commented 10 years ago

I think you solved this @armendariz. Feel free to reopen the issue if you think otherwise.

armendariz commented 10 years ago

@aboutaaron Thanks! It is solved. Slowly figuring out the GitHub thing.