votingworks / arlo


Avoid loading large files from S3 into memory #2028

Open · arsalansufi opened this issue 4 weeks ago

arsalansufi commented 4 weeks ago

https://votingworks.slack.com/archives/CKCVA0F9S/p1730309143498049
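For context, the gist of this issue is to stream large S3 objects to disk (or process them incrementally) rather than reading them fully into memory. A minimal sketch of the streaming-download side, assuming boto3; the function name and bucket/key arguments are placeholders, not Arlo's actual code:

import tempfile

import boto3

s3 = boto3.client("s3")

def download_to_temp_file(bucket, key):
    # Stream the S3 object to disk in chunks via download_fileobj, so peak
    # memory stays small regardless of the object's size, and return the
    # temp file path for downstream processing.
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        s3.download_fileobj(bucket, key, temp_file)
        return temp_file.name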

carolinemodic commented 3 weeks ago

Dominion - No bottlenecks.

Hart - Loads all of the file names and metadata into memory when unzipping zip folders. The zip folders in this CSV implementation contain a LOT of very small files, so this can increase memory usage by up to a couple hundred MB for a 1.5GB zip file (containing 8 sub-zip files, each containing 100k+ files). I think this is OK; we would need an insanely huge zip for it to be problematic.

ES&S - As previously identified, this part of the code loads the entire ballots file into memory to sort it, which is very expensive in terms of memory (it seems to increase memory usage by much more than the file size). To solve it we would have to write chunks of the file to temporary files and do a merge sort across them (see the sketch after this list). Doable but not trivial.

Clear Ballot - Have not prioritized testing, but quickly glancing at the code I think it's likely fine.
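For reference, here's a rough sketch of the temp-file merge sort approach mentioned for ES&S above. It assumes a plain CSV input; the function names, chunk size, and sort key are illustrative, not the actual implementation:

import csv
import heapq
import tempfile

def external_sort_csv(path, key_column, chunk_size=100_000):
    # Sort a large CSV on disk: sort fixed-size chunks of rows, write each
    # sorted chunk to a temp file, then lazily merge the sorted chunks so
    # only one row per chunk is held in memory at a time.
    chunk_files = []
    with open(path, newline="") as input_file:
        chunk = []
        for row in csv.reader(input_file):
            chunk.append(row)
            if len(chunk) >= chunk_size:
                chunk_files.append(write_sorted_chunk(chunk, key_column))
                chunk = []
        if chunk:
            chunk_files.append(write_sorted_chunk(chunk, key_column))

    chunk_readers = [csv.reader(open(f, newline="")) for f in chunk_files]
    yield from heapq.merge(*chunk_readers, key=lambda row: row[key_column])

def write_sorted_chunk(chunk, key_column):
    # Sort one in-memory chunk and spill it to a temp file, returning its path.
    chunk.sort(key=lambda row: row[key_column])
    with tempfile.NamedTemporaryFile(
        "w", delete=False, newline="", suffix=".csv"
    ) as temp_file:
        csv.writer(temp_file).writerows(chunk)
        return temp_file.name

With this kind of approach, memory scales with the chunk size rather than with the full ballots file.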

carolinemodic commented 3 weeks ago

Passing off to @eventualbuddha to investigate the ES&S issue further.

eventualbuddha commented 2 weeks ago

I investigated resource usage for processing ES&S CVR & ballot files. I used @carolinemodic's generated files that had about 500K CVRs:

~/Downloads/arlo-scale-testing-11-2024 
❯ cat ballotsbig.csv | wc -l
 2611511

~/Downloads/arlo-scale-testing-11-2024 
❯ cat cvrbig.csv | wc -l
  522303

It is indeed slow and memory intensive. Using a manually-created Jurisdiction and set of uploads, I ran the following script to measure the time and memory usage:

#!/usr/bin/env python

import os
import time

os.environ["FLASK_ENV"] = "development"

from server.api.cvrs import parse_ess_cvrs
from server.models import Jurisdiction

jurisdiction_id = "9eae837e-37b2-4ef7-ac4a-5977c403608e"
jurisdiction = Jurisdiction.query.get(jurisdiction_id)

start_time = time.time()
(metadata, cvrs) = parse_ess_cvrs(jurisdiction, "/tmp/arlo-perf-test")

count = 0
try:
    for cvr in cvrs:
        count += 1
finally:
    end_time = time.time()
    print("Parsed %d CVRs in %s seconds" % (count, end_time - start_time))

    # print rss memory usage
    import resource

    print("Memory usage: %s (kb)" % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

I also watched the process with htop and added some more granular logging in the parse_ess_cvrs function. At peak, it used about 3.4GB of RAM. The run eventually crashed due to a mismatch in the data between the files, but it took about 26 seconds just to reach the point where it starts matching entries.

This isn't great, but it should be within what a worker dyno can handle as we've configured them. In practice, the largest set of ES&S CVRs we're likely to see is about 180K records, or about 1/3 of this "big" test fixture. With a RAM floor of about 130MB for our Python process, that puts RAM utilization somewhere around 1.3GB (roughly a third of the ~3.4GB peak on top of the ~130MB floor).

I also did a spike of using Rust to process these files. It's about 20x faster (1.2s) and uses about 13x less RAM (250MB), and I wasn't even trying to make it efficient. There are huge gains to be had by moving away from Python for anything CPU-bound.

eventualbuddha commented 2 weeks ago

Unassigned myself so @arsalansufi can re-prioritize.