Open arsalansufi opened 4 weeks ago
- Dominion - No bottlenecks.
- Hart - Loads all of the file names and metadata into memory when unzipping zip folders. The zip folders in this CSV implementation contain a LOT of very small files, so this can increase memory by up to a couple hundred MB for a 1.5GB zip file (containing 8 sub-zip files, each with 100k+ files). I think this is OK; we would need an insanely huge zip for it to be problematic.
- ES&S - As previously identified, this part of the code loads the entire ballots file into memory to sort it, which is very expensive in terms of memory (it seems to increase memory by much more than the file size, actually). To fix it, we would have to write sorted chunks of the file to temporary files and then merge-sort those chunks (see the sketch after this list). Doable but not trivial.
- Clear Ballot - Have not prioritized testing, but quickly glancing at the code, I think it's likely fine.
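A minimal sketch of that external-sort approach, assuming rows can be sorted by a key pulled from each line; the function names, chunk size, and key handling here are illustrative, not what's currently in the parser:

```python
# Illustrative external merge sort for a large CSV: sort fixed-size chunks in memory,
# spill each sorted chunk to a temp file, then lazily merge the sorted chunks.
# Chunk size and key handling are assumptions, not the current Arlo implementation.
import csv
import heapq
import tempfile
from typing import Callable, Iterator, List


def external_sort_csv(
    path: str, key: Callable[[List[str]], str], chunk_size: int = 100_000
) -> Iterator[List[str]]:
    chunk_files = []
    with open(path, newline="") as input_file:
        reader = csv.reader(input_file)
        chunk: List[List[str]] = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                chunk_files.append(_write_sorted_chunk(chunk, key))
                chunk = []
        if chunk:
            chunk_files.append(_write_sorted_chunk(chunk, key))

    # heapq.merge keeps only one row per chunk in memory at a time.
    yield from heapq.merge(*(csv.reader(f) for f in chunk_files), key=key)


def _write_sorted_chunk(chunk: List[List[str]], key: Callable[[List[str]], str]):
    chunk.sort(key=key)
    temp_file = tempfile.TemporaryFile(mode="w+", newline="")
    csv.writer(temp_file).writerows(chunk)
    temp_file.seek(0)
    return temp_file
```

Peak memory is then bounded by one chunk of rows plus one row per chunk during the merge, rather than the whole file.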
Passing off to @eventualbuddha to investigate the ES&S issue further.
I investigated resource usage for processing ES&S CVR & ballot files. I used @carolinemodic's generated files that had about 500K CVRs:
```
~/Downloads/arlo-scale-testing-11-2024
❯ cat ballotsbig.csv | wc -l
2611511
~/Downloads/arlo-scale-testing-11-2024
❯ cat cvrbig.csv | wc -l
522303
```
It is indeed slow and memory intensive. Using a manually-created Jurisdiction and set of uploads, I ran the following script to measure the time and memory usage:
```python
#!/usr/bin/env python
import os
import time

os.environ["FLASK_ENV"] = "development"

from server.api.cvrs import parse_ess_cvrs
from server.models import Jurisdiction

jurisdiction_id = "9eae837e-37b2-4ef7-ac4a-5977c403608e"
jurisdiction = Jurisdiction.query.get(jurisdiction_id)

start_time = time.time()
(metadata, cvrs) = parse_ess_cvrs(jurisdiction, "/tmp/arlo-perf-test")

count = 0
try:
    for cvr in cvrs:
        count += 1
finally:
    end_time = time.time()
    print("Parsed %d CVRs in %s seconds" % (count, end_time - start_time))

    # print rss memory usage
    import resource

    print("Memory usage: %s (kb)" % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
```
I also watched the process with `htop` and added some more granular logs in the `parse_ess_cvrs` function. At peak, this used about 3.4GB of RAM. It also took about 26 seconds to reach the point where it starts matching entries, which is where it crashes due to a mismatch in the data between the files.
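Something along these lines works for the granular logging (a sketch, not the exact diff I used; the interval is arbitrary):

```python
# Sketch: log max RSS every N items while iterating (ru_maxrss is KB on Linux, bytes on macOS).
import resource


def log_memory_every(iterable, interval=50_000):
    for i, item in enumerate(iterable):
        if i % interval == 0:
            max_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            print("processed %d items, ru_maxrss=%s" % (i, max_rss))
        yield item
```

Wrapping `cvrs` in the script above with `log_memory_every(cvrs)` shows where the memory jumps happen.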
This isn't great, but it should be within what a worker dyno can handle as we've configured them. In practice, the largest set of ES&S CVRs we're likely to see is about 180K records, or about 1/3 of this "big" test fixture. With a RAM floor of about 130MB for our Python process, that means RAM utilization of somewhere around 1.3GB.
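Back-of-the-envelope version of that estimate (assuming memory above the ~130MB floor scales roughly linearly with CVR count; the numbers are just the ones from this test):

```python
# Rough scaling estimate: assume memory above the baseline grows linearly with CVR count.
baseline_gb = 0.13       # idle Python worker process
measured_gb = 3.4        # peak RSS with ~520K CVRs
measured_cvrs = 520_000
expected_cvrs = 180_000  # largest ES&S upload we expect in practice

estimate_gb = baseline_gb + (measured_gb - baseline_gb) * (expected_cvrs / measured_cvrs)
print(round(estimate_gb, 1))  # ~1.3
```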
I also did a spike using Rust to process these files. It's about 20x faster (1.2s) and uses about 13x less RAM (250MB), and I wasn't even trying to make it efficient. There are huge gains to be had by moving away from Python for anything CPU-bound.
Unassigned myself so @arsalansufi can re-prioritize.
https://votingworks.slack.com/archives/CKCVA0F9S/p1730309143498049