pythonicrubyist / creek

Ruby library for parsing large Excel files.
http://rubygems.org/gems/creek
MIT License
388 stars 109 forks source link

Excessive Memory Allocation Loading Worksheet Rows #111

Closed zacheryph closed 1 year ago

zacheryph commented 1 year ago

I am noticing excessive temporary memory allocation loading a worksheet. I cannot share said spreadsheet currently. This particular spreadsheet is 5.8MB, single sheet, 50K records. I cannot see anything obvious within rows causing this but it just seems a bit excessive for a 6MB file ;)

Reason I bring this up is it's killing kubernetes pods for us for memory usage when running. I am attempting to bump the limits to get around it.

Anything I can do to help I'm glad to.

Note, memory usage is from Docker stats

Base Rails Console Memory: 168.3MiB Load Book (book = Creek::Book.new("report.xlsx")): 305.8MiB Force a GC (GC.start): 210.6MiB Lookup First Row (book.sheets.first.rows.first): 1.247GiB Force a GC (GC.start): 222.9MiB

zacheryph commented 1 year ago

This also appears to be new to 2.6.2. I do not see this happening in 2.5.3 so I have rolled back our update.

pythonicrubyist commented 1 year ago

How many records are on that sheet?

zacheryph commented 1 year ago

50K.

With this particular spreadsheet, with Creek 2.5.3 memory usage (watching docker stats) never exceeds ~ 450-500M but with Creek 2.6.2 memory usage peaks at 1.1-1.2GB

md5 commented 1 year ago

We started seeing OOM killer issues in containerized workloads after upgrading to creek-2.6.2 as well and were able to alleviate the issue by downgrading to 2.5.3. Thanks for the report @zacheryph!

I believe the issue was introduced as part of #101 in this commit: 494ed0505930ef2a70c26204bf8231228b6aa963

Specifically, it's repeatedly building "#{prefix}row", "#{prefix}c", "#{prefix}v", and "#{prefix}t" strings for every node that is read. I think that instead of doing that, it needs to construct row_selector, cell_selector, value_selector, and text_selector strings only when prefix is first set to a non-'' value (with those new variables defaulting to the non-prefixed names like 'row').

While we're micro-optimizing, I think we probably don't want [value_selector, text_selector].include?(node.name) either, since that's constructing an array on every iteration. Directly checking node.name == value_selector || node.name == text_selector would be more performant.