pythonicrubyist / creek

Ruby library for parsing large Excel files.
http://rubygems.org/gems/creek
MIT License

SharedStrings are entirely loaded in memory #85

Closed (dbernheisel closed 4 years ago)

dbernheisel commented 4 years ago

I'm trying to parse the first couple of rows of a large XLSX, and it seems that the entire workbook's SharedStrings is loaded when calling Creek::Book.new(file), which somewhat defeats the purpose of streaming the rows efficiently.

I tested the memory performance when loading a 12MB XLSX.

require 'creek'
require 'benchmark/memory'

describe 'Creek opens files efficiently' do
  it 'foo' do
    Benchmark.memory do |x|
      x.report("opening") { Creek::Book.new 'spec/fixtures/large.xlsx' }
    end
  end
end

Here are the results:

Calculating -------------------------------------
             opening    27.576M memsize (   696.000  retained)
                       257.683k objects (    11.000  retained)
                        50.000  strings (     9.000  retained)

When I comment out loading SharedStrings, then I don't see that memory bloat:

# lib/creek/shared_strings.rb
# lines 17-22

+      @dictionary = Hash.new
+      # if @book.files.file.exist?(path)
+      #   doc = @book.files.file.open path
+      #   xml = Nokogiri::XML::Document.parse doc
+      #   parse_shared_string_from_document(xml)
+      # end
-      if @book.files.file.exist?(path)
-        doc = @book.files.file.open path
-        xml = Nokogiri::XML::Document.parse doc
-        parse_shared_string_from_document(xml)
-      end

Calculating -------------------------------------
             opening    88.282k memsize (     0.000  retained)
                       319.000  objects (     0.000  retained)
                        35.000  strings (     0.000  retained)

Is there a way to get shared strings lazily?
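
To make the question concrete, here is a minimal sketch of what deferred loading could look like. It is not Creek's implementation: `LazySharedStrings`, `loaded?`, `load_count`, and the loader block are hypothetical names, and the block stands in for the expensive step (opening the shared strings part and parsing it with Nokogiri). The dictionary is built on the first lookup and memoized afterwards.

```ruby
# Hypothetical sketch: defer building the shared strings dictionary
# until the first lookup. None of these names are part of Creek.
class LazySharedStrings
  # Number of times the loader has run (exposed only to illustrate memoization).
  attr_reader :load_count

  # The loader block must return an object that responds to #[],
  # for example an Array or Hash mapping index => string.
  def initialize(&loader)
    raise ArgumentError, 'loader block required' unless loader

    @loader = loader
    @dictionary = nil
    @load_count = 0
  end

  # True once the dictionary has been built.
  def loaded?
    !@dictionary.nil?
  end

  # Looks up a shared string by index, loading the dictionary on first use.
  def [](index)
    dictionary[index]
  end

  private

  # Runs the loader at most once and caches its result.
  def dictionary
    @dictionary ||= begin
      @load_count += 1
      @loader.call
    end
  end
end

# Stand-in for the expensive parse; a real loader would read the workbook.
strings = LazySharedStrings.new { %w[Name Email Country] }

puts strings.loaded?   # nothing has been parsed yet
puts strings[1]        # first lookup triggers the load
puts strings.load_count
```

Note that this only moves the cost: opening the book becomes cheap, but the first string cell that is read still loads the whole dictionary. Avoiding that would need a streaming parse of the shared strings part that stops at the requested index, which this sketch does not attempt.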

pythonicrubyist commented 4 years ago

Lazy loading shared strings reduces performance for worksheets with a small number of shared strings. Given that most Excel files have few shared strings, I think it is better for the majority of users to avoid lazy loading them.