jsonkenl / xlsxir

Xlsx parser for the Elixir language.
MIT License

Improve parsing worksheets #51

Closed AlexKovalevych closed 7 years ago

AlexKovalevych commented 7 years ago

I think we can improve worksheet parsing by not parsing each worksheet entirely in memory, and instead parsing it line by line on request (using Stream).

Here is the idea:

  1. Parse sharedStrings.xml the same way we do right now.
  2. Create a custom Stream for each worksheet.
  3. When the next row is requested, read the next line of worksheet#{i}.xml, feed it to the SAX parser, and return the parsed row.

Of course, if the entire worksheet is a single line (which I think doesn't happen often), we can't do that.
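
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the PR: `RowStreamSketch` is a hypothetical module, a regular expression stands in for the SAX parser, cell values are returned as raw strings (no shared-string lookup), and each `<row>` element is assumed to sit on its own line, which is exactly the caveat mentioned above.

```elixir
defmodule RowStreamSketch do
  # Hypothetical module, not part of Xlsxir.
  # Assumption: every <row> element is on its own line in the worksheet XML.

  # Returns a lazy stream of rows; each row is a list of raw cell strings.
  # The file is opened when the stream is first consumed and closed when
  # the consumer finishes or stops early.
  def rows(path) do
    Stream.resource(
      fn -> File.open!(path, [:read, :utf8]) end,
      &next_row/1,
      &File.close/1
    )
  end

  # Reads one line per request and emits at most one parsed row.
  defp next_row(device) do
    case IO.read(device, :line) do
      :eof ->
        {:halt, device}

      {:error, reason} ->
        raise "read failed: #{inspect(reason)}"

      line ->
        if String.contains?(line, "<row") do
          {[parse_row(line)], device}
        else
          # Not a row line (header, footer, etc.): emit nothing, keep going.
          {[], device}
        end
    end
  end

  # Regex stand-in for the SAX parser: collects the text of each <v> element.
  defp parse_row(line) do
    ~r{<c[^>]*><v>([^<]*)</v></c>}
    |> Regex.scan(line, capture: :all_but_first)
    |> List.flatten()
  end
end
```

With something like this, `RowStreamSketch.rows(path) |> Enum.take(10)` would read only as many lines as are needed to produce ten rows.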

pma commented 7 years ago

While this memory optimization would only apply to the Xlsxir.get_list/1 use case, I'd find it useful. Currently, when parsing a 90K-row sheet, RAM usage grows by 600 to 700 MB.

jsonkenl commented 7 years ago

@pma would you mind helping me test Alex's PR (#52)? I would appreciate your thoughts on it.

pma commented 7 years ago

@jsonkennell Will do. I'll compare processing time and RAM usage.

pma commented 7 years ago

@jsonkennell @AlexKovalevych

I'm sharing my test below. The total time to load and iterate the rows was cut in half.

One comment/suggestion I have is that if it's now possible to stream the output directly from the SAX parser, it would be great if we could have streams all the way up to the public API. Xlsxir.stream would make it easier to compose with things like GenStage/Flow. It would also cut RAM usage since we would be able to avoid loading all the rows at the same time as an intermediate list.
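
To illustrate why a stream-based public API would compose well and avoid the intermediate list, here is a minimal sketch. It does not use any real Xlsxir function: `StreamApiSketch.fake_rows/1` is a hypothetical stand-in that generates rows instead of parsing them, and `sum_third_column/2` is an invented consumer.

```elixir
defmodule StreamApiSketch do
  # Hypothetical stand-in for a row stream; rows are generated, not parsed.
  def fake_rows(count) do
    Stream.map(1..count, fn i -> [i, "name-#{i}", i * 10] end)
  end

  # Consumes rows lazily in batches. Only one batch is materialized at a
  # time, so the full row list never exists in memory.
  def sum_third_column(rows, batch_size) do
    rows
    |> Stream.chunk_every(batch_size)
    |> Stream.map(fn batch ->
      batch |> Enum.map(&Enum.at(&1, 2)) |> Enum.sum()
    end)
    |> Enum.sum()
  end
end
```

Because the result is an ordinary Elixir enumerable, the same stream could in principle be handed to other consumers that accept enumerables.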

Test: Read a 90984 row sheet (3.1 MB)

iex(1)> {t, {:ok, table_id}} = :timer.tc(fn -> Xlsxir.multi_extract("Test1.xlsx", 0) end)
iex(2)> :timer.tc(fn -> Xlsxir.get_list(table_id) |> Enum.count() end)
| branch | function | time (s) |
| --- | --- | --- |
| master | Xlsxir.multi_extract/2 | 33 |
| master | Xlsxir.get_list/1 | 2 |
| performance-boost | Xlsxir.multi_extract/2 | 9 |
| performance-boost | Xlsxir.get_list/1 | 8 |
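
Note that `:timer.tc/1` reports elapsed time in microseconds, while the timings here are given in seconds. A small helper along these lines (the module name `BenchSketch` is made up for illustration) does the conversion:

```elixir
defmodule BenchSketch do
  # Runs fun and returns {elapsed_seconds, result}.
  # :timer.tc/1 returns {microseconds, result}.
  def time_s(fun) do
    {micros, result} = :timer.tc(fun)
    {micros / 1_000_000, result}
  end
end
```

For example, `BenchSketch.time_s(fn -> Xlsxir.get_list(table_id) |> Enum.count() end)` would return the row count together with the elapsed seconds.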

Memory usage increase after running test (in MB).

| branch | proc mem | bin mem | ets mem |
| --- | --- | --- | --- |
| master | 150 | 40 | 83 |
| performance-boost | 39 | 42 | 83 |
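
The comment does not say how these numbers were collected. One plausible way, assuming the columns correspond to the `:processes`, `:binary` and `:ets` categories of `:erlang.memory/1`, is to take a snapshot before and after the call. `MemSketch` is a hypothetical helper, and the deltas depend on garbage collection timing, so they are only approximate.

```elixir
defmodule MemSketch do
  # Assumption: these VM memory categories match the reported columns.
  @types [:processes, :binary, :ets]

  # Current memory use in bytes for each category.
  def snapshot do
    Map.new(@types, fn type -> {type, :erlang.memory(type)} end)
  end

  # Runs fun and returns {increase_in_mb_per_category, result}.
  def increase_mb(fun) do
    before = snapshot()
    result = fun.()

    deltas =
      Map.new(snapshot(), fn {type, bytes} ->
        {type, (bytes - before[type]) / 1_048_576}
      end)

    {deltas, result}
  end
end
```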

jsonkenl commented 7 years ago

Merged #52 so closing for now.