Improve parsing worksheets

AlexKovalevych commented 7 years ago

I think we can improve worksheet parsing by not loading each worksheet entirely into memory, and instead parsing it line by line on request (using Stream).

Here is the idea:

Parse sharedStrings.xml the same way we do right now.
Create a custom Stream for each worksheet.
When the next row is requested, read the next line of worksheet#{i}.xml and feed it to the SAX parser, then return the parsed row.

Of course, if the entire worksheet is a single line (which I don't think happens often), we can't do that.
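
The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the module name LazySheetSketch is hypothetical and not part of Xlsxir, and the regex-based parse_line/1 is a stand-in for feeding the line to the real SAX parser. It also shares the limitation just mentioned: a worksheet stored as one long line would come back as a single "row".

```elixir
defmodule LazySheetSketch do
  # Hypothetical module for illustration; it is not part of Xlsxir.

  # Returns a lazy stream of parsed rows. Nothing is read from disk
  # until the stream is consumed, and only one line is held at a time.
  def rows(path) do
    Stream.resource(
      # Open the worksheet file when the stream is first consumed.
      fn -> File.open!(path, [:read, :utf8]) end,
      # Each request for the next element reads and parses one line.
      fn device ->
        case IO.read(device, :line) do
          :eof -> {:halt, device}
          {:error, reason} -> raise "read failed: #{inspect(reason)}"
          line -> {[parse_line(line)], device}
        end
      end,
      # Close the file when the stream halts or the consumer stops early.
      fn device -> File.close(device) end
    )
  end

  # Stand-in for the SAX parser: collect the text of every <v>...</v>
  # cell value found on the line.
  defp parse_line(line) do
    ~r{<v>(.*?)</v>}
    |> Regex.scan(line, capture: :all_but_first)
    |> List.flatten()
  end
end
```

With something like this, `LazySheetSketch.rows(path) |> Enum.take(10)` would read only the first ten lines of the file instead of the whole worksheet.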

pma commented 7 years ago

While this memory optimization would only apply to the Xlsxir.get_list/1 use case, I'd find it useful. Currently, when parsing a 90K-row sheet, RAM usage grows by 600 to 700 MB.

jsonkenl commented 7 years ago

@pma would you mind helping me test Alex's PR (#52)? I would appreciate your thoughts on it.

pma commented 7 years ago

@jsonkennell Will do. I'll compare processing time and RAM usage.

pma commented 7 years ago

@jsonkennell @AlexKovalevych

My test results are below. The total time for loading and iterating over the rows was cut in half.

One comment/suggestion I have is that if it's now possible to stream the output directly from the SAX parser, it would be great if we could have streams all the way up to the public API. Xlsxir.stream would make it easier to compose with things like GenStage/Flow. It would also cut RAM usage since we would be able to avoid loading all the rows at the same time as an intermediate list.
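
To illustrate why a stream-returning public function would compose well, here is a minimal sketch. Everything in it is an assumption for illustration: StreamApiSketch.row_stream/1 is a hypothetical stand-in for the proposed Xlsxir.stream, and it generates rows lazily instead of parsing a file. The consumer works in batches, so the full list of rows is never built.

```elixir
defmodule StreamApiSketch do
  # Hypothetical stand-in for a stream-returning public function such as
  # the proposed Xlsxir.stream. Rows are generated lazily, not parsed.
  def row_stream(row_count) do
    Stream.map(1..row_count, fn n -> [n, "name-#{n}", n * 10] end)
  end

  # A consumer that holds at most one batch of rows in memory at a time.
  def sum_third_column(rows, batch_size) do
    rows
    |> Stream.chunk_every(batch_size)
    |> Stream.map(fn batch ->
      batch |> Enum.map(&Enum.at(&1, 2)) |> Enum.sum()
    end)
    |> Enum.sum()
  end
end

# Process 90,984 generated rows in batches of 1,000.
total = StreamApiSketch.sum_third_column(StreamApiSketch.row_stream(90_984), 1_000)
IO.inspect(total, label: "sum of third column")
```

The same pipeline shape is what would let rows be handed to GenStage/Flow stages as they are parsed, rather than after the whole sheet has been loaded.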

Test: Read a 90984 row sheet (3.1 MB)

iex(1)> {t, {:ok, table_id}} = :timer.tc(fn -> Xlsxir.multi_extract("Test1.xlsx", 0) end)
iex(2)> :timer.tc(fn -> Xlsxir.get_list(table_id) |> Enum.count end)

branch	fun	time (s)
master	Xlsxir.multi_extract/2	33
master	Xlsxir.get_list/1	2
performance-boost	Xlsxir.multi_extract/2	9
performance-boost	Xlsxir.get_list/1	8

Memory usage increase after running test (in MB).

branch	proc mem	bin mem	ets mem
master	150	40	83
performance-boost	39	42	83
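
For anyone who wants to reproduce this kind of comparison, below is one way such deltas could be collected. It is a sketch under an assumption: that the "proc mem", "bin mem" and "ets mem" columns correspond to the :processes, :binary and :ets categories of :erlang.memory/1. The thread does not say how the numbers above were actually measured, and MemDeltaSketch is a hypothetical helper, not part of Xlsxir.

```elixir
defmodule MemDeltaSketch do
  # Hypothetical helper; the memory categories are an assumed mapping
  # to the "proc mem", "bin mem" and "ets mem" columns.
  @types [:processes, :binary, :ets]

  # Runs fun and returns {result, map of memory deltas in MB}.
  def measure(fun) do
    before = :erlang.memory(@types)
    result = fun.()
    after_run = :erlang.memory(@types)

    delta_mb =
      for type <- @types, into: %{} do
        {type, (after_run[type] - before[type]) / (1024 * 1024)}
      end

    {result, delta_mb}
  end
end

# Example: measure the effect of filling a small ETS table.
{table, delta} =
  MemDeltaSketch.measure(fn ->
    table = :ets.new(:mem_delta_demo, [:set])
    :ets.insert(table, Enum.map(1..10_000, &{&1, &1}))
    table
  end)

IO.inspect(delta, label: "memory delta (MB)")
```

The deltas are VM-wide snapshots, so garbage collection and other processes running at the same time will add noise; repeated runs give a better picture than a single one.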

jsonkenl commented 7 years ago

Merged #52 so closing for now.

jsonkenl / xlsxir

Improve parsing worksheets #51