Closed: AlexKovalevych closed this issue 7 years ago
While this memory optimization would only apply to the `Xlsxir.get_list/1` use case, I'd find it useful. Currently, parsing a 90K-row sheet grows RAM usage by 600 to 700 MB.
@pma, would you mind helping me test Alex's PR (#52)? I would appreciate your thoughts on it.
@jsonkennell Will do. I'll compare processing time and RAM usage.
@jsonkennell @AlexKovalevych
My test results are below. The total time of loading and iterating the rows was cut in half.
One comment/suggestion I have: if it's now possible to stream the output directly from the SAX parser, it would be great to have streams all the way up to the public API. An `Xlsxir.stream` function would make it easier to compose with things like GenStage/Flow. It would also cut RAM usage, since we could avoid materializing all the rows at once as an intermediate list.
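To illustrate the suggestion, a public streaming API could wrap the parser in `Stream.resource/3`. This is only a sketch of what a hypothetical `Xlsxir.stream` might look like; the function does not exist in the library, and `open_sheet/2`, `next_row/1`, and `close_sheet/1` are stand-ins for the real SAX plumbing (stubbed here with canned data so the sketch compiles and runs):

```elixir
defmodule Xlsxir.StreamSketch do
  # Placeholder SAX plumbing: a real implementation would wrap the
  # erlsom-based parser. These stubs stream a canned list of rows.
  defp open_sheet(_path, _sheet_index), do: [[1, "a"], [2, "b"]]
  defp next_row([row | rest]), do: {:row, row, rest}
  defp next_row([]), do: :done
  defp close_sheet(_state), do: :ok

  def stream(path, sheet_index) do
    Stream.resource(
      fn -> open_sheet(path, sheet_index) end,   # start: set up parser state
      fn state ->
        case next_row(state) do
          {:row, row, rest} -> {[row], rest}     # emit one row at a time
          :done -> {:halt, state}
        end
      end,
      &close_sheet/1                             # cleanup
    )
  end
end
```

Because the result is a lazy `Stream`, rows never all live in memory at once, and consumers can pipe it straight into `Stream.map/2`, `Flow.from_enumerable/1`, or a GenStage producer.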
Test: read a 90,984-row sheet (3.1 MB).

```elixir
iex(1)> {t, {:ok, table_id}} = :timer.tc(fn -> Xlsxir.multi_extract("Test1.xlsx", 0) end)
iex(2)> :timer.tc(fn -> Xlsxir.get_list(table_id) |> Enum.count() end)
```
| branch | fun | time (s) |
|---|---|---|
| master | `Xlsxir.multi_extract/2` | 33 |
| master | `Xlsxir.get_list/1` | 2 |
| performance-boost | `Xlsxir.multi_extract/2` | 9 |
| performance-boost | `Xlsxir.get_list/1` | 8 |
Memory usage increase after running the test:

| branch | proc mem (MB) | bin mem (MB) | ets mem (MB) |
|---|---|---|---|
| master | 150 | 40 | 83 |
| performance-boost | 39 | 42 | 83 |
Merged #52, so closing for now.
I think we can improve worksheet parsing by not loading them entirely into memory, but instead parsing line by line on request (using Stream).

Here is the idea:

1. Parse `sharedString.xml` the same way we do right now.
2. Read `worksheet#{i}.xml` line by line, feed each line to the SAX parser, and return the parsed line.

Of course, if the entire worksheet is a single line (which I think does not happen often), we can't do that.
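The line-by-line idea above could be sketched with `File.stream!/1` feeding lines into a SAX step function via `Stream.transform/3`. This is not the library's implementation; `initial_sax_state/0` and `sax_step/2` are hypothetical names, stubbed here with a trivial "parser" so the example compiles and runs:

```elixir
defmodule Worksheet.LazyParse do
  # Placeholder SAX state and step. A real step function would push the
  # line into a SAX parser continuation and return any completed rows;
  # this stub just emits lines that look like <row> elements.
  defp initial_sax_state, do: nil

  defp sax_step(line, state) do
    if String.contains?(line, "<row"), do: {[line], state}, else: {[], state}
  end

  def rows(worksheet_path) do
    worksheet_path
    |> File.stream!()                              # lazily read one line at a time
    |> Stream.transform(initial_sax_state(), &sax_step/2)
  end
end
```

As noted above, this only helps when the worksheet XML actually contains newlines; for a single-line file one could fall back to fixed-size chunks via `File.stream!(path, [], chunk_size)` and let the SAX parser handle elements that span chunk boundaries.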