Improve performance of Reader._build_thread().

thomasvandoren commented 3 years ago

I ran into issues with getting a relatively large archive (~100MB compressed zip) to load. I did some performance analysis with line profiler, and discovered most of the time was in _build_thread() doing trivial string compares... lots and lots of them.

Refactor _build_thread() to iterate through channel_data[channel_name] twice. This is an improvement over the N squared (at least) iteration that was previously used, which resulted in more iterations that slowed down performance. Eliminating the nested loop over channel_data[channel_name] is done by pre-computing a dictionary (O(1) lookup) that can be used to look up messages by user and timestamp.

Two other small improvements are:

use enumerate to keep track of the location while iterating, without needing to call the index() function on individual iterations
check for the specific keys in the _message dict, instead of using an iterator and all() (this probably is negligible perf, and is more for clarity)

I also tested the impact of json parsing by implementing an optional use of the orjson library when it's available. It did not have any measurable impact on performance in tests with the archives I have, so I left it out.

These are the perf test results run on an intel macbook pro with the ~100MB compressed zip slack export:

Perf test	Tests Run	Average time (seconds)	Stdev time (seconds)
Optimized `_build_thread`	10	24.5	1.3
orjson + Optimized `_build_thread`	10	24.8	0.8
Original	1	900.0	0 (only ran once and it didn't complete after 15 minutes)

The command I used:

time slack-export-viewer --no-browser -p 8088 --test -z Test\ Slack\ export\ Test.zip

thomasvandoren commented 3 years ago

To verify this change, I checked the html output to ensure the messages/threads/etc were still getting built. It looked ok, however I would not say I did a rigorous job of ensuring correctness...

What's the best way to ensure the behavior hasn't changed? Thanks!

hfaran commented 3 years ago

Thanks for @thomasvandoren. As for testing, if you did a visual inspection and everything looked good, that should be good enough given the changes you're making. (Test coverage is pretty bare).

hfaran / slack-export-viewer

Improve performance of Reader._build_thread(). #142