Closed thomasvandoren closed 3 years ago
To verify this change, I checked the HTML output to ensure the messages, threads, etc. were still getting built. It looked OK, but I would not say I did a rigorous job of ensuring correctness...
What's the best way to ensure the behavior hasn't changed? Thanks!
Thanks for this, @thomasvandoren. As for testing, if you did a visual inspection and everything looked good, that should be good enough given the changes you're making (test coverage is pretty bare).
I ran into issues with getting a relatively large archive (~100MB compressed zip) to load. I did some performance analysis with line profiler, and discovered most of the time was in _build_thread() doing trivial string compares... lots and lots of them.
Refactor _build_thread() to iterate through channel_data[channel_name] twice. This is an improvement over the previous nested iteration, which was at least O(N²) in the number of messages in the channel and accounted for most of the load time on large archives. The nested loop over channel_data[channel_name] is eliminated by pre-computing a dictionary (O(1) lookup) that can be used to look up messages by user and timestamp.
Two other small improvements are:
use enumerate to keep track of the position while iterating, instead of calling index() on every iteration
check for the specific keys in the _message dict directly, instead of using an iterator and all() (the performance gain is probably negligible; this is mostly for clarity)
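To illustrate the two-pass idea described above, here is a minimal sketch. It is not the actual _build_thread() code: the function names, the shape of the message dicts, and the "replies" field are assumptions made for the example.

```python
from typing import Any, Dict, List, Tuple

Message = Dict[str, Any]


def index_by_user_and_ts(messages: List[Message]) -> Dict[Tuple[str, str], int]:
    """First pass: map (user, ts) to the message's position in the list."""
    index: Dict[Tuple[str, str], int] = {}
    # enumerate supplies the position, so no list.index() call is needed.
    for position, message in enumerate(messages):
        # Check the specific keys directly rather than via all(...).
        if "user" in message and "ts" in message:
            index[(message["user"], message["ts"])] = position
    return index


def collect_replies(messages: List[Message]) -> Dict[int, List[int]]:
    """Second pass: resolve each parent's replies with O(1) dict lookups.

    Returns a mapping of parent position -> positions of its replies.
    The "replies" field layout is an assumption for this sketch.
    """
    index = index_by_user_and_ts(messages)
    threads: Dict[int, List[int]] = {}
    for position, message in enumerate(messages):
        if "replies" not in message:
            continue
        reply_positions: List[int] = []
        for reply in message["replies"]:
            key = (reply["user"], reply["ts"])
            # Replies missing from the channel data are skipped.
            if key in index:
                reply_positions.append(index[key])
        threads[position] = reply_positions
    return threads
```

Each pass touches every message once, so the total work grows with the number of messages plus the number of replies, instead of comparing every message against every other message.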
I also tested the impact of json parsing by implementing an optional use of the orjson library when it's available. It did not have any measurable impact on performance in tests with the archives I have, so I left it out.
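For reference, an optional orjson fallback of the kind described above could look roughly like the sketch below. This is not the code that was tested for this PR; load_json_bytes is a hypothetical helper name, and the approach was left out of the change because it showed no measurable benefit.

```python
import json

try:
    import orjson  # optional third-party dependency
except ImportError:
    orjson = None


def load_json_bytes(raw: bytes):
    """Parse JSON with orjson when it is installed, else with the stdlib."""
    if orjson is not None:
        return orjson.loads(raw)
    # json.loads accepts bytes as well as str.
    return json.loads(raw)
```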
These are the perf test results, run on an Intel MacBook Pro with the ~100MB compressed zip Slack export:
[Results table for _build_thread not recoverable]
The command I used: