Add partial parsing of replay files

hohav / py-slippi

Python library for parsing SSBM replay files

MIT License


Hey, thanks for the great library.

This PR modifies slippi.game.Game so that it can parse only the metadata, game start, and game end events. This speeds up parsing when you want information about the overall game (characters, ports, stage, etc.) but have no need to inspect the gameplay frames themselves.
My personal use case is loading a large number of replays into a database so that I can later search them by matchup, stage, etc. Parsing all the gameplay frames adds significant processing time, which this PR lets the user avoid if desired (see the performance comparison below).

Changes

Added new optional argument, partial_parse=False, to the constructor of slippi.game.Game
Added new method, _parse_file_partial to slippi.game.Game
Added new test, test_game_partial_parse, and new helper method, _game_partial to test/replays.py

Performance Comparison

A quick check from ipython %timeit:

In [1]: from slippi.game import Game

In [2]: %timeit Game('./test/replays/game.slp')
316 ms ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit Game('./test/replays/game.slp', partial_parse=True)
44.4 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
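
The two timings above work out to roughly a 7x speedup (316 ms / 44.4 ms). For anyone who wants to reproduce the comparison outside IPython, here is a minimal sketch using the standard timeit module. The helper names best_time and speedup are made up for this example and are not part of py-slippi.

```python
import timeit


def best_time(func, repeat=5, number=1):
    """Best per-call wall time of func over `repeat` runs of `number` calls."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


def speedup(slow, fast, **kwargs):
    """How many times faster `fast` is than `slow` (hypothetical helper)."""
    return best_time(slow, **kwargs) / best_time(fast, **kwargs)


# Intended use with this PR's branch installed (not executed here):
#   from slippi.game import Game
#   path = './test/replays/game.slp'
#   print(speedup(lambda: Game(path),
#                 lambda: Game(path, partial_parse=True)))
```

Note that this takes the minimum over runs rather than the mean that %timeit reports, so the absolute numbers will differ slightly.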

Notes

All tests are passing.

Based on something Fizzi said on Discord, it may be possible to extract the game end event without iterating through the raw stream to reach it. If so, this would give some additional speedup, since we could stop parsing after the game start event. Possible future improvement.

Thanks & let me know of any input.

Hi, and thanks for the PR!

I've been working on something along similar lines in the streaming branch, but with a different approach: I parse each frame's data lazily, to pay the parsing penalty only when that frame's data is actually accessed. I also gain some performance by not parsing the whole replay as UBJSON, instead finding the raw element by looking for a specific byte sequence (hacky, but it's hinted at in the spec and it's what the official JS parser does).
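
To illustrate the lazy idea in isolation, here is a minimal sketch. It is not the code from the streaming branch: the LazyFrame class, its fixed 8-byte layout, and the parse_count counter are all invented for the example, and real frames are made of several events of different sizes.

```python
import struct
from functools import cached_property  # Python 3.8+


class LazyFrame:
    """Holds one frame's raw bytes and decodes them on first access."""

    # Hypothetical fixed layout for illustration: frame index, x, y.
    _FMT = struct.Struct(">ihh")
    parse_count = 0  # instrumentation only, to show when decoding happens

    def __init__(self, raw: bytes):
        self._raw = raw

    @cached_property
    def data(self) -> dict:
        # Runs once per frame; the result is cached on the instance.
        LazyFrame.parse_count += 1
        index, x, y = self._FMT.unpack(self._raw)
        return {"index": index, "x": x, "y": y}


def split_frames(stream: bytes) -> list:
    """Slice the stream into frames without decoding any of them."""
    size = LazyFrame._FMT.size
    return [LazyFrame(stream[i:i + size]) for i in range(0, len(stream), size)]


# Synthetic stream of 1000 frames; only the last one is ever decoded.
stream = b"".join(LazyFrame._FMT.pack(i, i * 10, -i) for i in range(1000))
frames = split_frames(stream)
last = frames[-1].data
```

Slicing is still linear in the number of frames, but the comparatively expensive decoding step is paid only for frames that are actually read.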

Your branch is still notably faster for getting just the metadata and start/end events, and could be sped up a bit more by using the same hack as above plus skipping directly to the Game End event. But many users will want to do things like accessing the last frame to determine who won, which the lazy-parsing approach handles very well.
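
A rough sketch of both tricks together, run against a synthetic buffer rather than a real replay. The layout assumed here (a `raw[$U#l` marker followed by a big-endian int32 length, an Event Payloads event 0x35 listing a size for each command, and Game End as command 0x39 at the very end of the raw stream) is my reading of the Slippi spec and should be checked against it. The payload sizes in make_fake_replay are invented, and none of these function names exist in py-slippi.

```python
import struct

RAW_MARKER = b"raw[$U#l"  # UBJSON key "raw", then a uint8 array with int32 length
EVENT_PAYLOADS = 0x35
GAME_END = 0x39


def find_raw(buf: bytes):
    """Return (start, length) of the raw event stream without decoding UBJSON."""
    i = buf.find(RAW_MARKER)
    if i < 0:
        raise ValueError("raw element marker not found")
    len_pos = i + len(RAW_MARKER)
    (length,) = struct.unpack_from(">i", buf, len_pos)
    return len_pos + 4, length


def payload_sizes(buf: bytes, start: int) -> dict:
    """Read the Event Payloads event into {command byte: payload size}."""
    if buf[start] != EVENT_PAYLOADS:
        raise ValueError("expected Event Payloads as the first event")
    size = buf[start + 1]  # assumed to count itself plus the entries
    sizes = {}
    for off in range(start + 2, start + 1 + size, 3):
        cmd, n = struct.unpack_from(">BH", buf, off)
        sizes[cmd] = n
    return sizes


def read_game_end(buf: bytes) -> bytes:
    """Jump straight to the last event instead of walking every frame."""
    start, length = find_raw(buf)
    n = payload_sizes(buf, start)[GAME_END]
    pos = start + length - (n + 1)  # command byte + payload
    if buf[pos] != GAME_END:
        raise ValueError("no Game End event at the expected offset")
    return buf[pos + 1:pos + 1 + n]


def make_fake_replay(n_frames: int = 3) -> bytes:
    """Build a tiny stand-in file; event sizes here are made up."""
    sizes = {0x36: 4, 0x37: 2, GAME_END: 1}
    entries = b"".join(struct.pack(">BH", c, n) for c, n in sizes.items())
    raw = bytes([EVENT_PAYLOADS, 1 + len(entries)]) + entries
    raw += bytes([0x36]) + b"\x00" * 4            # fake Game Start
    raw += (bytes([0x37]) + b"\x00\x00") * n_frames  # fake frame events
    raw += bytes([GAME_END, 2])                   # fake Game End payload
    return (b"{U\x03" + RAW_MARKER + struct.pack(">i", len(raw)) + raw
            + b"U\x08metadata{}}")
```

The offset check in read_game_end matters: a replay that was cut short will not have a Game End event where this arithmetic expects one, so a real implementation would need a fallback to scanning.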

And even with the best speedups we could hope for, it'll still be fairly slow to get the metadata for a large directory of replays. A more scalable approach would be to create some sort of index, perhaps with SQLite. I think that's the only way to make things fast enough for an acceptable UX on large replay collections.
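
As a sketch of what such an index could look like, using only the standard sqlite3 module. The schema, column names, and helper functions are assumptions made for this example, not an existing py-slippi feature, and the character and stage values are passed in as plain strings by the caller.

```python
import sqlite3


def open_index(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a replay index; the schema is hypothetical."""
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS replays (
               path    TEXT PRIMARY KEY,
               mtime   REAL NOT NULL,  -- lets a rescan skip unchanged files
               stage   TEXT NOT NULL,
               char_p1 TEXT NOT NULL,
               char_p2 TEXT NOT NULL)"""
    )
    con.execute(
        "CREATE INDEX IF NOT EXISTS idx_matchup "
        "ON replays (char_p1, char_p2, stage)"
    )
    return con


def upsert(con, path, mtime, stage, chars):
    """Insert or refresh one replay; the caller commits after a batch."""
    a, b = sorted(chars)  # store the matchup in a port-independent order
    con.execute(
        "INSERT OR REPLACE INTO replays VALUES (?, ?, ?, ?, ?)",
        (path, mtime, stage, a, b),
    )


def find(con, chars, stage=None):
    """Return paths of replays with this matchup, optionally on one stage."""
    a, b = sorted(chars)
    sql = "SELECT path FROM replays WHERE char_p1 = ? AND char_p2 = ?"
    args = [a, b]
    if stage is not None:
        sql += " AND stage = ?"
        args.append(stage)
    return [row[0] for row in con.execute(sql + " ORDER BY path", args)]
```

Each replay would be parsed once, when it is first seen or when its modification time changes, after which searches touch only the database.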

But please try out the streaming branch and let me know what you think. I'm inclined to go with that design for now, but I'm definitely still looking for feedback. I'll probably end up reworking things at least once more as use of py-slippi grows, in any case.
