Improve `read_numeric()` to vastly increase `parse()` performance for all tags

MestreLion commented 2 years ago

While profiling NBT file loading to optimize load times, I found that read_numeric() stands at the top by a large margin. Taking a closer look at it, this seems to be the culprit:

def get_format(fmt, string):
    """Return a dictionary containing a format for each byte order."""
    return {"big": fmt(">" + string), "little": fmt("<" + string)}

BYTE = get_format(Struct, "b")
SHORT = get_format(Struct, "h")
...
def read_numeric(fmt, fileobj, byteorder="big"):
    """Read a numeric value from a file-like object."""
    try:
        fmt = fmt[byteorder]
        return fmt.unpack(fileobj.read(fmt.size))[0]
        ...

And that is universally used in all tag classes using a similar pattern:

tag_id = read_numeric(BYTE, fileobj, byteorder)
length = read_numeric(INT, fileobj, byteorder)
tag = cls.get_tag(read_numeric(BYTE, fileobj, byteorder))
data = fileobj.read(read_numeric(INT, fileobj, byteorder) * item_type.itemsize)
...

The problem is: read_numeric creates a new Struct instance on every read. That is a very expensive operation. There should probably be a way to pre-build (or cache) such instances, so either read_numeric or get_format or even BYTE/INT... contain/return the same struct instances, while still keeping the ability to select byteorder on a per-call basis.

I can submit a PR to fix this, and I'm sure reading (and writing) times will vastly improve. I'll do so in a way that does not change the API of any of the tag classes (i.e., keep the Compound.parse(cls, fileobj, byteorder="big") signature for all write/parse of all tags), and possibly keep the read_numeric() signature too (so no changes to the Tag classes at all). Most likely get_format() will change signature and/or internal structure, and the underlying BYTE/INT/... will most likely change their internal values, but I'll do my best to keep them byteorder-agnostic constants.

Is such improvement welcome?

MestreLion commented 2 years ago

I just noticed that pre-made Struct instances are already saved on BYTE/INT/..., so read_numeric() is not creating new instances per-call. Great!

But, still, are improvements to this crucial function welcome?

vberlier commented 2 years ago

At runtime, read_numeric should only perform a dictionary lookup to grab the appropriate struct format and then read and unpack the data. Of course it's in the hot path when parsing, so performance improvements would be very welcome, but I'm not sure if there's any opportunity for easy wins here. Feel free to experiment with it if you have something in mind!
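To make that runtime behavior concrete, here is a minimal sketch based on the snippet quoted at the top of the thread. It leaves out the error handling that the quote elides, and the sample bytes are only an illustration. The Struct objects are built once at import time, so each call only does a dictionary lookup, a read, and an unpack:

```python
import io
from struct import Struct


def get_format(fmt, string):
    """Return a dictionary containing a format for each byte order."""
    return {"big": fmt(">" + string), "little": fmt("<" + string)}


# Built once, when the module is imported.
BYTE = get_format(Struct, "b")
INT = get_format(Struct, "i")


def read_numeric(fmt, fileobj, byteorder="big"):
    """Read a numeric value (error handling omitted in this sketch)."""
    fmt = fmt[byteorder]  # dict lookup only, no Struct is constructed here
    return fmt.unpack(fileobj.read(fmt.size))[0]


# Illustrative data: the same four bytes read with each byte order.
buf = io.BytesIO(b"\x00\x00\x01\x00")
value_big = read_numeric(INT, buf, "big")        # 0x00000100
buf.seek(0)
value_little = read_numeric(INT, buf, "little")  # 0x00010000
```

The same pre-built Struct is reused for every call with a given byte order, which is why there is no per-read construction cost to remove.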

MestreLion commented 2 years ago

That's why I created the benchmarks with other NBT implementations... no point doing experiments if I can't accurately measure the gains. And little point trying to improve what is already pretty damn good. My initial assumption that it was slow and could be "vastly improved" turned out to be wrong.

But still, one experiment I might try is to use an (attribute?) assignment once per File|Root.parse() that sets the endianness, instead of a "run-time" dictionary lookup for every tag. So when Compound.parse() says read_numeric(BYTE, ...), that BYTE would not be a big/little dictionary anymore, but already one of those values/Structs. The job of fmt[endian] would have already been performed by File. read_numeric would take not a dict, but a Struct (or whatever) of a given endianness that was set prior to that. And Compound, as now, would be completely unaware of all of this.

The point is that there is little point in allowing endianness to be set on a per-tag basis. Either the whole file is little endian or big endian, so we can take advantage of this assumption.
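A rough sketch of what that could look like, purely as an illustration: Formats, FORMATS and parse_int_list are hypothetical names, not part of nbtlib, and the "file" is just a length-prefixed list of ints rather than real NBT. The byte order is resolved once at the top-level parse, and read_numeric receives an already-selected Struct:

```python
import io
from struct import Struct


class Formats:
    """Hypothetical bundle of pre-built Structs for one byte order."""

    def __init__(self, prefix):
        self.BYTE = Struct(prefix + "b")
        self.SHORT = Struct(prefix + "h")
        self.INT = Struct(prefix + "i")


# One bundle per byte order, built once at import time.
FORMATS = {"big": Formats(">"), "little": Formats("<")}


def read_numeric(fmt, fileobj):
    """Read a numeric value; fmt is already a Struct, not a dict."""
    return fmt.unpack(fileobj.read(fmt.size))[0]


def parse_int_list(fileobj, byteorder="big"):
    """Hypothetical top-level parse: pick the byte order exactly once."""
    formats = FORMATS[byteorder]  # the only byte-order lookup
    length = read_numeric(formats.INT, fileobj)
    return [read_numeric(formats.INT, fileobj) for _ in range(length)]


big = io.BytesIO(b"\x00\x00\x00\x02\x00\x00\x00\x07\x00\x00\x00\x09")
little = io.BytesIO(b"\x02\x00\x00\x00\x07\x00\x00\x00\x09\x00\x00\x00")
big_values = parse_int_list(big, "big")
little_values = parse_int_list(little, "little")
```

Note that this trades the per-call dictionary lookup for an attribute lookup (formats.INT), so whether it is actually faster would have to be measured.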

Hmm, perhaps Compound would have to be a little aware, as it may have to use self.BYTE instead of a module-level BYTE. Hmm, class attribute lookup. Bad tradeoff?

Benchmarks. We need benchmarks.
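As a starting point, a micro-benchmark could be sketched with the standard timeit module. The function names and the all-zero payload are made up for illustration, and this measures only the lookup overhead in isolation, not real NBT parsing:

```python
import io
import timeit
from struct import Struct

INT = {"big": Struct(">i"), "little": Struct("<i")}
PAYLOAD = bytes(4 * 1000)  # 1000 zero-valued 32-bit ints


def read_with_lookup(n=1000, byteorder="big"):
    """Resolve the byte order on every read, as read_numeric does today."""
    fileobj = io.BytesIO(PAYLOAD)
    total = 0
    for _ in range(n):
        fmt = INT[byteorder]
        total += fmt.unpack(fileobj.read(fmt.size))[0]
    return total


def read_prebound(n=1000, byteorder="big"):
    """Resolve the byte order once, before the loop."""
    fileobj = io.BytesIO(PAYLOAD)
    fmt = INT[byteorder]
    total = 0
    for _ in range(n):
        total += fmt.unpack(fileobj.read(fmt.size))[0]
    return total


def benchmark(repeat=5, number=100):
    """Return the best time in seconds for each variant."""
    return {
        func.__name__: min(timeit.repeat(func, repeat=repeat, number=number))
        for func in (read_with_lookup, read_prebound)
    }
```

Calling benchmark() returns a dict of timings keyed by function name. The numbers will vary by machine and Python version, so they are only meaningful when both variants are compared in the same run.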

Or skip all of that and go Cython. Please!

MestreLion commented 2 years ago

An interesting optimization approach taken by Minecraft: it caches all 256 possible Byte values as pre-built instances.
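A Python analogue of that idea might look like the sketch below. This Byte class is hypothetical and is not nbtlib's actual Byte tag. It assumes signed byte values from -128 to 127 and hands out shared, pre-built instances instead of allocating a new object per value:

```python
class Byte(int):
    """Hypothetical signed-byte tag backed by a cache of all 256 values."""

    _cache = {}

    def __new__(cls, value=0):
        try:
            # Every valid value maps to one shared, pre-built instance.
            return cls._cache[int(value)]
        except KeyError:
            raise OverflowError(
                f"{value!r} does not fit in a signed byte"
            ) from None


# Populate the cache once; int.__new__ bypasses Byte.__new__ above.
Byte._cache.update({v: int.__new__(Byte, v) for v in range(-128, 128)})

first = Byte(5)
second = Byte(5)
same_object = first is second
```

Since int instances are immutable, sharing them is safe. A subclass of Byte would need its own cache, though, because this sketch always returns plain Byte instances.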

vberlier / nbtlib, issue #150