JukkaL opened this issue 7 years ago
Random idea: could we use a different format than JSON, perhaps marshal?

JSON serialization itself is not the bottleneck. The json stdlib module uses highly optimized C code when (de)serializing with default options, which we are careful to use unless you pass `--debug-cache`. The cost is all in constructing dicts that the json module can serialize, and (after deserialization) constructing objects from those same dicts. This is what the `serialize()` and `deserialize()` methods do.
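The dict-building pattern described above looks roughly like this (a hypothetical, heavily simplified node class for illustration; mypy's real node classes are far more involved):

```python
import json


class FuncInfo:
    """Hypothetical simplified AST node with explicit (de)serialization."""

    def __init__(self, name: str, line: int) -> None:
        self.name = name
        self.line = line

    def serialize(self) -> dict:
        # Building this intermediate dict (one per node) is what dominates
        # the cost, not json.dumps() itself.
        return {".class": "FuncInfo", "name": self.name, "line": self.line}

    @classmethod
    def deserialize(cls, data: dict) -> "FuncInfo":
        # The dict produced by json.loads() is consumed and discarded here.
        assert data[".class"] == "FuncInfo"
        return cls(data["name"], data["line"])


blob = json.dumps(FuncInfo("f", 10).serialize())
restored = FuncInfo.deserialize(json.loads(blob))
```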
Another random idea (probably I have mentioned it before): use `pickle`. It supports custom stateful objects via `__getstate__` and `__setstate__`. The downside is that the cache will be human-unreadable (unlike JSON). But if this gives a significant speed-up, maybe we can still consider it?

(Also, I think the stdlib `__getstate__`/`__setstate__` protocol will be more familiar to new contributors than custom `serialize`/`deserialize` methods.)
Hm, but the `__getstate__`/`__setstate__` methods have to do about the same amount of work as `serialize`/`deserialize`, so I don't expect a big benefit from this. Also, people have generally soured on pickle due to various vulnerabilities and incompatibilities.
Yes, but on the other hand, some of the methods could be simpler, since intermediate `dict`s are not created (everything is put into the object's `__dict__` directly), plus some work is already done in `__init__`, plus some method calls happen at the C level. I agree it is not obvious, and benchmarking is needed to prove any win with this approach.
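For reference, the `__getstate__`/`__setstate__` approach discussed above could be sketched like this (a hypothetical toy class, not mypy code):

```python
import pickle


class Node:
    """Hypothetical node with transient state that should not be pickled."""

    def __init__(self, fullname: str) -> None:
        self.fullname = fullname
        self.cached = object()  # transient, rebuilt on load

    def __getstate__(self) -> dict:
        # Pickle calls this instead of grabbing __dict__ directly;
        # we drop the transient attribute from the saved state.
        state = self.__dict__.copy()
        del state["cached"]
        return state

    def __setstate__(self, state: dict) -> None:
        # Everything goes into __dict__ directly; no per-field dicts,
        # and __init__ is bypassed entirely.
        self.__dict__.update(state)
        self.cached = object()  # rebuild transient state


n = pickle.loads(pickle.dumps(Node("m.f")))
```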
Just another random idea. I have heard from a friend who works on a project with a feature very similar to our `--incremental` mode: they recently switched from JSON to SQL and saw a reasonable performance boost. I am not aware of any details, and my experience with SQL is very limited; I just wanted to share the idea.

I understand that the bottleneck is probably not parsing JSON but creating the actual instances, but I have heard that SQL supports self-referential data structures, so maybe we can also save some time in `fixup.py`.
Decreasing priority for now. We'll be focusing on fine-grained incremental checking first.
Is this still a concern? I believe we now have a SQL store for the cache as an option. If it is still a concern, I do like the idea of using `__getstate__`/`__setstate__`.
We now have an option of using sqlite for storing cache information, which is helpful. Deserialization performance is still an issue for large builds, though.
Using `__getstate__`/`__setstate__` can be tricky because we must allow cache files to be loaded incrementally, and cache files can have references to things defined in other cache files. Import cycles can have cyclic references.
Another idea not discussed above is to deserialize cache files lazily. For example, consider a module `m` that (directly or indirectly) depends on 5000 other modules. Currently we will deserialize all 5000 modules if `m` is dirty. However, usually most of the dependencies are not actually required to type check `m`. We could plausibly delay deserializing a module until we actually need some definition from it while type checking. The details are still hazy, but this seems like a big potential win, though it's non-trivial to implement.
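The lazy scheme could look roughly like the following sketch, where a module's cache blob is only parsed on first lookup (hypothetical `LazyModule` wrapper; the real design would need to handle cross-file references and fixups):

```python
import json


class LazyModule:
    """Hypothetical wrapper: hold the raw cache blob, parse on first use."""

    def __init__(self, raw: str) -> None:
        self._raw = raw
        self._data = None  # not yet deserialized

    def lookup(self, name: str):
        if self._data is None:
            # Deserialize on demand, only when a definition is needed.
            self._data = json.loads(self._raw)
        return self._data[name]


# 5000 dependencies loaded, but none parsed up front:
mods = {f"m{i}": LazyModule('{"x": "builtins.int"}') for i in range(5000)}
value = mods["m0"].lookup("x")  # only m0 actually gets deserialized
parsed = sum(m._data is not None for m in mods.values())
```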
Our use of SQL is totally half-assed, though. We have a table with `filename`, `data`, and `mtime` columns, and we store our JSON blobs in it.
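In other words, the SQL store is essentially a key-value table, roughly like this sketch (hypothetical schema matching the columns mentioned above, not mypy's exact DDL):

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# One row per cache file: the JSON blob is stored opaquely as text.
conn.execute(
    "CREATE TABLE cache (filename TEXT PRIMARY KEY, data TEXT, mtime REAL)"
)
blob = json.dumps({"names": ["f", "g"]})
conn.execute(
    "INSERT INTO cache VALUES (?, ?, ?)", ("m.meta.json", blob, time.time())
)
row = conn.execute(
    "SELECT data FROM cache WHERE filename = ?", ("m.meta.json",)
).fetchone()
```

SQLite here only replaces the filesystem as a blob store; the JSON (de)serialization cost is unchanged.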
Hi, is there any way to implement custom serialization/deserialization like pickle? In our case loading caches from disk takes almost as long as generating them anew.
Another option is to write a custom binary serializer, but it's kind of awkward to implement right now. Once mypyc adds better support for encoding and decoding binary data, this should be feasible, and it would be a real-world use case for the relevant, though currently still hypothetical, mypyc functionality. This would have to be designed carefully to avoid complicating the maintenance of mypy serialization code, if we decide to take this route.
Faster cache deserialization could help once we have parallel type checking (#933), at least if parallel workers would use the mypy cache to share data with each other.
Cache size reduction is a related issue (#15731).
A faster, custom binary cache format is a promising way to speed up deserialization. It can be combined with some of the other ideas discussed above.
Constructing objects is likely a big bottleneck. A JSON parser constructs a string object for each dictionary key, and these are discarded afterwards. Also, many of the list and dictionary objects produced during JSON parsing are discarded during deserialization. A custom deserialization format would let us avoid most of this overhead.
I did a small experiment, which suggests that JSON parsing likely accounts for roughly half of the time spent in deserialization during self checking. Another interesting observation is that most of the strings in the data occur at least twice per file.
I will explain my proposal for the binary format next.
Each serialized value can be encoded as an 8-bit type tag followed by type-specific binary data. Values can be either primitives (int, float, str, bool, None, list, or dict) or instances (e.g. `TupleType`).

Each primitive type has a separate encoding. For example, an int would be encoded as a variable-length sequence of bytes: if the high bit of a byte is set, additional bytes follow.
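A minimal sketch of that variable-length int encoding (the standard LEB128-style scheme, restricted here to non-negative ints for simplicity):

```python
def write_int(n: int) -> bytes:
    # Emit 7 payload bits per byte, low bits first; set the high bit
    # when more bytes follow (non-negative ints only in this sketch).
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)  # continuation bit set
        else:
            out.append(b)
            return bytes(out)


def read_int(data: bytes, pos: int) -> "tuple[int, int]":
    # Returns (value, new position).
    result = shift = 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):
            return result, pos
```

Small ints (the common case for line numbers and tags) take a single byte.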
Instance data has a common encoding as (name, value) fields that represent attributes. Each attribute is encoded as a single byte tag representing the name, followed by the value, which uses the general value encoding. The (name, value) fields must appear in a specific order, but some may be omitted, depending on the values of earlier attributes. The field name tags are shared between instance types, so that every instance type with a `name` attribute will use the same tag for `name`.
Since most string values tend to have duplicates in the same cache file, sharing string objects would improve efficiency. Each string would either be encoded as length followed by contents, or as a reference to a previously decoded string. Using references to strings lets us share string objects, which would both speed up decoding and reduce memory use, at the cost of slightly slower encoding.
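The string-sharing idea could be sketched like this (hypothetical `STR_NEW`/`STR_REF` tags, working on a token list rather than raw bytes to keep the example short):

```python
STR_NEW, STR_REF = 0, 1  # hypothetical tag values


def encode_strings(values: "list[str]") -> "list[tuple[int, object]]":
    seen: "dict[str, int]" = {}
    out = []
    for s in values:
        if s in seen:
            out.append((STR_REF, seen[s]))  # back-reference by index
        else:
            seen[s] = len(seen)
            out.append((STR_NEW, s))  # literal, registered for reuse
    return out


def decode_strings(encoded: "list[tuple[int, object]]") -> "list[str]":
    table: "list[str]" = []
    out = []
    for tag, payload in encoded:
        if tag == STR_NEW:
            table.append(payload)
            out.append(payload)
        else:
            # Reuse the previously decoded object: shared, not copied.
            out.append(table[payload])
    return out


vals = ["builtins.int", "x", "builtins.int"]
decoded = decode_strings(encode_strings(vals))
```

Because a back-reference yields the same Python object, duplicates cost one table lookup to decode and share memory afterwards.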
It may also be worth it to have an efficient encoding of shared module name prefixes of fullnames, since many fullnames have long -- but shared -- module name prefixes. This doesn't seem essential, though.
This format has some nice features:
There are some drawbacks and limitations as well:
Module `mypy/serialize/fields.py` would define all the field tags:

```python
# Field tags (must be unique and in range 0..255)
name: Final = 0
line: Final = 1
items: Final = 2
...
```
Module `mypy/serialize/types.py` would define all the type tags:

```python
# Type tags (must be unique and in range 0..255)
instance: Final = 0  # Instance
tuple_type: Final = 1  # TupleType
...
```

Some type tags would be reserved for primitives, such as `str`.
Serialization and deserialization methods could look like this:

```python
from mypy.serialize import fields, types

...

class TupleType:
    ...

    def serialize(self, w: Writer) -> None:
        w.write_type_tag(types.tuple_type)
        w.write_type_list_field(fields.items, self.items)
        w.write_instance_field(fields.partial_fallback, self.partial_fallback)
        w.write_bool_field(fields.implicit, self.implicit)

    @classmethod
    def deserialize(cls, r: Reader) -> TupleType:
        r.read_type_tag(types.tuple_type)
        return TupleType(
            r.read_type_list_field(fields.items),
            r.read_instance_field(fields.partial_fallback),
            implicit=r.read_bool_field(fields.implicit),
        )
```
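To make the `Writer`/`Reader` protocol concrete, here is a minimal sketch of the two classes, covering only the type tag and a bool field (hypothetical API and tag values; the real thing would add ints, strings, lists, and instances):

```python
class Writer:
    def __init__(self) -> None:
        self.buf = bytearray()

    def write_type_tag(self, tag: int) -> None:
        self.buf.append(tag)

    def write_bool_field(self, field: int, value: bool) -> None:
        self.buf.append(field)  # single-byte field name tag
        self.buf.append(1 if value else 0)


class Reader:
    def __init__(self, data: bytes) -> None:
        self.data = data
        self.pos = 0

    def _next(self) -> int:
        b = self.data[self.pos]
        self.pos += 1
        return b

    def read_type_tag(self, expected: int) -> None:
        assert self._next() == expected

    def read_bool_field(self, field: int) -> bool:
        # Fields arrive in a fixed, known order, so we can assert the tag.
        assert self._next() == field
        return self._next() != 0


w = Writer()
w.write_type_tag(1)          # e.g. types.tuple_type
w.write_bool_field(3, True)  # e.g. a hypothetical fields.implicit tag
r = Reader(bytes(w.buf))
r.read_type_tag(1)
```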
Currently, performance with the binary format would probably be better than with JSON when compiled with mypyc, but worse without compilation. With further mypyc improvements, we can probably squeeze out a bunch of additional performance. It's too early to give precise estimates, but I expect that we can make deserialization 2x to 3x faster with a binary format (assuming some new mypyc features).
Deserialization of cache files is a major bottleneck in incremental mode, so any performance improvements there would be valuable.
At least @pkch has previously worked on this, and it's unclear if any significant wins are achievable without a lot of work. However, there seems to be some redundant information in the cache files that we haven't looked at eliminating yet.
Here are some random ideas:

- … `int` and `str`.
- … `"defaults": true` to mark that we can skip processing non-default values altogether.

Idea for how to move forward:
In any case, this issue isn't important enough to spend a lot of time on. We have other plans to speed up mypy, but as those will take more effort, some quick wins here would be nice.