JukkaL opened this issue 7 years ago
Random idea: could we use a different format than JSON, perhaps marshal?

JSON serialization itself is not the bottleneck. The json stdlib module uses highly optimized C code when (de)serializing with default options, which we are careful to use unless you pass `--debug-cache`. The cost is all in constructing dicts that the json module can serialize, and (after deserialization) constructing objects from those same dicts. This is what the `serialize()` and `deserialize()` methods do.
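The dict-building pattern described above looks roughly like this (a hypothetical, heavily simplified node class for illustration; mypy's real node classes are far more involved):

```python
import json


class FuncInfo:
    """Hypothetical simplified AST node with explicit (de)serialization."""

    def __init__(self, name: str, line: int) -> None:
        self.name = name
        self.line = line

    def serialize(self) -> dict:
        # Building this intermediate dict (one per node) is what dominates
        # the cost, not json.dumps() itself.
        return {".class": "FuncInfo", "name": self.name, "line": self.line}

    @classmethod
    def deserialize(cls, data: dict) -> "FuncInfo":
        # The dict produced by json.loads() is consumed and discarded here.
        assert data[".class"] == "FuncInfo"
        return cls(data["name"], data["line"])


blob = json.dumps(FuncInfo("f", 10).serialize())
restored = FuncInfo.deserialize(json.loads(blob))
```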
Another random idea (probably I have mentioned it before): use `pickle`. It supports custom stateful objects via `__getstate__` and `__setstate__`. The downside is that the cache will be human-unreadable (unlike JSON). But if this gives a significant speed-up, maybe we can still consider it?

(Also, I think the stdlib `__getstate__`/`__setstate__` protocol will be more familiar to new contributors than custom `serialize`/`deserialize` methods.)
Hm, but the `__getstate__`/`__setstate__` methods have to do about the same amount of work as `serialize`/`deserialize`, so I don't expect a big benefit from this. Also, people have generally soured on pickle due to various vulnerabilities and incompatibilities.
Yes, but on the other hand, some of the methods could be simpler, since intermediate `dict`s are not created (everything is put into the object's `__dict__` directly), plus some work is already done in `__init__`, plus some method calls happen at the C level. I agree it is not obvious, and benchmarking is needed to prove any win with this approach.
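For reference, the `__getstate__`/`__setstate__` approach discussed above could be sketched like this (a hypothetical toy class, not mypy code):

```python
import pickle


class Node:
    """Hypothetical node with transient state that should not be pickled."""

    def __init__(self, fullname: str) -> None:
        self.fullname = fullname
        self.cached = object()  # transient, rebuilt on load

    def __getstate__(self) -> dict:
        # Pickle calls this instead of grabbing __dict__ directly;
        # we drop the transient attribute from the saved state.
        state = self.__dict__.copy()
        del state["cached"]
        return state

    def __setstate__(self, state: dict) -> None:
        # Everything goes into __dict__ directly; no per-field dicts,
        # and __init__ is bypassed entirely.
        self.__dict__.update(state)
        self.cached = object()  # rebuild transient state


n = pickle.loads(pickle.dumps(Node("m.f")))
```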
Just another random idea. I have heard from a friend who works on a project with a feature very similar to our `--incremental` mode: they recently switched from JSON to SQL and saw a reasonable performance boost. I am not aware of any details, and my experience with SQL is very limited; I just wanted to share the idea.

I understand that the bottleneck is probably not parsing JSON but creating the actual instances, but I have heard that SQL supports self-referential data structures, so maybe we can also save some time in `fixup.py`.
Decreasing priority for now. We'll be focusing on fine-grained incremental checking first.
Is this still a concern? I believe we now have a SQL store for the cache as an option. If it is still a concern, I do like the idea of using `__getstate__`/`__setstate__`.
We now have an option of using sqlite for storing cache information, which is helpful. Deserialization performance is still an issue for large builds, though.
Using `__getstate__`/`__setstate__` can be tricky because we must allow cache files to be loaded incrementally, and cache files can have references to things defined in other cache files. Import cycles can have cyclic references.
Another idea not discussed above is to deserialize cache files lazily. For example, consider a module `m` that (directly or indirectly) depends on 5000 other modules. Currently we will deserialize all 5000 modules if `m` is dirty. However, usually most of the dependencies are not actually required to type check `m`. We could plausibly delay deserializing a module until we actually need some definition from it while type checking. The details are still hazy, but this seems like a big potential win, though it's non-trivial to implement.
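The lazy scheme could look roughly like the following sketch, where a module's cache blob is only parsed on first lookup (hypothetical `LazyModule` wrapper; the real design would need to handle cross-file references and fixups):

```python
import json


class LazyModule:
    """Hypothetical wrapper: hold the raw cache blob, parse on first use."""

    def __init__(self, raw: str) -> None:
        self._raw = raw
        self._data = None  # not yet deserialized

    def lookup(self, name: str):
        if self._data is None:
            # Deserialize on demand, only when a definition is needed.
            self._data = json.loads(self._raw)
        return self._data[name]


# 5000 dependencies loaded, but none parsed up front:
mods = {f"m{i}": LazyModule('{"x": "builtins.int"}') for i in range(5000)}
value = mods["m0"].lookup("x")  # only m0 actually gets deserialized
parsed = sum(m._data is not None for m in mods.values())
```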
Our use of SQL is totally half-assed, though. We have a table with `filename`, `data`, and `mtime` columns, and we store our JSON blobs in it.
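In other words, the SQL store is essentially a key-value table, roughly like this sketch (hypothetical schema matching the columns mentioned above, not mypy's exact DDL):

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# One row per cache file: the JSON blob is stored opaquely as text.
conn.execute(
    "CREATE TABLE cache (filename TEXT PRIMARY KEY, data TEXT, mtime REAL)"
)
blob = json.dumps({"names": ["f", "g"]})
conn.execute(
    "INSERT INTO cache VALUES (?, ?, ?)", ("m.meta.json", blob, time.time())
)
row = conn.execute(
    "SELECT data FROM cache WHERE filename = ?", ("m.meta.json",)
).fetchone()
```

SQLite here only replaces the filesystem as a blob store; the JSON (de)serialization cost is unchanged.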
Hi, is there any way to implement custom serialization/deserialization like pickle? In our case loading caches from disk takes almost as long as generating them anew.
Another option is to write a custom binary serializer, but it's kind of awkward to implement right now. Once mypyc adds better support for encoding and decoding binary data, this should be feasible, and it would be a real-world use case for the relevant, though currently still hypothetical, mypyc functionality. This would have to be designed carefully to avoid complicating the maintenance of mypy serialization code, if we decide to take this route.
Faster cache deserialization could help once we have parallel type checking (#933), at least if parallel workers would use the mypy cache to share data with each other.
Cache size reduction is a related issue (#15731).
A faster, custom binary cache format is a promising way to speed up deserialization. It can be combined with some of the other ideas discussed above.
Constructing objects is likely a big bottleneck. A JSON parser constructs a string object for each dictionary key, and these are discarded afterwards. Also, many of the list and dictionary objects produced during JSON parsing are discarded during deserialization. A custom deserialization format would let us avoid most of this overhead.
I did a small experiment, which suggests that JSON parsing likely accounts for roughly half of the time spent in deserialization during self checking. Another interesting observation is that most of the strings in the data occur at least twice per file.
I will explain my proposal for the binary format next.
Each serialized value can be encoded as an 8-bit type tag followed by type-specific binary data. Values can be either primitives (int, float, str, bool, None, list, or dict) or instances (e.g. `TupleType`).

Each primitive type has a separate encoding. For example, an int would be encoded as a variable-length sequence of bytes: if the high bit of a byte is set, additional bytes follow.
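A minimal sketch of that variable-length int encoding (the standard LEB128-style scheme, restricted here to non-negative ints for simplicity):

```python
def write_int(n: int) -> bytes:
    # Emit 7 payload bits per byte, low bits first; set the high bit
    # when more bytes follow (non-negative ints only in this sketch).
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)  # continuation bit set
        else:
            out.append(b)
            return bytes(out)


def read_int(data: bytes, pos: int) -> "tuple[int, int]":
    # Returns (value, new position).
    result = shift = 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):
            return result, pos
```

Small ints (the common case for line numbers and tags) take a single byte.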
Instance data has a common encoding as (name, value) fields that represent attributes. Each attribute is encoded as a single byte tag representing the name, followed by the value, which uses the general value encoding. The (name, value) fields must appear in a specific order, but some may be omitted, depending on the values of earlier attributes. The field name tags are shared between instance types, so that every instance type with a `name` attribute will use the same tag for `name`.
Since most string values tend to have duplicates in the same cache file, sharing string objects would improve efficiency. Each string would either be encoded as length followed by contents, or as a reference to a previously decoded string. Using references to strings lets us share string objects, which would both speed up decoding and reduce memory use, at the cost of slightly slower encoding.
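The string-sharing idea could be sketched like this (hypothetical `STR_NEW`/`STR_REF` tags, working on a token list rather than raw bytes to keep the example short):

```python
STR_NEW, STR_REF = 0, 1  # hypothetical tag values


def encode_strings(values: "list[str]") -> "list[tuple[int, object]]":
    seen: "dict[str, int]" = {}
    out = []
    for s in values:
        if s in seen:
            out.append((STR_REF, seen[s]))  # back-reference by index
        else:
            seen[s] = len(seen)
            out.append((STR_NEW, s))  # literal, registered for reuse
    return out


def decode_strings(encoded: "list[tuple[int, object]]") -> "list[str]":
    table: "list[str]" = []
    out = []
    for tag, payload in encoded:
        if tag == STR_NEW:
            table.append(payload)
            out.append(payload)
        else:
            # Reuse the previously decoded object: shared, not copied.
            out.append(table[payload])
    return out


vals = ["builtins.int", "x", "builtins.int"]
decoded = decode_strings(encode_strings(vals))
```

Because a back-reference yields the same Python object, duplicates cost one table lookup to decode and share memory afterwards.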
It may also be worth it to have an efficient encoding of shared module name prefixes of fullnames, since many fullnames have long -- but shared -- module name prefixes. This doesn't seem essential, though.
This format has some nice features:
There are some drawbacks and limitations as well:
Module `mypy/serialize/fields.py` would define all the field tags:

```python
# Field tags (must be unique and in range 0..255)
name: Final = 0
line: Final = 1
items: Final = 2
...
```
Module `mypy/serialize/types.py` would define all the type tags:

```python
# Type tags (must be unique and in range 0..255)
instance: Final = 0  # Instance
tuple_type: Final = 1  # TupleType
...
```

Some type tags would be reserved for primitives, such as `str`.
Serialization and deserialization methods could look like this:

```python
from mypy.serialize import fields, types

...

class TupleType:
    ...

    def serialize(self, w: Writer) -> None:
        w.write_type_tag(types.tuple_type)
        w.write_type_list_field(fields.items, self.items)
        w.write_instance_field(fields.partial_fallback, self.partial_fallback)
        w.write_bool_field(fields.implicit, self.implicit)

    @classmethod
    def deserialize(cls, r: Reader) -> TupleType:
        r.read_type_tag(types.tuple_type)
        return TupleType(
            r.read_type_list_field(fields.items),
            r.read_instance_field(fields.partial_fallback),
            implicit=r.read_bool_field(fields.implicit),
        )
```
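To make the `Writer`/`Reader` protocol concrete, here is a minimal sketch of the two classes, covering only the type tag and a bool field (hypothetical API and tag values; the real thing would add ints, strings, lists, and instances):

```python
class Writer:
    def __init__(self) -> None:
        self.buf = bytearray()

    def write_type_tag(self, tag: int) -> None:
        self.buf.append(tag)

    def write_bool_field(self, field: int, value: bool) -> None:
        self.buf.append(field)  # single-byte field name tag
        self.buf.append(1 if value else 0)


class Reader:
    def __init__(self, data: bytes) -> None:
        self.data = data
        self.pos = 0

    def _next(self) -> int:
        b = self.data[self.pos]
        self.pos += 1
        return b

    def read_type_tag(self, expected: int) -> None:
        assert self._next() == expected

    def read_bool_field(self, field: int) -> bool:
        # Fields arrive in a fixed, known order, so we can assert the tag.
        assert self._next() == field
        return self._next() != 0


w = Writer()
w.write_type_tag(1)          # e.g. types.tuple_type
w.write_bool_field(3, True)  # e.g. a hypothetical fields.implicit tag
r = Reader(bytes(w.buf))
r.read_type_tag(1)
```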
Currently, performance with the binary format would probably be better than with JSON when compiled with mypyc, but worse without compilation. With further mypyc improvements, we can probably squeeze out a bunch of additional performance. It's too early to give precise estimates, but I expect that we can make deserialization 2x to 3x faster with a binary format (assuming some new mypyc features).
Deserialization of cache files is a major bottleneck in incremental mode, so any performance improvements there would be valuable.
At least @pkch has previously worked on this, and it's unclear if any significant wins are achievable without a lot of work. However, there seems to be some redundant information in the cache files that we haven't looked at eliminating yet.
Here are some random ideas:

- … `int` and `str`.
- … `"defaults": true` to mark that we can skip processing non-default values altogether.

Idea for how to move forward:
In any case, this issue isn't important enough to spend a lot of time on. We have other plans to speed up mypy, but as those will take more effort, some quick wins here would be nice.