tsproisl / SoMeWeTa

A part-of-speech tagger with support for domain adaptation and external resources.
GNU General Public License v3.0

Model loading is very memory hungry #9

Closed: ianroberts closed this issue 3 years ago

ianroberts commented 3 years ago

Taking the spoken Italian model as an example, the process of loading the model into memory (ASPTagger.load) causes memory usage of the Python process to briefly rise to nearly 4GB. Once the model is loaded, memory usage drops to a more reasonable 1.7GB and remains there in the steady state.

The format used to store models on disk is gzip-compressed JSON, with the weight numbers stored as base85-encoded strings. This format is rather inefficient to load, since we must parse the entire JSON document into memory, with the feature names and the base85-encoded weights held in separate, very large intermediate structures, before combining them into the in-memory weight dictionary, which briefly keeps multiple copies of the data alive at once.

If the feature name/weight pairs were instead serialized together (either as a {"feature":"base85-weight",...} object or as a transposed list-of-2-element-lists) then it would be possible to parse the model file in a single pass in a streaming fashion, eliminating the need to make multiple copies of potentially very large arrays in memory.
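Roughly, a sketch of a writer for the first proposed layout. The helper name save_paired and the struct packing of a double are made-up details for illustration; this is not SoMeWeTa's actual serialization code:

```python
import base64
import gzip
import json
import struct

def save_paired(weights, path):
    # Hypothetical writer for the proposed layout: a single JSON object that
    # maps each feature name directly to its base85-encoded weight, so that a
    # reader can consume feature/weight pairs one at a time.
    paired = {feature: base64.b85encode(struct.pack("<d", w)).decode("ascii")
              for feature, w in weights.items()}
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(paired, f)
```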

tsproisl commented 3 years ago

I've been unhappy with that for a long time. There are two main reasons why I haven't changed the model format so far:

  1. While it is annoying, it does not seem to be a problem in practice, as most people seem to have enough RAM (or do not complain if they don't).
  2. I don't want to render any existing models useless.

Of course, dealing with point 2 is just a matter of making the tagger recognize the format and handle the model file appropriately. It's just that it hasn't been a top priority for me.

ianroberts commented 3 years ago

Sure. It has only become an issue for me because I'm working on a project that wants to expose a web service based on your tagger on a platform that uses Kubernetes. I need to apply memory limits to the pod definitions, but for this service I have to make the pod request 4GB even though it only needs 1.7GB after the startup phase.

For this particular use case I've developed a workaround where I transform the model into a gzipped pickle file, which is quite a bit larger than the original gzipped JSON but loads faster and with virtually no additional memory overhead. However, it occurred to me today that it's actually possible to implement a more efficient streaming load of the current model format using ijson; I can submit a PR for this if you like?
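Sketched out, the ijson pattern would look something like this (using the simplified paired layout from the sketch above rather than the actual current format, so not the code I'd put in the PR):

```python
import base64
import gzip
import struct

import ijson

def load_paired_streaming(path):
    # Stream the key/value pairs of the (hypothetical) paired layout without
    # ever holding the fully parsed JSON document in memory.
    weights = {}
    with gzip.open(path, "rb") as f:
        for feature, encoded in ijson.kvitems(f, ""):
            weights[feature] = struct.unpack("<d", base64.b85decode(encoded))[0]
    return weights
```

ijson.kvitems iterates over the members of the object at the given prefix one pair at a time, so peak memory stays close to the size of the final weight dict.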

tsproisl commented 3 years ago

Ah, the ijson solution is nice! A PR would be most welcome. The only thing that needs to be taken into account is that this will produce garbage on Python versions < 3.7 that have ijson installed. I see two possible solutions: either always fall back to the standard parser on these older versions, or use a collections.OrderedDict instead of a plain dict if the version is < 3.7.

ianroberts commented 3 years ago

PR submitted. I've made it use the optimised algorithm on CPython 3.6+ and on any Python 3.7+ (the versions where dict insertion order is guaranteed), and fall back to the original algorithm on earlier versions.
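The check is roughly of this shape (a simplified sketch, not the exact code in the PR):

```python
import platform
import sys

def dict_preserves_insertion_order():
    # Insertion-ordered dicts are an implementation detail of CPython 3.6
    # and a language guarantee from Python 3.7 onwards; only then can the
    # streaming load rely on plain dicts keeping their order.
    if sys.version_info >= (3, 7):
        return True
    return (platform.python_implementation() == "CPython"
            and sys.version_info >= (3, 6))
```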

tsproisl commented 3 years ago

Thank you! I've updated the README and created a new release.