I've been unhappy with that for a long time. There are mainly two reasons why I haven't changed the model format so far.
Of course, dealing with 2 is just a matter of making the tagger recognize the format and handle the model file appropriately. It's just that it hasn't been a top priority for me.
Sure. It has only become an issue for me because I'm working with a project that wants to expose a web service based on your tagger on a platform that uses Kubernetes. I need to apply memory limits to the pod definitions, but for this service I have to make the pod request 4GB even though it only needs 1.7GB after the startup phase.
For this particular use case I've developed a workaround where I transform the model into a gzipped pickle file, which is quite a bit larger than the original gzipped JSON but loads faster and with virtually no additional memory overhead. However, it occurred to me today that it's actually possible to implement a more efficient streaming load of the current model format using `ijson`. I can submit a PR for this if you like?
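To illustrate the idea, here is a minimal sketch of the kind of single-pass load I have in mind. The field names (`features`, `weights`), their order in the file, and the assumption that each weight is a single base85-encoded big-endian double are all invented for the example rather than taken from the actual model format:

```python
import gzip
import struct
from base64 import b85decode

import ijson


def load_model_streaming(path):
    """Single-pass load of a hypothetical layout where feature names and
    base85-encoded weights are stored as separate parallel JSON arrays.
    Only the feature-name list and the growing result dict are held in
    memory; the full JSON tree is never materialised."""
    features = []
    weights = {}
    idx = 0
    with gzip.open(path, "rb") as fh:
        # ijson.parse yields (prefix, event, value) tuples as it reads.
        for prefix, event, value in ijson.parse(fh):
            if prefix == "features.item":
                features.append(value)
            elif prefix == "weights.item":
                # Pair each weight with its feature by position as it
                # streams past (assumes "features" precedes "weights").
                weights[features[idx]] = struct.unpack(">d", b85decode(value))[0]
                idx += 1
    return features, weights
```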
Ah, the `ijson` solution is nice! A PR would be most welcome. The only thing that needs to be taken into account is that this will produce garbage on Python versions <3.7 that have `ijson` installed. I see two possible solutions: either always fall back to the standard parser for these older versions, or use a `collections.OrderedDict` instead of a `dict` if the version is <3.7.
PR submitted. I've made it use the optimised algorithm on CPython 3.6+ or (any) Python 3.7+, which are the versions where dict insertion order is preserved, and fall back to the original algorithm on earlier versions.
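The dispatch boils down to a version check along these lines; this is a sketch of the condition rather than the exact code in the PR:

```python
import platform
import sys


def dict_order_is_reliable():
    """Dicts keep insertion order as a CPython implementation detail from
    3.6 and as a language guarantee from 3.7, so only then is the
    single-pass loader safe; otherwise fall back to the original parser."""
    if sys.version_info >= (3, 7):
        return True
    return (platform.python_implementation() == "CPython"
            and sys.version_info >= (3, 6))
```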
Thank you! I've updated the README and created a new release.
Taking the spoken Italian model as an example, the process of loading the model into memory (`ASPTagger.load`) causes memory usage of the Python process to briefly rise to nearly 4GB. Once the model is loaded, memory usage drops to a more reasonable 1.7GB and remains there in the steady state.

The format used to store models on disk is gzip-compressed JSON, with the weight numbers stored as base85-encoded strings. This format is rather inefficient to load, since we must parse the entire JSON document into memory before the feature names can be paired up with their weights, and copy the `vocabulary` list to turn it into a set.

If the feature name/weight pairs were instead serialized together (either as a `{"feature":"base85-weight",...}` object or as a transposed list-of-2-element-lists) then it would be possible to parse the model file in a single pass in a streaming fashion, eliminating the need to make multiple copies of potentially very large arrays in memory.
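For concreteness, the difference between the two layouts might look something like this (the field names and example values are invented for illustration, not taken from the real model files):

```python
# Current-style layout: feature names and weights live in separate parallel
# arrays, so the whole document has to be parsed before they can be paired.
current_layout = {
    "features": ["bias", "word=casa", "suffix=are"],
    "weights": ["<base85>", "<base85>", "<base85>"],
    "vocabulary": ["casa", "andare"],
}

# Proposed layout: each feature is serialized next to its weight, so a
# streaming parser can emit complete (feature, weight) pairs one at a time.
# A transposed list-of-2-element-lists, e.g. [["bias", "<base85>"], ...],
# would work equally well.
proposed_layout = {
    "weights": {
        "bias": "<base85>",
        "word=casa": "<base85>",
        "suffix=are": "<base85>",
    },
    "vocabulary": ["casa", "andare"],
}
```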