tsproisl / SoMeWeTa

A part-of-speech tagger with support for domain adaptation and external resources.
GNU General Public License v3.0

Alternative model loading logic using ijson #10

Closed: ianroberts closed this 3 years ago

ianroberts commented 3 years ago

Load model files using the ijson streaming parser, if available. This avoids holding multiple copies of large arrays in memory during loading. For a model like the spoken Italian one, it reduces peak memory usage during loading from over 3 GB to around 1.5 GB, little more than the model requires long-term once fully loaded.
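To illustrate the idea, here is a minimal sketch rather than the code in this PR. It assumes a toy file layout of `{"weights": [...]}`, which is not SoMeWeTa's actual model format, and `load_weights` is a hypothetical helper name. With ijson present, elements are streamed one at a time into a compact `array`, so the intermediate Python list of boxed numbers is never built.

```python
import json
import os
import tempfile
from array import array

try:
    import ijson  # optional dependency; streaming is used only if present
except ImportError:
    ijson = None


def load_weights(path, key="weights"):
    """Load the numeric array stored under `key` into a compact array('d').

    Hypothetical helper: the {"weights": [...]} layout is an assumption
    made for this example, not the real model file format.
    """
    if ijson is None:
        # Fallback: parse the whole document, then copy the resulting list.
        # The list and the array coexist in memory for a moment.
        with open(path, "r", encoding="utf-8") as f:
            return array("d", json.load(f)[key])
    # Streaming path: ijson.items yields each element under "<key>.item"
    # as it is parsed, so only the growing array is held in memory.
    with open(path, "rb") as f:
        return array("d", (float(x) for x in ijson.items(f, key + ".item")))


# Build a tiny stand-in model file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"weights": [0.5, -1.25, 3.0]}, f)
    weights = load_weights(path)

print(list(weights))
```

The explicit `float(x)` conversion is there because ijson may hand back `Decimal` values for non-integer numbers, depending on its version and options.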

Note: this algorithm requires that dict iteration order match insertion order. The language specification guarantees this only from Python 3.7 onward; CPython 3.6 already behaves this way as an implementation detail. On older Python versions, or if ijson is not available, the code falls back to the previous loading algorithm.
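The condition described in the note could be checked roughly as follows. This is a sketch under assumptions, not the check used in this PR: `dicts_are_ordered` and `can_stream` are hypothetical names, and the check deliberately trusts only the two cases named above.

```python
import importlib.util
import platform
import sys


def dicts_are_ordered(version=sys.version_info[:2],
                      implementation=platform.python_implementation()):
    """Return True if dict insertion order can be relied on.

    Hypothetical helper: trusts Python >= 3.7 (guaranteed by the language)
    and CPython 3.6 (implementation detail), and nothing else.
    """
    if version >= (3, 7):
        return True
    return version == (3, 6) and implementation == "CPython"


def can_stream():
    """Use the streaming loader only if dicts are ordered and ijson exists."""
    return dicts_are_ordered() and importlib.util.find_spec("ijson") is not None


# Choose the loading strategy once, at import time.
loader = "ijson streaming" if can_stream() else "previous json-based loader"
print(loader)
```

Using `importlib.util.find_spec` tests for ijson without importing it, so the decision can be made before any parser is loaded.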

Closes #9

tsproisl commented 3 years ago

Great, thank you!