openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Reduce memory footprint of libpostal #627

Closed by walkman-kuan 1 year ago

walkman-kuan commented 1 year ago

Hi!

I was checking out libpostal, and saw something that could be improved.


My country is

Canada


Here's how I'm using libpostal

I plan to create a C++ application that uses libpostal to parse international addresses. The C++ application will be running on a fleet of Linux servers, each with 8GB of memory.


Here's what I did

Nothing yet, but I have two questions:

Q1: Why does libpostal always load the full 1.8GB trained model into memory? Can the model be split into smaller parts that are loaded only when needed? In my case, 1.8GB is roughly 22% of the 8GB of memory on each of our Linux servers. It seems excessive to dedicate over a fifth of a machine's memory to address parsing.
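To make the "load parts when needed" idea in Q1 concrete, here is a minimal C++ sketch of on-demand loading with a cache. It is only an illustration of the general technique, not how libpostal works: `ModelShard`, `LazyShardCache`, and the loader callback are hypothetical names, and it assumes the model could be divided into per-country pieces in the first place.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for one country's slice of a model; not a libpostal type.
struct ModelShard {
    std::string country;
    std::vector<float> weights;
};

// Loads shards on first use and caches them, so resident memory grows only
// with the set of countries that are actually requested.
class LazyShardCache {
public:
    using Loader = std::function<ModelShard(const std::string&)>;

    explicit LazyShardCache(Loader loader) : loader_(std::move(loader)) {
        assert(loader_ && "a loader callback is required");
    }

    // Returns the cached shard for `country`, loading it on the first request.
    const ModelShard& get(const std::string& country) {
        auto it = shards_.find(country);
        if (it == shards_.end()) {
            auto shard = std::make_unique<ModelShard>(loader_(country));
            it = shards_.emplace(country, std::move(shard)).first;
        }
        return *it->second;
    }

    // Number of shards currently held in memory.
    std::size_t loaded_count() const { return shards_.size(); }

private:
    Loader loader_;  // hypothetical: would read one shard from disk
    std::unordered_map<std::string, std::unique_ptr<ModelShard>> shards_;
};
```

A server that only ever sees Canadian and US addresses would end up holding two shards under this scheme. The sketch never evicts anything, so a real deployment would likely also want a size limit on the cache.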

Q2: The "Why C" section of the README states:

Memory-efficiency: libpostal is designed to run in a MapReduce setting where we may be limited to < 1GB of RAM per process depending on the machine configuration. As much as possible libpostal uses contiguous arrays, tries (built on contiguous arrays), bloom filters and compressed sparse matrices to keep memory usage low. It's possible to use libpostal on a mobile device with models trained on a single country or a handful of countries.

  1. Is libpostal really memory-efficient if it always loads a 1.8GB model into memory?
  2. Are there instructions for training the libpostal model on a handful of countries, rather than on all countries in OSM?
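For readers unfamiliar with the "compressed sparse matrices" mentioned in the README passage quoted above, the following is a generic C++ sketch of the compressed sparse row (CSR) layout. It is a textbook illustration only. The names `CsrMatrix`, `from_dense`, and `multiply` are made up for this example and are not taken from libpostal's source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Generic compressed sparse row (CSR) matrix; illustrative, not libpostal's own type.
struct CsrMatrix {
    std::size_t n_rows = 0;
    std::size_t n_cols = 0;
    std::vector<std::size_t> row_start;  // n_rows + 1 offsets into cols/values
    std::vector<std::size_t> cols;       // column index of each stored value
    std::vector<double> values;          // nonzero values, stored row by row
};

// Builds a CSR matrix from a dense row-major matrix, keeping only nonzeros.
CsrMatrix from_dense(const std::vector<std::vector<double>>& dense) {
    CsrMatrix m;
    m.n_rows = dense.size();
    m.n_cols = dense.empty() ? 0 : dense[0].size();
    m.row_start.push_back(0);
    for (const auto& row : dense) {
        assert(row.size() == m.n_cols);  // all rows must have the same width
        for (std::size_t j = 0; j < row.size(); ++j) {
            if (row[j] != 0.0) {
                m.cols.push_back(j);
                m.values.push_back(row[j]);
            }
        }
        m.row_start.push_back(m.values.size());
    }
    return m;
}

// Computes y = M * x, touching only the stored nonzeros.
std::vector<double> multiply(const CsrMatrix& m, const std::vector<double>& x) {
    assert(x.size() == m.n_cols);
    std::vector<double> y(m.n_rows, 0.0);
    for (std::size_t i = 0; i < m.n_rows; ++i) {
        for (std::size_t k = m.row_start[i]; k < m.row_start[i + 1]; ++k) {
            y[i] += m.values[k] * x[m.cols[k]];
        }
    }
    return y;
}
```

Everything lives in three contiguous arrays, and storage scales with the number of nonzeros rather than with rows times columns, which is where the memory saving comes from when most entries are zero.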

Here's what I got

N/A


Here's what I was expecting

N/A


For parsing issues, please answer "yes" or "no" to all that apply.

N/A


Here's what I think could be improved

  1. Avoid loading the entire 1.8GB model into memory.
  2. Provide instructions for training the libpostal model on a handful of countries, rather than on all countries in OSM.

walkman-kuan commented 1 year ago

It turns out that this has already been addressed in "Splitting data files by country and language".