openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.

Splitting data files by country and language #132

Closed - rinigus closed this issue 6 years ago

rinigus commented 7 years ago

To be able to use libpostal on mobile, it would be advantageous to split the data files by language and country. Is such a split possible?

albarrentine commented 7 years ago

It is indeed possible, although it would involve training per-language/per-country models on subsets of the address corpus rather than splitting the existing data files. Training a model is relatively fast at present and in theory that should only roughly double the training time.

There have been a few different types of requests regarding single-language models. Some people want the smaller files/lower memory requirements, while others want to control parser performance, e.g. a US-only parser would probably not mistake a 4-digit house number for a postcode, whereas a global model that's seen many international postcodes in the Netherlands, Denmark, Norway, etc. might occasionally make that error.

Deployment is also a consideration. The simplest way would be to publish several versions of the model files to S3, keep libpostal model-agnostic and have a configure option to set the model to be used. So that the number of models doesn't balloon, we'd probably only want to publish new models as use cases come up (some people may want versions with multiple countries e.g. a US/Canada model) rather than trying to guess.

Not sure whether doing this would make the models small enough for mobile. What would be the tolerable limit for libpostal's memory-resident data structures on a modern-ish device with OS + other applications potentially running and presumably some memory for the application's own map tiles etc?

rinigus commented 7 years ago

Training several combinations would not be a problem if there is documentation on how to do it. Frequently, we have to convert OSM-provided data into an app-specific format anyway, so that could be part of the conversion as well. The same goes for model deployment - we face the same problem when distributing the maps.

I am developing a mobile-targeted server that provides map tiles, search, and routing services. It's a thin layer on top of libosmscout and it already works quite well. We work on Sailfish, a Mer-based Linux distribution, so we have a full Linux environment, which could be an advantage in getting libpostal working there.

As for RAM requirements: my server currently uses 350-700MB RSS, depending on tile-generation parameters and such. In general, we have "normal" devices with 2GB RAM. Since only one user is expected, I could also drop the database while applying the libpostal magic and load it again afterwards. So, we may figure out some scheme to reduce memory pressure if it's needed.

I am working on testing the full-blown libpostal on ARM, just from the command line. I have already bumped into one gcc bug while compiling the scanner. Let me get it compiled and tested in full, and maybe we can take up split training after that. I'll report on my progress here and, if there are any bugs along the way, will report them separately as well.

rinigus commented 7 years ago

OK, phase 1 done. I managed to compile it for ARM using the Sailfish SDK. For that, I had to set the following in the environment before the configure step:

ac_cv_func_malloc_0_nonnull=yes ac_cv_func_realloc_0_nonnull=yes

In addition, to get around the gcc bug while compiling the scanner, I had to add -marm as an additional option.

On the device, address_parser uses about 950MB RSS. Once loaded, it's quite snappy.

So, in theory, it's all possible. Now, how can I try to build separate language/country models to see whether that reduces the RAM and disk requirements?

albarrentine commented 7 years ago

I'm currently finishing up a major update to the parser, which is in a branch. A lot of the work that's gone into that is in transforming source addresses from OSM and OpenAddresses into clean training examples for the learning algorithm to use (Python codebase), but I'm also making some improvements to the machine learning model used at training and runtime (C codebase).

Which country/language did you have in mind? I can publish a subset of the new training data to S3 if you want to train it and look at file sizes (for the address parser, the file size on disk should be within a kilobyte or so of memory usage, as almost everything is stored as compact, contiguous arrays).

The one thing that's going to be problematic is the geodb module of libpostal, which is used by the current parser. geodb takes up half of the disk space (~1.2G) and a quarter of the memory (~500M) used by libpostal regardless of the training set used.

The dependency on geodb is in the process of being removed (which also means no Snappy requirement and maybe easier compilation on various platforms) since the new training data can cover the same things, but that will have to wait until the new release is ready. The latest versions I've trained of the global address parser are around 750MB, so it's conceivable that even the global model could be usable on mobile after the dependencies are dropped.

rinigus commented 7 years ago

Looks like we are having a discussion over three issues :)

I thought that we could split the geodb as well, in addition to the language sets. So, I was thinking that I could take your learning scripts and run them over a selected language/country combination myself. That would lead to a smaller geodb and language module and show whether we could actually use it.

Maybe you could guide me on how to proceed with the testing? I would like to establish whether I can use libpostal for advancing search in our application. There are probably two ways of proceeding:

  1. I have to read up on the training and generation of the datasets. Do you have a description of the process somewhere, in terms of which scripts to run and against which data? With this approach I can stay on the released version and move to the new one later.

  2. I move over to the parser-data branch and test that. As far as I understand, that would include the new global address parser and would drop the geodb requirement. All the tests would then be against the branch you care most about, since you will probably use it as master later.

As for country/language: I wonder how large the EU would be?

albarrentine commented 7 years ago

Oh, so the geodb is independent of the training set and has a fixed size. That might be a deal-breaker for testing much before the parser-data branch is merged.

All-EU would encompass a majority of the OSM training data. I suspect that percentage of training data isn't quite proportional to model size (some languages like Finnish, German, English, etc. have more lexical diversity than others and will likely take up more model space) but that assumption might still be roughly reasonable.

There are 317709943 examples in the most recent OSM training data that was built, so that is the denominator to use (there's also a larger OpenAddresses training set now though that's mostly US/Canada). Including the entire training set produces a 750M parser model. If the language size distribution assumption holds, an all-EU model would take up about 64% of the space of the global model so ~480M.

Here are the number of OSM training examples per country for each of the 28 EU states (including the UK for the moment) to do that calculation for different subsets:

58396697    de
41832868    nl
33411161    pl
13897377    cz
9424239 dk
7556843 gb
7515260 fr
5752519 be
5714532 at
5350703 it
3046208 es
2340262 se
2132447 sk
1099762 ro
1093965 ee
1033629 fi
968879  hu
822517  ie
667369  lt
388411  pt
375966  lv
357681  si
344538  hr
300358  gr
263842  bg
193886  lu
37044   cy
10450   mt
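
Summing those rows gives roughly 204.3M examples, which is where the ~64% and ~480M figures above come from:

204,329,413 EU examples / 317,709,943 total examples ≈ 0.64
0.64 × 750 MB ≈ 480 MB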

parser-data is an unstable dev branch and is not compatible with the old model files on S3. Recently, it's been more often used to deploy changes in the Python portion of the codebase from my local machine that need to be pushed to a more resource-heavy research instance to generate the next batch of training data. Sometimes a batch of C library commits I was working on gets pushed with the more immediately-needed Python commits, so the branch may not always compile at a given point in time.

If I publish the OSM training data (about 27G, and that would be subject to ODBL, though the libpostal models are not) or a subset thereof, you can download it, slice by country or language (it's a TSV file where each row is language, country, tagged address), and train the model on that subset using the code in master.

Big disclaimer: I'm only going to be able to help diagnose parser accuracy issues, etc. for people using the official out-of-the-box model(s) from libpostal. I do not by any means claim that libpostal can be trained on arbitrary address data sets with no tuning. Doing so might invalidate model assumptions, result in lower accuracy, and it may require some knowledge of NLP or machine learning to diagnose. It would be unreasonable to expect me to solve everyone's data cleanup/custom training problems for free.

rinigus commented 7 years ago

Thank you very much for detailed explanation!

Re the disclaimer: sure, that's fully understandable. I know that you are mainly interested in the whole-planet case and mobile is not a top priority for you. From my point of view, I am happy to help with your open source project and, as part of that, I prefer to submit issues rather than sit and fix things without committing upstream to your project. I am not a specialist in NLP, but I think I understand your concerns.

Re geodb and parser-data: the big question is whether you have any ETA for merging parser-data into master - or at least, what do you expect at this stage? Since you expect this branch to have much smaller requirements, it would be best to test mobile performance on it. Then we'll see whether I have to get into all this dataset-training business...

rinigus commented 7 years ago

On second thought, after reading a bit of the libpostal code, it would be very valuable if you could publish the OSM training data. I would like to test it by slicing the data and seeing what comes out of it.

I hope you don't mind replying to some questions regarding libpostal training.

As for the license, I presume you mean http://opendatacommons.org/licenses/odbl/summary/ . That is fine, sure.

rinigus commented 7 years ago

With the OSM training data, would you mind also publishing geonames.tsv? Or is that generated from the OSM map data with some script?

albarrentine commented 7 years ago

No worries. Training data publishing is in the works.

That disclaimer is there mostly because once this is on the Internet, people will inevitably try to train their own models on proprietary data and then come here with issues when it doesn't work as well as expected - and that I'm just plain not interested in :-). Subsets of open data sets generated by libpostal are not as much of an issue, although I likely won't have time to build every unique permutation.

Don't have an ETA at the moment for the merge into master. There are a number of improvements related to place name consistency, unit/sub-building information, etc. that were requested by the Pelias team, and those have been fixed in or added to the training data. The models in master were trained on the new and improved data, and that model was published here: https://libpostal.s3.amazonaws.com/mapzen_sample/parser_full.tar.gz. That would be fine to merge except that it causes regressions in other areas, including a few functional test failures, so the answer is basically "as soon as the tests pass" but getting them to pass is a more intricate business than fixing normal bugs. Usually in machine learning systems, we worry more about aggregate performance measures than specific test cases, but since most users of libpostal are not as familiar with machine learning, we define certain test cases that we really don't want model updates to break, e.g. the addresses in the GIF, etc.

rinigus commented 7 years ago

When I was looking through the code, it seemed to me that it's already possible to build the learning datasets on the basis of available open data. It's clear that the result is probably not as good as what you get after tuning, but it's probably better than the search we do now in other projects. For sure, it's a great way to learn libpostal internals and maybe stress-test it as a naïve user would. So, I am planning to regenerate some sets myself and see whether it's of any use. I don't expect any support - I just hope it will be useful for my projects and, hopefully, for libpostal in general.

I will work with master since it's stable.

From my limited reading of the code, it looks like the geodb size can be reduced by using smaller geonames.tsv and postal_codes.tsv files. So, if I use inputs for a single country, I should get the names/codes for that country only, and I expect the database would shrink as well. That should resolve the main issue I have encountered so far and, through country-based splitting, open libpostal to mobile. I'll try to test it if I get time later this week or maybe, due to the approaching holiday season, early next year (I have to finish a few other things first).

albarrentine commented 7 years ago

Ok, so generating the data sets from a raw OSM dump is possible, but it may require a very large machine (like one with ~64GB of memory) and about 7-8 days of compute time for the planet, and there are several indexes that need to be built first. For an all-EU model built from some of the smaller OSM dumps from Geofabrik or the like, maybe cut those numbers in half.

It may be tempting to try forgoing all of the normalization and just mapping the values of OSM addr:* tags to that libpostal/address-formatting schema. That unfortunately is fraught with problems, many of which I've detailed in the libpostal blog post. One of the big ones is that libpostal relies heavily on gazetteers (memorization) for city names, etc. and at training time it needs to have seen basically every city that it's expected to parse at runtime. This includes name variations, so if for instance only München is seen at training time and the user types Munich, the parser almost certainly would get that wrong, whereas a traditional TFIDF-based search system might not. The addr:city tags can be pretty sparse. Around 76% of OSM tags that include addr:street also include addr:city, which is not bad overall, but isn't applied consistently across cities, and non-local names like Munich are usually not included, so it might miss some conspicuous examples, especially in a general routing/search application where you need to be able to find the tiny cities, hamlets, etc. or need it to work for tourists as well as locals.

The other area that doesn't work well using vanilla OSM tags is venue search due to the fact that only 9-10% of OSM venues have any address tags, so the parser won't see many examples of "venue + street address", "venue + city", etc. Usually it will just see venue by itself, and, because the model predicts from left to right and uses its previous predicted tag to inform the next, it will often lump the entire string into the venue name. For OSM we combat this by actually reverse geocoding venues to both admin boundaries and building polygons, which often do have address tags, and adding addresses that way.

Reverse geocoding to the admins is what takes up the majority of the memory and time in building the training set, but it's also one of the largest contributors to the parser's accuracy and robustness. Much of this "hydration" is quite similar to what a geocoder like Nominatim does in the results you see when searching OSM.

There are also dozens of different random normalizations that we apply so that variations of abbreviated/hyphenated city names (Saint Pierre, Saint-Pierre, St.-Pierre, St. Pierre), variations with and without accents (München, Munchen, Muenchen), etc. are seen in the training data.

Because it's fairly resource-intensive and a reasonably heavy cognitive load, I thought it might be better if I just push out the resulting planet training data files when the next batch finishes building and let people slice and dice it from there using grep/awk, etc. It would be great to have more eyes on the training data to spot any systematic issues. The process of training the model on a given segment of the input is just ./src/address_parser_train input_file output_dir, which is substantially easier and can be run in parallel for different models.

Just checked and the geonames/postal_codes TSVs are indeed splittable by country as well. Those can be downloaded from S3 now: geonames.tsv and postal_codes.tsv. License there is CC-BY, same as GeoNames.

The latest batch of training data is still being generated, but I will most likely publish it to S3 sometime next week.

rinigus commented 7 years ago

Thank you for the very detailed explanation, it's of great help!

Your plan is much better - I'll wait for the training datasets then.

I can then already start with the geodb component. When testing on a PC, loading the geodb took the address parser the longest time. So, reducing this part of the data while keeping the full trained model can already be tested on mobile. I'll let you know how that permutation goes.

rinigus commented 7 years ago

I made a quick test: importing EE, DK, SE, FI, and DE leads to 45MB of geodb data. As a result, on mobile (ARM), address_parser RAM usage dropped from 950MB RSS to 450-500MB RSS. And that's with the full model! Performance, as before, is fast and there are no problems parsing addresses with it. So, we are surely getting into a very attractive state.

As you explained earlier, by splitting the model as well, we should expect another major drop in RAM usage. I think this strongly suggests that the current version of libpostal is already mobile-ready; I just have to write scripts for parsing/splitting the data and train country/language-specific models. They will probably not be as good as a tuned one, but I am sure they will be helpful when you are stuck without reasonable network access.

Thank you very much for helping me! I will look into how to access libpostal from our applications and work on libpostal patches to make it compile on our platform. And let me know when the datasets for training are downloadable :)

Edited: corrected RSS to a range 450-500 MB

rinigus commented 7 years ago

@thatdatabaseguy, good morning! Have you had any chance to upload the training data, or is it not ready yet?

Best wishes,

rinigus

albarrentine commented 7 years ago

Hey @rinigus,

A 450-500MB drop is impressive. The global address parser model will almost definitely grow in size when it's trained on all of the new data sets, which are > 2x the size of the original training set, so it might not fit on mobile, but hopefully the overall in-memory size of libpostal's models will stay the same or slightly decrease when geodb is removed.

There's a training set building currently, but I've uploaded files from one of the earlier runs to S3.

  1. OSM training addresses (27GB, ODBL): This is (a much-improved version of) the original data set used to train libpostal.

  2. OSM formatted place names/admins (4GB, ODBL): Helpful for making sure all the place names (cities, suburbs, etc.) in a country are part of the training set for libpostal, even if there are no addresses for that place.

  3. GeoPlanet postal codes with admins (11GB, CC-BY): Contains many postal codes from around the world, including the 1M+ postcodes in the UK, and their associated admins. If training on master, this may or may not help because it still relies pretty heavily on GeoNames for postcodes.

  4. OpenAddresses training addresses (30GB, various licenses): By far the largest data set. It's not every source from OpenAddresses, just the ones that are suitable for ingestion into libpostal. It's heavy on North America but also contains many of the EU countries. Most of the sources only require attribution, some have share-alike clauses. See openaddresses.io for more details.

There are two more data sets that are as yet unpublished, one with OSM street intersections (to help with e.g. Manhattan where some geocoder queries might be like "14th & 5th Ave") and one with plain street names (to help shore up countries where there aren't a lot of addresses in OSM but we have the road network).

In the versions on S3, the lines are shuffled randomly. For online machine learning algorithms like the ones used in libpostal, it's important that the training data be in random order (if not, cycles can form in the model's weight updates). Ideally, random shuffling should be done on each pass through the data, and on Linux the training script will use shuf before each iteration. On Mac, most of the time shuf is not available, so having pre-shuffled data is more-or-less sufficient, though not ideal, and address_parser_train will generate a warning if shuf is not installed. shuf is quite fast because it loads the file into memory, but after the introduction of the OpenAddresses data set, the combined size of the global training data starts to be larger than main memory on most systems, so I've developed a lower-memory method for randomly shuffling files that are larger than memory. That script can be found in this gist. It probably won't be needed for training a smaller model, but will be used to train globally in the next release.

rinigus commented 7 years ago

Thank you very much! Fortunately, I am using Linux, so all the programs should be OK (shuf is installed). So, if I understand correctly, to train the address parser, I will have to:

  1. from the OSM training addresses, OSM formatted place names/admins, and OpenAddresses training addresses, take out the data for the countries I am interested in. The country is given in the second column (see the filter sketch below).

  2. cat the resulting datasets together into a single file (AddrTrain.tsv, for example).

  3. run address_parser_train filename output_dir (address_parser_train AddrTrain.tsv output)

  4. copy the resulting trained parser files into /usr/local/share/libpostal/address_parser

I hope that this sequence is OK. Let's see if I will run into any problems with the cut datasets.
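
For step 1, I am thinking of a minimal filter along these lines (just a sketch, equivalent to a grep/awk one-liner, assuming the language/country/tagged-address column layout described above):

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Keep only rows whose second tab-separated column matches the given country
 * code, e.g.: ./country_filter ee < formatted_addresses_tagged.random.tsv > ee.tsv */
int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s country_code < in.tsv > out.tsv\n", argv[0]);
        return 1;
    }
    const char *country = argv[1];
    char *line = NULL;
    size_t cap = 0;
    ssize_t len;

    while ((len = getline(&line, &cap, stdin)) != -1) {
        /* Columns are: language <TAB> country <TAB> tagged address */
        char *first_tab = strchr(line, '\t');
        char *second_tab = first_tab ? strchr(first_tab + 1, '\t') : NULL;
        if (second_tab == NULL) continue;

        size_t n = (size_t)(second_tab - (first_tab + 1));
        if (n == strlen(country) && strncmp(first_tab + 1, country, n) == 0) {
            fwrite(line, 1, (size_t)len, stdout);
        }
    }
    free(line);
    return 0;
}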

Now, I can see that there is also a language_classifier dataset. How do I train that? It also seems to require some data. Or are those files missing?

The download will take some time, but I hope to update you soon on how it goes. Thank you again!

rinigus commented 7 years ago

Looks like I stumbled on a small bug:

  1. address_parser_train.c was missing transliteration_module_setup. I added the corresponding block from libpostal.c after the geodb module setup.

  2. at the line address_parser_t *parser = address_parser_init(filename), the program crashes due to

Starting program: /home/rinigus/code/libpostal/src/address_parser_train ../../postal-data/addresses/estonia/address_train.tsv ../../postal-data/addresses/estonia/output
INFO  address dictionary module loaded
   at main (address_parser_train.c:451) 
INFO  geodb module loaded
   at main (address_parser_train.c:458) 
INFO  Loading transliteration module
   at main (address_parser_train.c:460) 
INFO  transliteration module loaded
   at main (address_parser_train.c:467) 
*** Error in `/home/rinigus/code/libpostal/src/address_parser_train': double free or corruption (fasttop): 0x0000000001cbe240 ***

The backtrace shows:

backtrace
#0  0x00007ffff7532418 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff753401a in __GI_abort () at abort.c:89
#2  0x00007ffff757472a in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7ffff768d6b0 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007ffff757cf4a in malloc_printerr (ar_ptr=<optimized out>, ptr=<optimized out>, str=0x7ffff768d778 "double free or corruption (fasttop)", action=3) at malloc.c:5007
#4  _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3868
#5  0x00007ffff7580abc in __GI___libc_free (mem=<optimized out>) at malloc.c:2969
#6  0x00000000004330ab in tokenized_string_destroy ()
#7  0x0000000000402a0d in address_parser_init (filename=filename@entry=0x7fffffffe771 "../../postal-data/addresses/estonia/address_train.tsv") at address_parser_train.c:235
#8  0x00000000004015ea in main (argc=<optimized out>, argv=<optimized out>) at address_parser_train.c:469

Is this a known problem?

albarrentine commented 7 years ago

The language classifier is required only for normalization; if you're not using expand_address, it's not required at all. If the language is always known at runtime (you can guarantee that expand_address will never be passed NULL), then its setup functions can be removed without a problem.
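
For example (a sketch shown with the current libpostal_-prefixed names - adjust to the API version you're on): passing the language(s) explicitly in the expansion options means the classifier is never consulted, so its setup can be skipped entirely:

#include <stdio.h>
#include <libpostal/libpostal.h>

int main(void) {
    /* libpostal_setup() loads the dictionaries/transliteration data needed
       for expansion; no language classifier setup is done here. */
    if (!libpostal_setup()) return 1;

    libpostal_normalize_options_t options = libpostal_get_default_options();
    char *languages[] = {"et"};      /* example: we know the input is Estonian */
    options.languages = languages;
    options.num_languages = 1;

    size_t num_expansions = 0;
    char **expansions = libpostal_expand_address("Riia mnt 24", options, &num_expansions);
    for (size_t i = 0; i < num_expansions; i++) {
        printf("%s\n", expansions[i]);
    }

    libpostal_expansion_array_destroy(expansions, num_expansions);
    libpostal_teardown();
    return 0;
}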

I also don't have recent copies of the data for the language classifier at the moment, so that will have to wait a bit if needed. Try to get the parser working first. I'll be testing out a sparser model for language classification after I'm finished with the parser updates.

rinigus commented 7 years ago

OK, no problem. As far as I remember, by specifying the language I could get a much faster loading time. I presume that would also limit RAM usage. So, let's follow your advice and focus on addresses. But there I am hit by the bug...

The training program crashes on the first record. I used formatted_places_tagged.random.tsv and the start of the file is

en      ee      Tartu/city linn/city |/FSEP Estonia/country
et      ee      Soomevere/city küla/city |/FSEP Eesti/country
et      ee      Tsirksi-küla/city |/FSEP Eesti/country
en      ee      Sutlema-küla/city |/FSEP Estonia/country
et      ee      Hurda/city |/FSEP Eesti/country

albarrentine commented 7 years ago

Hm, so transliteration_module_setup definitely needs to be in there. The version I used to train the latest batch of models may include a commit or two from parser-data that never got merged in. I will find the commits and add them to master.

albarrentine commented 7 years ago

Ok, with a little git cherry-pick fu, I was able to merge the relevant commits from parser-data into master with passing tests. Parser training works with no valgrind errors and saves the model, which can then be loaded and used in the client. Pull the latest from master and you should be all set.

Note: I'm assuming you're not doing training in a mobile-like environment, but that would probably not be a good idea as it takes more memory to train than to use the resulting model. You can train on your laptop or whatever and just use the resulting file. The model files are stored in a platform-independent way so it doesn't matter if it was trained on a different flavor of Linux, etc.

rinigus commented 7 years ago

Thank you very much, I'll try it later today on our server. No, no - I don't do training or dataset generation on mobile; I use a PC or a server for that :). And indeed, it's no problem to drop the generated files onto mobile and get them up and working on ARM. As far as I remember, MARISA cared about endianness, but my mobile is little-endian, like a PC.

albarrentine commented 7 years ago

Awesome.

In fact that also wouldn't be a problem for libpostal as the file formats are endian-agnostic.

rinigus commented 7 years ago

It took a bit of time and I will test it further, but let me report the first results.

I used EE, DK, SE, FI, and DE for the geodb and the address_parser. The address_parser was generated from:

formatted_addresses_tagged.random.tsv 
openaddresses_formatted_addresses_tagged.random.tsv 
formatted_places_tagged.random.tsv

For these countries, the address_parser RSS is 290MB on mobile. I will test later what Germany's contribution is separately, to see if I can get it lower. A country like Estonia brings RSS down to 61MB, so I guess that's the minimum that can be achieved.

I wonder whether I can reduce the parser by reducing the training set. It would probably not be as good, but it might still give a significant improvement compared to what we have.

I'll continue with the tests and let you know how it's going.

rinigus commented 7 years ago

Just an extra note: combining all these sources, the DE training set has 63155258 records, which is larger than the count in your list above, as far as I can see.

albarrentine commented 7 years ago

The above was using just the OSM training set, so most of those numbers will increase on the combined set.

Yes, that particular set contains 1) some of the more well-represented countries in OSM like Germany and Denmark, so there's a lot of training data, and 2) morphologically-rich languages like Finnish and Germanic languages with lots of compound nouns.

As such it wouldn't be surprising if a plurality of the parameter space went to handling words from those five languages.

I trained a Russia-only parser yesterday to test something, and it was also ~60M. That might indeed be the minimum for a well-represented, sufficiently large country (obviously if it's for Tuvalu I'm sure the model would only be like 1MB because there are only a few roads and a handful of buildings).

rinigus commented 7 years ago

Just a short note: I am working on incorporating libpostal into the geocoding part of the package that we use for offline maps. It's quite non-trivial and will take some time. I'd like to make a prototype with libpostal working as part of the geocoding. Or maybe you know of a geocoder that I could use on mobile which would be easy to extend with libpostal?

Anyway, just to let you know that the work is underway and it will probably take some weeks before I can get something working well. Thank you for your help!

albarrentine commented 7 years ago

Mobile geocoding is not really an area I know much about, but seems like some kind of on-disk search index would be needed to store and query the data. This could be an on-disk hashtable like LevelDB, a full-text index like Lucy, a database like SQLite, or some custom on-disk index.

As far as existing offline geocoders, one of the better-known ones is called MAPS.ME. It's open-source and in C++, although I don't know if it would necessarily be easy to plug libpostal into the otherwise rule-based system. Still, might offer some inspiration. Their geocoder is here: https://github.com/mapsme/omim/tree/master/search

rinigus commented 7 years ago

Thanks! OK, I am aware of MAPS.ME, but I'd better stick to the library that I use and maybe, with its help, roll out an SQLite database for experimenting when coupled with libpostal.

rinigus commented 7 years ago

Hi! I am making slow progress on implementing a geocoder based on libpostal. While there is still some way to go, maybe we could continue the discussion on reducing the RAM footprint of libpostal. Namely, I wonder whether I could use mmap instead of copying data from storage into RAM when using libpostal on mobile devices?

Since I plan to normalize all addresses while composing the SQLite database on a PC, on mobile libpostal would only have to normalize and parse the addresses typed in by the user. So, some performance hit should not be a problem.

So, what do you think - is it possible to mmap the datasets that use most of the RAM in libpostal? I am looking for datasets that are read-only and have the same form on disk as in RAM (not compressed on storage, for example).

Best wishes,

rinigus

albarrentine commented 7 years ago

mmap is not going to happen. The data files aren't simply copied into memory. There are many data structures built on those files, and since the on-disk formats are endian-agnostic and store doubles as unsigned 64-bit ints, etc., simply accessing raw bytes with mmap would mean rewriting every function that thinks it can access an array and get a double, which is much of the runtime code base. Plus, having two separate versions of every data structure for mmap and in-memory respectively would, in practice, mean that I have to maintain both when making any changes to the models, which I definitely don't want to do. There have also been some people interested in getting libpostal to build on Windows, and mmap makes life much more difficult on that platform. In the next release we drop geodb, and thus sparkey, and thus its dependence on mmap, so libpostal could potentially be easier to build on Windows (though it's not a priority).
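
To illustrate the general point (a rough sketch, not libpostal's actual file code): values stored as portable integers have to be decoded into native doubles on load, which is exactly the step a raw mmap would skip:

/* Rough illustration only - not libpostal's I/O code. A stored double is a
 * portable 64-bit integer in a fixed byte order, so it has to be decoded on
 * load; a raw mmap of the file bytes would not behave like a usable
 * in-memory array of doubles. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Decode a little-endian 64-bit value from raw file bytes, then reinterpret
 * the bits as an IEEE-754 double. */
static double decode_double_le(const unsigned char *buf) {
    uint64_t u = 0;
    for (int i = 7; i >= 0; i--) {
        u = (u << 8) | buf[i];
    }
    double d;
    memcpy(&d, &u, sizeof(d));
    return d;
}

int main(void) {
    /* The bytes of 1.5 in little-endian order, as they might sit on disk. */
    unsigned char bytes[8] = {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xf8, 0x3f};
    printf("%f\n", decode_double_le(bytes)); /* prints 1.500000 */
    return 0;
}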

There's not a ton of savings to be realized from data structures. The main components of the parser are a double-array trie and a sparse matrix stored in compressed sparse row (CSR) format, which are about as efficient as it gets.

So, I'm sorry, but there's not really much more that can be done in the immediate term beyond training on smaller data. I'm experimenting with some of the modeling choices in libpostal at the moment, and I'll try to take your use case into account, but the main goal right now is accuracy rather than model size.

rinigus commented 7 years ago

No problem! I still think that I can make it work for many devices with the current footprint. It's just that I'll have to unload the application's other databases while normalizing and parsing an address, then unload libpostal and load all the other databases again.

I don't think I mentioned it here: when I use Germany separately, RSS is about 240 MB on mobile. So, that's quite usable already.

rinigus commented 7 years ago

To support multiple subfolders with the data for the corresponding countries, I would like to add an argument to libpostal_setup and similar functions allowing the directory with the libpostal datasets to be specified. How would you prefer this to be handled? If I know the approach you consider proper, I can implement it and submit it as a pull request.

albarrentine commented 7 years ago

If you're implementing in C/C++ that should already be possible using this method (see address_parser.h):

bool address_parser_module_setup(char *dir);

You're welcome to add a version of that to the public header file if needed; just add a new function like libpostal_setup_address_parser_dir(char *dir) rather than changing the existing one, or it'll break all the bindings, some of which are written externally.
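
Something along these lines is all it would take (a sketch only - the function name is just the suggestion above, not an existing API):

#include <stdbool.h>

/* Existing internal setup, declared in address_parser.h; in libpostal.c you
 * would include that header rather than re-declaring it. */
bool address_parser_module_setup(char *dir);

/* New public wrapper: load the parser model from an explicit data directory
 * (e.g. a per-country subfolder). Passing NULL presumably keeps the default
 * path, mirroring libpostal_setup_parser(). */
bool libpostal_setup_address_parser_dir(char *dir) {
    return address_parser_module_setup(dir);
}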

We use a global variable for the parser as the primary use case is calling libpostal from higher-level languages rather than from C, so the parser gets loaded at "import" time and then the binding languages can call a simple function without having to wrap the C pointers in their own objects, etc. Perhaps a better way to do that would be to have libpostal_setup return a struct whose members are the various model pointers and not muck around with global variables, but with the current implementation, only one parser can be loaded at a time.

Is this something you want to switch out at query time or, say, when the user switches country?

rinigus commented 7 years ago

Excellent! I do work in C++, so it's no problem to add _dir versions. I presume that the following would be needed:

libpostal_setup_dir
libpostal_setup_language_classifier
libpostal_setup_parser

This would allow me to either load models one by one at query time or let the user specify the country of interest. The teardown functions are easier in this sense - no argument is needed :)

rinigus commented 7 years ago

I can see that you have already implemented these functions, thank you!

I wrote a small script ( https://github.com/rinigus/geocoder-nlp/blob/master/scripts/postal/build_country_db.sh ) for generating country-specific datasets. I would still have to test it a bit, but it should be ready soon.

Are you interested in getting it into libpostal? If so, where should I put it in the tree?

albarrentine commented 7 years ago

Yah, it's available for C/C++ and Java users in any case.

I can refer people to that script if they ask, but as I've said before, I don't want to directly encourage people to train their own models unless they have experience with NLP. Doing so would be to imply that the model will work on any subset of the data and that all you have to do is run the script. This would be consistent with the popular discourse on machine learning (i.e. that it's a black box where you "just add data" and it will learn something useful) but not with reality.

rinigus commented 7 years ago

That's correct - you did mention it as well!

rinigus commented 7 years ago

I wrote a small geocoder that uses libpostal for normalization and parsing. I incorporated it into an offline server that can be used on either Sailfish OS or Linux. With the country-based splitting, I can now use libpostal on mobile without any problems. Thank you very much for your help!

Now I am preparing to release the new version of the offline maps server to the users. Among the datasets required to work with libpostal, there are files that were not split by country:

address_expansions/  language_classifier/  numex/  transliteration/

How am I supposed to distribute these files? Would it be better if I distribute them myself, or can I point users to your S3-based distribution? I expect the number of users to be about 300 at the beginning.

albarrentine commented 7 years ago

Hooray! You're welcome to distribute the files however you'd prefer. S3 is what I use as it's cheap and easy. The only thing I'd ask is that there be a disclaimer that the accuracies on per-country models are not guaranteed to be on par with the global one, and that if your users report parser accuracy issues/errors, they should only be reported to this repo if the issue can be reproduced with the global model, as that's what I'll be using to test/debug.

rinigus commented 7 years ago

Sure! Disclaimer added (README, About and Geocoder dialogs in the application): https://github.com/rinigus/osmscout-server/commit/765254856bfd1190e88bf6c1d88714924ebecc46 .

rinigus commented 7 years ago

I can see that you merged the parser branch into master. Congratulations - I'm looking forward to testing your work with my geocoder on mobile! I will surely take it into my plans and follow what you do with great interest. It will probably take 1-2 months before I can start integrating the new libpostal into geocoder-nlp, due to other parts of the code that need improvement, but it will surely be done.

In your README, you mention that "It's conceivable that libpostal could even be used on a mobile device...". You could safely replace that with "It's possible to use libpostal on a mobile device" - at least, it's used on Sailfish.

You mentioned earlier that in this new libpostal I would have to expand addresses after normalization, right? So far (with the released libpostal version), I was running expand_address first and then normalizing. After that, the normalized strings were searched for in the geocoder database to find the matches. Now, is it

normalize
for n in normalized: expand_address

?

Meanwhile, together with the users, we have set up a server where we distribute country-based libpostal datasets together with geocoder data as part of the data package. So far, it looks like all users are quite content (including myself as a user). The next version will probably be even better :)

Thank you again for this great work!

albarrentine commented 7 years ago

Indeed! Added that note to the README.

The order of operations should be parse_address, then expand_address on each component, using different options. So the default expansion options use:

.address_components = LIBPOSTAL_ADDRESS_NAME | LIBPOSTAL_ADDRESS_HOUSE_NUMBER | LIBPOSTAL_ADDRESS_STREET | LIBPOSTAL_ADDRESS_PO_BOX | LIBPOSTAL_ADDRESS_UNIT | LIBPOSTAL_ADDRESS_LEVEL | LIBPOSTAL_ADDRESS_ENTRANCE | LIBPOSTAL_ADDRESS_STAIRCASE | LIBPOSTAL_ADDRESS_POSTAL_CODE

But, for instance, if you already know a component is a house number, there's no point in expanding something like "S" to "South" or "San" or whatever, so you can call the expansion with only .address_components = LIBPOSTAL_ADDRESS_HOUSE_NUMBER set. This way there will be fewer permutations of the normalized strings, because in most languages the bulk of the abbreviations are on the street component.

The above constants are all defined in libpostal.h.
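
Putting that together, the flow looks roughly like this (a sketch using the 1.0-style libpostal_ names; labels and components come back from the parser as parallel arrays):

#include <stdio.h>
#include <string.h>
#include <libpostal/libpostal.h>

int main(void) {
    if (!libpostal_setup() || !libpostal_setup_parser() || !libpostal_setup_language_classifier()) {
        return 1;
    }

    libpostal_address_parser_options_t parse_opts = libpostal_get_address_parser_default_options();
    libpostal_address_parser_response_t *parsed =
        libpostal_parse_address("Friedrichstr. 123 10117 Berlin", parse_opts);

    libpostal_normalize_options_t expand_opts = libpostal_get_default_options();

    for (size_t i = 0; i < parsed->num_components; i++) {
        /* Narrow the expansion to the component type the parser assigned, e.g.
           house numbers don't need street-style abbreviation expansion. */
        if (strcmp(parsed->labels[i], "house_number") == 0) {
            expand_opts.address_components = LIBPOSTAL_ADDRESS_HOUSE_NUMBER;
        } else if (strcmp(parsed->labels[i], "road") == 0) {
            expand_opts.address_components = LIBPOSTAL_ADDRESS_STREET;
        } else if (strcmp(parsed->labels[i], "postcode") == 0) {
            expand_opts.address_components = LIBPOSTAL_ADDRESS_POSTAL_CODE;
        } else {
            expand_opts.address_components = LIBPOSTAL_ADDRESS_NAME;
        }

        size_t n = 0;
        char **expansions = libpostal_expand_address(parsed->components[i], expand_opts, &n);
        for (size_t j = 0; j < n; j++) {
            printf("%s: %s\n", parsed->labels[i], expansions[j]);
        }
        libpostal_expansion_array_destroy(expansions, n);
    }

    libpostal_address_parser_response_destroy(parsed);
    libpostal_teardown_language_classifier();
    libpostal_teardown_parser();
    libpostal_teardown();
    return 0;
}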

You'll probably be interested to know that the geodb has been eliminated altogether in 1.0, so there's no mmap or Snappy dependency. I've also sparsified the language classifier, so it's now down to ~75MB (just finished training; I uploaded it to S3 earlier today, so you may need to re-run make to pick it up). The only large model is the address parser (the global one is 1.8GB), and that's mostly because it's trained on 1 billion examples.

Training should work roughly the same, but in address_parser_train there are two command-line arguments to be cognizant of:

Added a section to the README on accessing the new training data. There's a file stored on S3 which points to the latest version of the training data. The files are also now gzip'd so they're a little lighter on the bandwidth.

Also note that for C/C++ there are a few API changes. Since we were bumping it to 1.0, I tried to be a good citizen in our public headers and added "libpostal_" prefixes to everything. That's about it. Function calls, args, etc. are all the same.

rinigus commented 7 years ago

Thank you for the detailed guidelines! They are of great help!

rinigus commented 6 years ago

As the data splitting works, I suggest closing the issue. If I don't hear any objections within a week, I'll try to remember to do that.

thehappycoder commented 2 years ago

How do I train it on data for a single country?