openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Simplify process for contributing languages/abbreviations #15

Closed: albarrentine closed this issue 8 years ago

albarrentine commented 8 years ago

Adding new abbreviations to libpostal involves 4 steps:

  1. Edit a text file in dictionaries
  2. Run python scripts/geodata/address_expansions/address_dictionaries.py to generate the C data file address_expansion_data.c (new version should be checked in)
  3. After compiling libpostal with make, run ./src/build_address_dictionary to build the fast trie data structure used at run-time
  4. Run libpostal_data (e.g. libpostal_data upload base $YOUR_DATA_DIR/libpostal) to upload files to S3 (read access to the libpostal buckets is public, write access is not)
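Step 2 above boils down to turning pipe-delimited dictionary lines into a generated C data file. As a rough sketch of that transformation: libpostal's dictionary files use one equivalence class per line, with the phrases separated by `|`. The real generator is scripts/geodata/address_expansions/address_dictionaries.py; the exact C output layout below is illustrative, not libpostal's actual format, and the sample entries are assumed.

```python
# Simplified sketch of step 2: parse pipe-delimited dictionary lines and
# emit a C string-array literal. The actual generator
# (address_dictionaries.py) produces address_expansion_data.c; this is a
# toy version of the same idea, with a made-up output layout.

def parse_dictionary_line(line):
    """Each line is one equivalence class: a canonical form followed by
    its abbreviations, separated by '|'."""
    return [phrase.strip() for phrase in line.strip().split('|') if phrase.strip()]

def to_c_array(name, lines):
    """Emit a C string-array literal holding the parsed lines."""
    rows = []
    for line in lines:
        phrases = parse_dictionary_line(line)
        if phrases:
            rows.append('    "' + '|'.join(phrases) + '",')
    return 'static const char *%s[] = {\n%s\n};' % (name, '\n'.join(rows))

# Hypothetical dictionary contents, in the style of dictionaries/en/street_types.txt
dictionary = ["street|st|str\n", "avenue|ave\n"]
print(to_c_array("en_street_types", dictionary))
```

Steps 3 and 4 then build the run-time trie from the generated data and upload it, so only the text files need hand-editing.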

Ideally, contributors should only have to think about step 1; the others should run automatically as part of the build, assuming tests pass, etc.

Akuukis commented 8 years ago

What is the best strategy to start populating the language file? Will going through dictionaries/en/*.txt and translating the noteworthy words do, or am I expected to double-check whether each tag exists in OSM, or something else?

albarrentine commented 8 years ago

That should work well. The file structure in dictionaries/en/*.txt is probably the most complete, so you can copy that structure and translate any word/phrase that seems significant. Which language(s) do you speak? I can give you Top-N tokens from OSM as well if that's helpful.
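To make the effect of a translated line concrete, here is a hedged sketch of how the pipe-delimited format behaves at run time: every phrase on a line is treated as equivalent, so an abbreviation can be expanded back to the canonical (first) form. The Russian entries below ("улица"/"ул", i.e. street/st, and common abbreviations of "проспект", avenue) are the kind of lines a translator might add; libpostal's actual lookup uses a compiled trie, not a Python dict.

```python
# Toy model of dictionary expansion: phrases on a pipe-delimited line are
# equivalent, and abbreviations map back to the canonical first phrase.
# The Russian sample lines are illustrative translator-contributed entries,
# not copied from the repo; the real run-time structure is a trie.

def build_expansions(lines):
    expansions = {}
    for line in lines:
        phrases = [p.strip() for p in line.strip().split('|') if p.strip()]
        if not phrases:
            continue
        canonical = phrases[0]
        for phrase in phrases:
            expansions.setdefault(phrase, set()).add(canonical)
    return expansions

# Hypothetical dictionaries/ru/street_types.txt contents
ru_street_types = ["улица|ул", "проспект|просп|пр-т"]
expansions = build_expansions(ru_street_types)
print(expansions["ул"])    # expands to the canonical form "улица"
print(expansions["пр-т"])  # expands to "проспект"
```

This is why translating the en/*.txt structure is usually enough: each translated line both defines the canonical phrase and registers its abbreviations.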

Akuukis commented 8 years ago

Thank you, I will proceed like that then. Yes, Top-N tokens would be very useful! I speak Latvian (LV) and know several people I could ask to help with Russian (RU) translations.

albarrentine commented 8 years ago

On the general issue, the process of contributing abbreviations and numeric expressions is now significantly easier. Updating the dictionaries used in libpostal is now just a matter of editing text files. On a pull request, Travis CI will build the necessary data files and run tests. When the changes are merged into upstream master, the new data files will be committed and the runtime versions will be pushed to S3.

@Akuukis that would be excellent! I've just created a new repo with the n-gram counts (frequent phrases of 1-5 words found in street names and venue names in OSM) for every language as browsable TSV files. Both Latvian and Russian are included, although we could use better coverage in libpostal.