openvenues / libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
MIT License

Willing to help with an automated training pipeline #163

Open omnifroodle opened 7 years ago

omnifroodle commented 7 years ago

@thatdatabaseguy I believe you mentioned wanting to set up a regular training set build and training run process on AWS. Is that something you are still interested in, and could I offer my time to help with the setup?

albarrentine commented 7 years ago

Hey Matt - thanks for supporting the project. There's a 1.0 release coming up, so I'll look into automated training after that's merged (it wouldn't make sense to automate master when everything's about to change). I'm documenting all the pipeline steps in a script which will be published along with the release, and at that point it should be relatively easy to make it a periodic job (ideally triggered by something like AWS Lambda scheduled events so there's no long-running server).

Training set construction takes long enough that it's probably worth converting it to a series of Elastic MapReduce jobs that can be run in parallel. There are dependencies between the steps, though, and the point-in-polygon tests are quite memory-hungry, so the work would have to be sharded in a way that keeps memory consumption per node low. The simplest way to implement that would be to load only the less-expensive R-tree in the mappers, send all the candidates for a given polygon id to the same reducer, and rely on the sort mechanism so that effectively only one polygon needs to be loaded into memory at a time in the reducers. The parser training procedure itself is inherently a serial algorithm, so it wouldn't benefit as much from parallelization, and in any case it's quite fast.
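To make the mapper/reducer split above concrete, here is a minimal single-process sketch in plain Python. It is not libpostal's pipeline code: the toy polygons, the bounding-box dictionary standing in for the R-tree, the in-memory sort standing in for the MapReduce shuffle, and every function name (`map_candidates`, `reduce_sorted`, `load_polygon`, `run`) are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical toy data: polygon id -> list of (x, y) vertices.
POLYGONS = {
    "square": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "triangle": [(3, 3), (9, 3), (6, 9)],
}

def bounding_box(vertices):
    xs, ys = zip(*vertices)
    return min(xs), min(ys), max(xs), max(ys)

# Stand-in for the cheap spatial index (an R-tree in the discussion above):
# only bounding boxes are kept by the mappers, not full geometries.
BBOX_INDEX = {pid: bounding_box(v) for pid, v in POLYGONS.items()}

LOADS = []  # records which polygons the reducer loaded, in order

def load_polygon(pid):
    """Assumed loader; a real job would read one geometry from storage."""
    LOADS.append(pid)
    return POLYGONS[pid]

def map_candidates(point_id, point):
    """Mapper: emit (polygon id, point) for every bounding box hit."""
    x, y = point
    for pid, (minx, miny, maxx, maxy) in BBOX_INDEX.items():
        if minx <= x <= maxx and miny <= y <= maxy:
            yield pid, (point_id, point)

def point_in_polygon(point, vertices):
    """Exact test via ray casting (even-odd rule)."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def reduce_sorted(pairs):
    """Reducer: input is sorted by polygon id, so each polygon is loaded
    once and can be dropped before the next group starts."""
    for pid, group in groupby(pairs, key=itemgetter(0)):
        vertices = load_polygon(pid)
        for _, (point_id, point) in group:
            if point_in_polygon(point, vertices):
                yield point_id, pid

def run(points):
    """Map, sort by polygon id (the shuffle stand-in), then reduce."""
    emitted = [kv for pt_id, pt in points.items()
               for kv in map_candidates(pt_id, pt)]
    emitted.sort(key=itemgetter(0))
    result = defaultdict(list)
    for point_id, pid in reduce_sorted(emitted):
        result[point_id].append(pid)
    return dict(result)
```

Calling `run({"a": (1, 1), "c": (3.5, 8)})` keeps `a` (inside the square) and drops `c`, which passes the triangle's bounding-box filter in the mapper but fails the exact test in the reducer. The point of the layout is that the expensive geometry is only touched in `reduce_sorted`, one polygon id at a time.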

The other two pieces are model versioning (I've started versioning this with the training data, models will be the same deal), and possibly some sort of monitoring of the objective function per build. Happy to discuss after the release.

omnifroodle commented 7 years ago

Sounds like fun. I'm decent with Spark, so moving the training set creation to ElasticMR sounds interesting.

Good luck on the release. I may play with this a bit in the meantime.

hamiltonchua commented 7 years ago

Hello, I'd like to ask if there's been any progress on this. Thanks!

albarrentine commented 7 years ago

@hamiltonchua not yet. I'm currently testing out a Spark-based implementation for generating the training data as the new OSM polygons are testing the limits of a single machine, but it may be a while before it can be fully automated.

lake-effect commented 6 years ago

Just checking in on this ticket--how does it look right now?