openaddresses / openaddresses-ops

Issues-only repo for discussion of operational considerations for OA

What does an MVP for an OA geocoder look like? #12

Closed waldoj closed 8 years ago

waldoj commented 8 years ago

I envision a bespoke Pelias instance creator, where somebody can indicate which physical area they're interested in, and get a geocoder preloaded with that data from OpenAddresses. I think these are the basic components of that:

  1. A system to ingest OA data and, in response to a geoquery, return address geodata for that area.
  2. A generator of machine images in common formats (e.g., Docker, Vagrant, Heroku, AMI) that can package the requested geodata with Pelias to be deployed by the end user.
  3. The (eventual) capacity for those machine images to request updated data automatically and periodically.
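
As a rough illustration of component 1, here is a minimal sketch of answering a bounding-box geoquery against a single OpenAddresses-style CSV. It assumes a header row with LON and LAT as the first two columns, splits naively on commas (so quoted fields containing commas are not handled), and the function name filter_bbox is hypothetical rather than part of any existing tool.

```shell
#!/bin/sh
# Sketch only: keep the rows of an OpenAddresses-style CSV that fall inside
# a bounding box. Assumes a header row and LON, LAT as the first two columns.
filter_bbox() {
    # usage: filter_bbox MIN_LON MIN_LAT MAX_LON MAX_LAT < input.csv
    awk -F, -v minlon="$1" -v minlat="$2" -v maxlon="$3" -v maxlat="$4" '
        NR == 1 { print; next }   # pass the header row through unchanged
        $1 + 0 >= minlon + 0 && $1 + 0 <= maxlon + 0 &&
        $2 + 0 >= minlat + 0 && $2 + 0 <= maxlat + 0 { print }
    '
}
```

Something like `filter_bbox -78.6 37.9 -78.3 38.2 < addresses.csv` would then keep only rows in a box roughly around one city; the coordinates and file name are illustrative.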

The idea is to close the loop on the publication and consumption of address data. Right now, governments publish address data, which we aggregate within OpenAddresses, and the private sector uses address data published on OpenAddresses. That fails to provide incentives for governments to continue to publish that data. (This is unrelated to those governments who publish address data via ArcGIS, in which case we're getting the data where they happen to store it. They already have existing, internal incentives.) This model will allow governments to run local geocoders (much faster than an API) powered by their own data, that improve as they improve their own data, and that are only updated as often as they update their public data. This creates a better incentive for them to publish that data.

I propose that the MVP for this consists of step 1 in the above list. The 2 subsequent steps depend on step 1, so it can't be either of those. And step 1, on its own, is useful—people can use that as-is, or build atop it.

What's the consensus here? Is this a good MVP? Are the subsequent steps the correct ones? Bonus questions: Do existing project volunteers have the capacity to make step 1 happen, or is this something that should be bid out? (Is it even plausible to bid this out?)

migurski commented 8 years ago

This is a great idea, and thank you for pushing it forward.

For the data ingestion, we’ve talked internally about what an ElasticSearch data prep process would look like for small extracts of OA data. @orangejulius or @dianashk are most up-to-date on this topic.

For the machine images, I think it would make sense to tighten the list and support a smaller range of possibilities. When an offered format stops working, it’s a debugging and support bummer for us. I think Heroku is an interesting direction, and I’ve built “app builders” before that anyone with an account should be able to point-and-click their way through. High effort, maximum reach, strong dependency on single vendor. I think that Docker or Vagrant approaches are a weak compromise: easy for the kinds of nerds who don’t need it, but still too difficult for mortals. AMI is up there somewhere, and could be scripted using Amazon’s API and a builder-style approach with some effort.

I have a weak bias toward a trash-and-replace model for the data updates. If it’s easy to set one of these up, it should be easy to use rapid replacement instead of updating.

NelsonMinar commented 8 years ago

I like the idea! Could you say more about who might use it? I'm trying to figure out who isn't served by just using a public geocoder, either free or paid.

dianashk commented 8 years ago

This is a very cool use of Pelias, so we're excited to see it come to fruition... even if we don't have the bandwidth to do it ourselves. Hooray for open-source!

As it stands today, Pelias is already set up to ingest all or any subset of OA data that you point it at. Setting this up isn't elegant at the moment, and this is where the majority of the work needs to be done. We're working on something to make it a bit simpler to install and build the whole system. Users would still need to install Elasticsearch on their own. So this effectively covers step 1.

I personally like the idea of supporting something simple and accessible, like Heroku, for the first attempt at a builder. If that is all successful, we can always branch out to support other platforms. But no need to rush there.

As for automated updates, we can set it up to rebuild on a schedule, like we currently do with our hosted Mapzen Search instance of Pelias. We rebuild weekly, because we do the whole world and it takes a few days. But with a small dataset you can rebuild daily or even hourly to keep the data fresh. We don't currently support real-time updates, so getting that implemented would require some significant effort.
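
To make the scheduling options concrete, here is a minimal sketch that prints a crontab entry for a weekly, daily, or hourly rebuild. The helper name rebuild_cron_line, the 03:00 run time, and the rebuild command path are all assumptions for illustration; none of them exist in Pelias.

```shell
#!/bin/sh
# Sketch only: print a crontab line for a periodic full rebuild.
rebuild_cron_line() {
    # usage: rebuild_cron_line weekly|daily|hourly COMMAND
    case "$1" in
        weekly) schedule='0 3 * * 0' ;;   # Sundays at 03:00
        daily)  schedule='0 3 * * *' ;;   # every day at 03:00
        hourly) schedule='0 * * * *' ;;   # at the top of every hour
        *) echo "unknown frequency: $1" >&2; return 1 ;;
    esac
    printf '%s %s\n' "$schedule" "$2"
}
```

For example, `rebuild_cron_line daily /usr/local/bin/rebuild-geocoder` prints a line that could be added to a crontab, where /usr/local/bin/rebuild-geocoder stands in for whatever script performs the rebuild.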

waldoj commented 8 years ago

Could you say more about who might use it? I'm trying to figure out who isn't served by just using a public geocoder, either free or paid.

There are no free, public geocoders that aren't license-restricted (e.g., Google) or query-restricted (e.g., TAMU). So there's a big obstacle for a lot of people. Paid geocoders have a price tag that's a real burden on good work. (For example, I wanted to geocode every business in Virginia, as a public service. That was going to cost $1,200. Nope. Turned out, Virginia has a geocoder that is open to the world, and I used that, which took care of the ~75% of addresses that are within Virginia.) The next obstacle is speed. Making a call to a remote API takes time. Making a million calls to a remote API takes a million times longer. Being able to run a geocoder locally is vastly faster.
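
A back-of-the-envelope sketch of the speed argument, assuming requests are made one after another. The per-request latencies used in the example (100 ms for a remote API, 2 ms for a local instance) are illustrative guesses, not measurements.

```shell
#!/bin/sh
# Sketch only: total wall-clock seconds for a sequential batch of requests.
batch_seconds() {
    # usage: batch_seconds REQUESTS MILLISECONDS_PER_REQUEST
    echo $(( $1 * $2 / 1000 ))
}
```

With those assumed numbers, `batch_seconds 1000000 100` comes to 100000 seconds (a bit over a day), while `batch_seconds 1000000 2` comes to 2000 seconds (about half an hour).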

I appreciate that, from your perspective, geocoding seems like a highly-available service. But that's true for vanishingly few people.

waldoj commented 8 years ago

As it stands today, Pelias is already set up to ingest all or any subset of OA data that you point it at. Setting this up isn't elegant at the moment, and this is where the majority of the work needs to be done.

Would you please explain this process further? If I wanted to stand up a Pelias instance for the greater Charlottesville, VA area, what steps would that entail?

NelsonMinar commented 8 years ago

Thanks @waldoj! My apologies, I didn't mean to question whether an install-your-own geocoder was a good idea. I was just trying to understand who might be users of it. I think you've identified three reasons for running your own: more useful than a free service, cheaper than a paid service, and faster if you run it locally. I think in all cases it's a user who is motivated to do a bit more work to get things running for themselves rather than just paying a service provider.

To that last point, faster if you run locally, that would argue for a self-hosted option. I.e., not Heroku or EC2 or some other remote server, but something you can run on your local network as well.

One old school proposal for a deliverable: an Ubuntu PPA that lets you run apt-get install openaddresses-geocoder, built on top of Ubuntu 16.04 LTS. It would require several packages. The stuff required to run Pelias, Pelias itself, and then an OpenAddresses-specific package that contains the scripts necessary to download and install the OA data. You could package the data itself as Ubuntu packages too but that only makes sense for a few well-defined geographic regions, not customized data dumps.

Another old-school proposal is just good documentation. Work with Pelias to make it really easy for someone who knows some command line to install it, then write those download + import scripts. That requires more work on the part of the user than Ubuntu packages, but is (in theory) usable in many Unix environments.

For modern new stuff everyone seems to love Docker. A Docker container that just served geocoding data would be pretty neat. I agree with @migurski that it's more realistic to only support one or a small set of possibilities.

migurski commented 8 years ago

I am convinced about the self-hosted option. I know that Mapnik has a ton of experience with Ubuntu releases and later with a PPA, so I'd like to see if @springmeyer has any wisdom or advice to share.

waldoj commented 8 years ago

I didn't mean to question whether an install-your-own geocoder was a good idea.

That's too bad, because you should. I often convince myself that my terrible ideas are brilliant! :)

To that last point, faster if you run locally, that would argue for a self-hosted option. Ie: not Heroku or EC2 or some other remote server, but something you can run on your local network as well.

I don't think self-hosted is the only use case, I just think it's a good one. But I am persuaded that, in terms of prioritization, it's worth favoring deployment methods that work locally ahead of those that only work remotely. Docker works well for both: you can run it locally, or you can deploy it to AWS/Heroku/DigitalOcean. Seems like the way to start!

migurski commented 8 years ago

I'll research the PPA path. For various reasons I'm really bullish on that and not on Docker these days, mostly due to some experience with Docker oddities biting me.

riordan commented 8 years ago

We've been talking about npm-ifying pelias so that you can npm install pelias and then pelias install a full setup. But a PPA could take care of the Node dependencies and the Elasticsearch installations. Could be a solid start.

Then our efforts would be in building a really lovely configuration & build wizard to help folks pick the datasets/regions they're most interested in.

migurski commented 8 years ago

Yeah, that is my thinking as well. npm would be the developer / tester installation method of choice, while an apt package might be accessible more broadly and would allow for simple usage like RUN apt-get in a CI config, Dockerfile, Vagrantfile, or other Productfile.

migurski commented 8 years ago

Oops, did not mean to hit the mic drop button.

migurski commented 8 years ago

I have done a bit of work on getting .debs and PPAs set up. I’ve successfully installed a package of my own from a non-PPA URL added to sources.list, and now I’m waiting for some key-signing step in Launchpad that’s supposed to take a few hours. Baby steps, so far so good, seems to work.

Mostly cross-referencing suggestions from these articles:

My goal is to get to approximately where Dane and @rcoup succeeded with https://launchpad.net/~mapnik

migurski commented 8 years ago

After a bit of back-and-forth with a helpful Ubuntu Launchpad person, I’ve gotten… someplace.

It’s a surprisingly fiddly process but I’m liking the progress. Feeling like it’s a thing that’s possible to understand.

NelsonMinar commented 8 years ago

Do you have a feeling for if a Debian/Ubuntu package is a reasonable deliverable? I threw that out there as an idea but I'm not confident it's the right thing.

migurski commented 8 years ago

I don’t have a feeling for it yet. I believe this is a one-time pain and so far it’s been about the same level of b.s. as I’ve experienced with Docker and Vagrant. It still looks worthwhile.

waldoj commented 8 years ago

I've really dived into Docker in the past week, and I feel good about using a .deb as a deliverable. That's a single line in a Dockerfile, and of course just as easy to use outside of Docker. I like it.

NelsonMinar commented 8 years ago

I guess the question is requiring Ubuntu. Is that OK for our target users? I think it's the best guess of the Linux distros, but I see a lot of CentOS/Red Hat variants in use too.

migurski commented 8 years ago

I’m not as familiar with the Red Hat environment, so I wonder whether it’s possible or advisable to skip the PPA route, and self-host .deb files and RPMs in one place?

Having spent some time with PPAs, it’s attractive to just put a .deb at a URL someplace and be done with it. I haven't yet successfully installed my test package at https://launchpad.net/~migurski/+archive/ubuntu/hello.

NelsonMinar commented 8 years ago

PPAs offer a lot of advantages, though: you need one to make apt-get upgrade and other apt stuff work. If you get frustrated I could take a look, or maybe someone from Ubuntu will help us?

The drawback of supporting RPMs too isn't so much building the RPM, it's sorting out the operating system compatibilities, library versions, etc. That's why I suggested just supporting Ubuntu LTS 16.04; the M in MVP.

waldoj commented 8 years ago

I guess the question is requiring Ubuntu. Is that OK for our target users?

It's fine for Docker, at least (because I don't think many people care which distro their Docker instance runs). Personally, I look forward to the problem of people saying "gosh, I'd love to use this, but I use CentOS." That seems like a bridge worth crossing when we come to it. :)

migurski commented 8 years ago

Spoke with Nelson offline, and he offered to help with two things I'm stuck on: PPAs with multiple owners (since we’ll likely want one called openaddresses or openaddr), and getting my hellodeb package actually installed.

migurski commented 8 years ago

Followup to the last note:

migurski commented 8 years ago

I got Pelias API published and installed to my PPA sandbox.

migurski commented 8 years ago

Based on my tests, I think this would be the bones of an installation process for Ubuntu 16.04, and ought to work manually or in a container-type context:

  1. Install Oracle JDK, using instructions from Pelias install docs.
    • add-apt-repository ppa:webupd8team/java -y
    • apt-get update && apt-get install oracle-java7-installer -y
      This throws up a license acceptance form; I’m not sure how it will behave under Docker or Vagrant.
  2. Install ElasticSearch, using instructions from Elastic.co.
    • wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | apt-key add -
    • echo "deb http://packages.elastic.co/elasticsearch/2.x/debian stable main" | tee -a /etc/apt/sources.list.d/elasticsearch-2.x.list
    • apt-get update && apt-get install elasticsearch
  3. Install Pelias from OpenAddresses Ubuntu PPA.
    • add-apt-repository ppa:openaddresses/geocoder -y
    • apt-get update && apt-get install pelias-api
  4. Work with Pelias team to document import of sample or extract data into ElasticSearch index.
  5. Make Pelias API available on public port 80 with an HTTP proxy, and possibly packaged documentation.

NelsonMinar commented 8 years ago

That's a pretty straightforward set of instructions! Shame it's all third party repos, but perhaps that's unavoidable.

migurski commented 8 years ago

Yeah. ElasticSearch suggests that the open JDK might work, but @baldur reports having seen problems using it with ES. Only Oracle’s is officially supported. Getting https://github.com/pelias/schema and sample data in there is a next step.

A possible Dockerfile:

FROM ubuntu:16.04
RUN add-apt-repository ppa:webupd8team/java -y
RUN add-apt-repository ppa:openaddresses/geocoder -y
RUN wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | apt-key add -
RUN echo "deb http://packages.elastic.co/elasticsearch/2.x/debian stable main" | tee -a /etc/apt/sources.list.d/elasticsearch-2.x.list
RUN apt-get update
RUN apt-get install oracle-java7-installer elasticsearch pelias-api -y

NelsonMinar commented 8 years ago

I was wondering if this was complicated enough it should be encapsulated in a script, or a Dockerfile, or an image. The nice thing is the Ubuntu packaging is worth the effort since it makes that script simpler too.

waldoj commented 8 years ago

I had to add, after the first line:

RUN apt-get update -y
RUN apt-get install python-software-properties -y
RUN apt-get install software-properties-common -y
RUN apt-get install wget -y

I know that apt-get update is frowned upon in a Dockerfile, but I couldn't install add-apt-repository without it.

waldoj commented 8 years ago

It finally died with this:

Errors were encountered while processing:
 /var/cache/apt/archives/oracle-java7-installer_7u80+7u60arm-0~webupd8~1_all.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)
The command '/bin/sh -c apt-get install oracle-java7-installer elasticsearch pelias-api -y' returned a non-zero code: 100

I'm not sure why (other than Java ¯_(ツ)_/¯), but I'll see if I can figure out what's up.

migurski commented 8 years ago

Damn, I bet that's the part where it asks for a license click-through.
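
If the license prompt is indeed the cause, one commonly used workaround is to preseed the answer through debconf before installing. The sketch below only prints the preseed line; the helper name license_preseed is made up for illustration, and the debconf key is the one generally cited for the webupd8 installer packages, so treat it as an assumption to verify against the package itself.

```shell
#!/bin/sh
# Sketch only: emit a debconf preseed line that pre-accepts the Oracle
# license for a webupd8-style installer package.
license_preseed() {
    # usage: license_preseed PACKAGE
    printf '%s shared/accepted-oracle-license-v1-1 select true\n' "$1"
}

# In a Dockerfile RUN step or provisioning script, the output would be piped
# into debconf before the apt-get install, for example:
#   license_preseed oracle-java7-installer | debconf-set-selections
```

This has not been tried against the build that failed above, so it may not be the whole story.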

migurski commented 8 years ago

I would be curious to learn more about why Oracle’s Java is necessary for ES. Maybe for smaller uses, it’d be sufficient to use Open JRE?

waldoj commented 8 years ago

Here is a product matrix of which JVMs work with which Elasticsearch versions. I don't see Open JRE on there, but I know very little about Java, so that may or may not mean anything.

migurski commented 8 years ago

The OpenJDK in 16.04 says this about itself:

Package: openjdk-8-jdk
Priority: optional
Section: java
Installed-Size: 458
Maintainer: OpenJDK Team <openjdk@lists.launchpad.net>
Architecture: amd64
Source: openjdk-8
Version: 8u77-b03-3ubuntu3
Provides: java-compiler, java-sdk, java2-sdk, java5-sdk, java6-sdk, java7-sdk, java8-sdk
…
Description-en: OpenJDK Development Kit (JDK)
 OpenJDK is a development environment for building applications,
 applets, and components using the Java programming language.
 .
 The packages are built using the IcedTea build support and patches
 from the IcedTea project.
…

So it’s using IcedTea. I believe Java 8 is internally 1.8, so it also matches the supported 1.7.0.55+ version number. Waldo, what happens if you replace the oracle-java7-installer installation with openjdk-8-jdk? For me, ES seemed to work.
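
The version reasoning can be checked mechanically. This is a minimal sketch that assumes GNU sort with the -V (version sort) option is available; the helper name version_at_least is hypothetical.

```shell
#!/bin/sh
# Sketch only: succeed when FOUND is greater than or equal to MINIMUM,
# comparing dotted version strings with GNU sort -V.
version_at_least() {
    # usage: version_at_least FOUND MINIMUM
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example, `version_at_least 1.8.0 1.7.0.55 && echo supported` prints "supported", since 1.8.0 sorts after 1.7.0.55.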

migurski commented 8 years ago

Simpler possible Dockerfile:

FROM ubuntu:16.04
RUN add-apt-repository ppa:openaddresses/geocoder -y
RUN wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | apt-key add -
RUN echo "deb http://packages.elastic.co/elasticsearch/2.x/debian stable main" | tee -a /etc/apt/sources.list.d/elasticsearch-2.x.list
RUN apt-get update
RUN apt-get install openjdk-8-jdk elasticsearch pelias-api -y

Waldo, for me it was not necessary to install python-software-properties software-properties-common wget to get add-apt-repository; it just worked. Curious why.

waldoj commented 8 years ago

Running that Dockerfile yields this:

$ docker build .
Sending build context to Docker daemon 2.048 kB
Step 1 : FROM ubuntu:16.04
 ---> 44776f55294a
Step 2 : RUN add-apt-repository ppa:openaddresses/geocoder -y
 ---> Running in bac8df07b705
/bin/sh: 1: add-apt-repository: not found
The command '/bin/sh -c add-apt-repository ppa:openaddresses/geocoder -y' returned a non-zero code: 127

I needed to add these to get this to run:

RUN apt-get update -y
RUN apt-get install python-software-properties -y
RUN apt-get install software-properties-common -y

When I did that, this was the outcome:

Step 6 : RUN wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | apt-key add -
 ---> Running in fa96df30947b
/bin/sh: 1: wget: not found
gpg: no valid OpenPGP data found.
The command '/bin/sh -c wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | apt-key add -' returned a non-zero code: 2

migurski commented 8 years ago

I guess the 16.04 Docker image is much slimmer than the server distribution, which I suppose makes sense. So:

FROM ubuntu:16.04

RUN apt-get update -y
RUN apt-get install python-software-properties software-properties-common wget -y

RUN add-apt-repository ppa:openaddresses/geocoder -y
RUN wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | apt-key add -
RUN echo "deb http://packages.elastic.co/elasticsearch/2.x/debian stable main" | tee -a /etc/apt/sources.list.d/elasticsearch-2.x.list

RUN apt-get update -y
RUN apt-get install openjdk-8-jdk elasticsearch pelias-api -y

migurski commented 8 years ago

Trying to create the Pelias index failed for me with this message:

[mapper_parsing_exception] analyzer on field [borough_id] must be set when search_analyzer is set

@orangejulius pointed out that Pelias wants ElasticSearch 1.7, so the process should look like this with 1.7 instead of 2.x:

FROM ubuntu:16.04

RUN apt-get update -y
RUN apt-get install python-software-properties software-properties-common wget -y

RUN add-apt-repository ppa:openaddresses/geocoder -y
RUN wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | apt-key add -
RUN echo "deb http://packages.elastic.co/elasticsearch/1.7/debian stable main" | tee -a /etc/apt/sources.list.d/elasticsearch-1.7.list

RUN apt-get update -y
RUN apt-get install openjdk-8-jdk elasticsearch pelias-api -y

That works:

% node scripts/create_index.js;
[put mapping]    pelias      { acknowledged: true }

migurski commented 8 years ago

Aw yeah, getting some results from a single-county import: http://dpaste.com/0612XXJ
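
For anyone wanting to try a query themselves, here is a minimal sketch that builds a search URL. It assumes the Pelias API is listening on its default port 3100 and serving the /v1/search endpoint; the helper name pelias_search_url and the sample address are made up, and the percent-encoding only covers a few characters.

```shell
#!/bin/sh
# Sketch only: build a Pelias /v1/search URL for a free-text query.
# Encodes just percent signs, spaces, commas and ampersands.
pelias_search_url() {
    # usage: pelias_search_url HOST:PORT "free-text address"
    query=$(printf '%s' "$2" | sed -e 's/%/%25/g' -e 's/ /%20/g' -e 's/,/%2C/g' -e 's/&/%26/g')
    printf 'http://%s/v1/search?text=%s\n' "$1" "$query"
}

# The result can be handed to curl, for example:
#   curl -s "$(pelias_search_url localhost:3100 '100 Main St, Charlottesville')"
```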

waldoj commented 8 years ago

!

riordan commented 8 years ago

Mazel!

migurski commented 8 years ago

This is basically current: https://github.com/openaddresses/pelias-ubuntu-xenial#readme

There’s still some documentation to do around database setup, address import, and why @#$% elasticsearch doesn’t want to start on boot. Also, Amazon are taking their time making an Ubuntu 16.04 image available and there’s not yet a supported upgrade path, so maybe we should build these for 14.04 as well?

migurski commented 8 years ago

Progress report: I’ve run the setup above on a few machines, and I’m slowly working through the foibles of ElasticSearch. It’s pretty greedy for RAM; even running an import on a 4GB machine had trouble, and @missinglink suggests 8GB. I still don’t have an idea for getting it to start at boot.
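
On the RAM question, a minimal sketch of the usual rule of thumb of giving Elasticsearch about half of the machine's memory as heap. The helper name suggested_heap_mb is hypothetical, and the settings in the comments are untested assumptions for this particular setup.

```shell
#!/bin/sh
# Sketch only: suggest an Elasticsearch heap size as half of total RAM.
suggested_heap_mb() {
    # usage: suggested_heap_mb TOTAL_RAM_MB
    echo $(( $1 / 2 ))
}

# Assumption: the 1.x Debian package reads ES_HEAP_SIZE from
# /etc/default/elasticsearch, so on an 8GB machine one might set:
#   ES_HEAP_SIZE=$(suggested_heap_mb 8192)m
# Assumption, not verified here: on a systemd-based release, start at boot
# may be a matter of running:
#   systemctl enable elasticsearch
```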

I did build Ubuntu 14.04 versions of all the packages, though. This is getting close to blog post or tutorial state, though I still think there are going to be some bad ops surprises for users.

migurski commented 8 years ago

I blogged the process for getting this set up, here: http://mike.teczno.com/notes/openaddr/5min-geocoder.html

NelsonMinar commented 8 years ago

That's amazing @migurski.

migurski commented 8 years ago

I… think it’s possible to close this issue?

iandees commented 8 years ago

I agree. It might be good to find a place to put your blog post in our repo as a document for people to follow.

migurski commented 8 years ago

Good call, I’ll do that.

migurski commented 8 years ago

Added a link to the bottom of the post, http://mike.teczno.com/notes/openaddr/5min-geocoder.html.

waldoj commented 8 years ago

:+1: