openaddresses / machine

Scripts for running OpenAddresses on a complete data set and publishing the results.
http://results.openaddresses.io/
ISC License

New feature: CSV enhancer, a new processing step #283

Open · NelsonMinar opened this issue 8 years ago

NelsonMinar commented 8 years ago

Various discussions have led to a consensus that each source should result in two output CSV files. One output file that represents a direct translation of the authoritative source (called out.csv here), without any alteration beyond what is required to cast the data to our CSV schema. And then a second file that is "enhanced", with various improvements like de-duplication, street name rewriting, empty row removal, etc (called enhanced.csv here). This ticket serves to collect that discussion and start a design for a solution.

I propose adding a new stage to the source processing pipeline, to take place after conform. The enhancement code would take out.csv as its only input and produce enhanced.csv as its only output. Both files would then be placed in the final zip product for download.

Taking only out.csv as input for the enhancer may be unrealistic. We may need to also use the conform JSON spec so that we can have source-specific configuration for the enhancer. (Particularly for localization). I'm pretty sure the enhancer should not have access to the actual source datafiles; allowing that would significantly complicate the enhancer.
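
To make the shape concrete, here's a rough sketch of what the stage could look like. The enhance function, its arguments, and the de-duplication key are all hypothetical, not anything that exists in machine today:

    import csv

    def enhance(out_path, enhanced_path, conform_spec):
        """Hypothetical enhancement stage: read out.csv, write enhanced.csv.

        conform_spec is the source's conform JSON, passed along as context
        (e.g. for locale-specific rules); the original source data files
        are deliberately out of reach.
        """
        with open(out_path, newline='') as infile, \
             open(enhanced_path, 'w', newline='') as outfile:
            reader = csv.DictReader(infile)
            writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
            writer.writeheader()
            seen = set()
            for row in reader:
                if not any(row.values()):      # drop empty rows
                    continue
                key = (row.get('NUMBER'), row.get('STREET'))
                if key in seen:                # naive de-duplication
                    continue
                seen.add(key)
                writer.writerow(row)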

Here's a list of all the relevant GitHub issues where this idea comes from:

https://github.com/openaddresses/openaddresses-ops/issues/10
https://github.com/openaddresses/openaddresses-ops/issues/2
https://github.com/openaddresses/machine/issues/32
https://github.com/openaddresses/machine/issues/89
https://github.com/openaddresses/machine/issues/165
https://github.com/openaddresses/machine/issues/168
https://github.com/openaddresses/machine/issues/240

riordan commented 8 years ago

I really like this approach. It puts the project in a position to distribute the raw materials while also benefiting from (and sharing) what all of its downstream users learn from their enhanced cleanup processes.

I echo @NelsonMinar's point that out.csv on its own probably won't be sufficient for the enhancing process. I suspect we'll have cleanup rules that are context-driven (e.g. using the country to determine possible candidate languages for expansion). So both out.csv (content) + conform (context) make sense as the inputs to the process.

Additionally, it would be great if the enhancement code were really modular. I can imagine wanting to stack particular processing steps together, and modularity would simplify contributing a new step to the cleanup pipeline. Not to mention it would make things easier for a downstream user who only wants to run some of the enhancement steps, rather than precisely the set we run; see the sketch below.
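
For illustration, the stacking could be as simple as composing generator functions over rows. Every name here is made up, purely to show the shape:

    def drop_empty_rows(rows):
        # Skip rows where every field is blank.
        return (row for row in rows if any(row.values()))

    def dedupe(rows):
        # Keep the first occurrence of each (NUMBER, STREET) pair.
        seen = set()
        for row in rows:
            key = (row.get('NUMBER'), row.get('STREET'))
            if key not in seen:
                seen.add(key)
                yield row

    def run_pipeline(rows, steps=(drop_empty_rows, dedupe)):
        # Downstream users could pass whichever steps they want, in any order.
        for step in steps:
            rows = step(rows)
        return rows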

Overall, really awesome!

Computer: Enhance

NelsonMinar commented 8 years ago

The change at #281 means there's now a small post-processing step done when regional zip files are collected. It currently does street name expansion, which has now been removed from the individual out.csv files. My proposal was to put an enhanced.csv alongside every out.csv, not just in the collection; #281 is an expedient first step towards getting there.

migurski commented 8 years ago

Oh, so the enhanced thing would go into individual source outputs?

My thinking with having it be in the collections is that we might start breaking the relationship back to sources in other ways. For example, collecting all counties in a state together and deduping with the statewide data. It would blend the sources irretrievably.

NelsonMinar commented 8 years ago

My thinking was to have both out.csv and enhanced.csv in each individual source output, yeah. Most of the processing I had in mind applies to a single source, maybe even particularly so if we do locale-dependent enhancements. I also like the idea of enhancing the collections even further, but mostly I see the collections as just the concatenation of the individual sources. You're closer to users than I am, though, so I could be wrong.

migurski commented 8 years ago

libpostal might be a good fit as a processing step for the enhanced output: https://mapzen.com/blog/inside-libpostal; it’s supposed to help with normalization, which should get us enhanced de-duping.
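
The Python bindings (pypostal) expose that normalization as expand_address; a rough sketch of how it might back a de-dupe check, with same_street being an invented helper:

    from postal.expand import expand_address  # Python bindings: pip install postal

    def same_street(a, b):
        # Treat two street strings as duplicates if any of their
        # libpostal expansions coincide.
        return bool(set(expand_address(a)) & set(expand_address(b)))

    same_street('S Main St', 'South Main Street')  # True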

Curious to hear from Mapzen search team members here about its potential suitability: @riordan, @dianashk, @trescube?

trescube commented 8 years ago

If I understand the question correctly, you're looking to have abbreviated source values expanded, such as S Main St -> South Main Street. That means you're talking about the expansions mechanism in libpostal, which can expand W to West, Ave to Avenue, and PA to Pennsylvania. It can certainly be useful to combinatorially generate all possible expansions of the tokenized input. One of my concerns with libpostal's expansions mechanism is that it can be gratuitously liberal with normalizing. Take, for instance, the address 553 S Main St, Red Lion, PA 17356. Running

./libpostal "553 s main st red lion pa 17356"

results in the following:

553 south main saint red lion pennsylvania 17356
553 south main saint red lion pa 17356
553 south main street red lion pennsylvania 17356
553 south main street red lion pa 17356
553 san main saint red lion pennsylvania 17356
553 san main street red lion pennsylvania 17356
553 s main saint red lion pennsylvania 17356
553 s main street red lion pennsylvania 17356
553 s main street red lion pa 17356

There's probably a context in which san main saint is a valid interpretation of s main st, but it's probably not what you're looking for in OA, at least in its current form.

Here's another example:

10 R St Washington DC

The results are:

10 r saint washington district of columbia
10 r saint washington dc
10 r saint washington 600
10 r street washington district of columbia
10 r street washington dc
10 r street washington 600
10 river saint washington district of columbia
10 river street washington district of columbia
10 river street washington dc
10 river street washington 600

As you can see, it expands R to River, St to Street, and DC to 600 (Roman numerals). Only 2 of these results are correct interpretations of the input as a human would read it. This example is a bit extreme, since OA would only run this on the street field; I'm just demonstrating some of the behavior I've seen, and I'm certain there's a street name somewhere that gets parsed as unintended Roman numerals.

trescube commented 8 years ago

I'm in the middle of adding units to the US sources and realized that libpostal (the address parser) would be great for teasing apart a single concatenated house number + street name field to replace the frail regexp. For reference:

        "number": {
            "function": "regexp",
            "field": "address",
            "pattern": "^([0-9]+)"
        },
        "street": {
            "function": "regexp",
            "field": "address",
            "pattern": "^(?:[0-9]+ )(.*)",
            "replace": "$1"
        }
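
For comparison, a sketch of the same split done with libpostal's parser via the Python bindings; split_address is an invented wrapper, not an existing conform function:

    from postal.parser import parse_address  # Python bindings: pip install postal

    def split_address(address):
        # Let libpostal label the tokens, then pull out the pieces the
        # two regexps above were trying to capture.
        parts = {label: value for value, label in parse_address(address)}
        return parts.get('house_number', ''), parts.get('road', '')

    split_address('553 S Main St')  # -> ('553', 's main st')
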
migurski commented 8 years ago

There are Python bindings, so it should be possible to slip it right into the spec. Might be necessary to get it into PyPI if it’s not there already, though.

trescube commented 8 years ago

I was talking to Al today; it needs ~1.5 GB to run. Hopefully that isn't a problem.

migurski commented 8 years ago

1.5 GB of RAM? It’d be a problem right now, but the EC2 and autoscale configurations are due for an overhaul with Ubuntu 16.04 LTS and Python 3. We could choose a configuration with more RAM, if this helped a lot.

riordan commented 8 years ago

We've been talking about building a standalone libpostal service. This might be a case for that, especially if it's operating with a bulk endpoint.

migurski commented 8 years ago

Standalone libpostal exists now: https://mapzen.com/documentation/libpostal/

Is this still applicable?