openaddresses / openaddresses-ops

Issues-only repo for discussion of operational considerations for OA

Discussion: what should be in the big U.S. download files? #10

Closed · migurski closed this 7 years ago

migurski commented 8 years ago

I’m hearing broad consensus from the primary users of the bulk download files available at http://results.openaddresses.io that they should be processed beyond a simple conform. @feomike and his team at CFPB would like to see errors such as incorrect quotes from source CSVs scrubbed, and zip codes auto-assigned based on Census ZCTA for US addresses. @stephenkhess, @dianashk, and others at Mapzen would like to see known junk data such as duplicate rows or missing house numbers removed (e.g. https://github.com/openaddresses/machine/issues/240).

We have a lot of existing knowledge about addresses in the U.S., and it might make sense to nudge the downloadable collections into a more post-processed direction, taking care of some of the semantic issues above.

I’m interested to hear opinions on the level and kind of post-processing we may want to do to the downloads.

erictheise commented 8 years ago

Thanks for the heads up, @migurski!

feomike commented 8 years ago

agreed, thanks @migurski - i generally like the idea of the cleanup. we love the regional download .csv's. the fields there (i think there are 9 fields) are right for our uses. the files are generally consistent, although ensuring they are all with the appropriate state would be best. we would also really love if they all had more consistent population (eg number didn't have PO box or 9999 values; street was only street and not a concatenation of the whole row) etc. all this being said, great work.

NelsonMinar commented 8 years ago

Mike you asked this question about the US, but doesn't it apply to every source?

I still like the idea of authoritative sources I proposed. Altering the data makes it no longer authoritative. But I suspect almost everyone who uses our data would prefer cleaned and processed data. We can always provide undoctored data too for folks who need it.

However, the challenge is doing this processing right. Are there folks with geocoding expertise we can draw on for how to do the postprocessing? Would MapZen's libpostal be useful?

migurski commented 8 years ago

@NelsonMinar: it does, but I’m not sure we know enough about other places to confidently make judgements about correctness, so I’m just thinking about the U.S. for now.

I’m not sure that this kind of alteration conflicts with the proposal of authoritativeness. We’re already conforming it to a new schema so there’s some alteration, and we’re pointedly not accepting user-submitted changes or additions which would definitely be non-authoritative. I think that automated de-duping and checking would still fall within the definition of authoritative data. I think.

I think @stephenkhess probably has the direct experience we want.

@feomike: a couple of the things you mention sound like conform errors to me, such as the PO boxes which would be the result of mistakenly not using situs address.

feomike commented 8 years ago

@migurski - that's right, we would advocate for fixing the conform errors. i interpreted @NelsonMinar's comments (haven't digested all of his proposal yet) to perhaps be 'backfilling things like ZIP where not found is against the authoritative schtick we have'. i agree w/ that. keeping OA as close to source as possible is a good approach.

migurski commented 8 years ago

Ah yeah, agreed that backfilling zip would be anti-authoritative.

waldoj commented 8 years ago

It would be hugely helpful to provide per-state downloads within the US. Even for states that provide their own statewide files, since our data may well be updated more often. Promoted properly (which I'm prepared to do), this could be a great way to promote OA and the OA concept to state GIS offices and local GIS offices.

iandees commented 8 years ago

In general, I think we should introduce a post-processing workflow that works on the existing OA data (probably based on the existing one that Mapzen uses?). Such a workflow could merge duplicate addresses, remove "obviously wrong" data, etc.
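A minimal sketch of what the filtering half of such a workflow might look like (the field names follow the OA CSV schema; the specific rules — dropping missing house numbers, "9999" placeholders, and exact duplicates — are illustrative, not the actual Mapzen pipeline):

```python
def clean_rows(rows):
    """Drop obviously-bad rows and exact duplicates (illustrative rules only)."""
    seen = set()
    for row in rows:
        number = row.get("NUMBER", "").strip()
        street = row.get("STREET", "").strip()
        # Skip rows with no house number, no street, or placeholder values.
        if not number or not street or number == "9999":
            continue
        # De-dupe on a normalized (number, street, unit) key.
        key = (number.lower(), street.lower(), row.get("UNIT", "").lower())
        if key in seen:
            continue
        seen.add(key)
        yield row

rows = [
    {"NUMBER": "123", "STREET": "Main St", "UNIT": ""},
    {"NUMBER": "123", "STREET": "Main St", "UNIT": ""},  # exact duplicate
    {"NUMBER": "9999", "STREET": "Main St", "UNIT": ""},  # placeholder value
    {"NUMBER": "", "STREET": "Elm Ave", "UNIT": ""},      # missing number
]
print(len(list(clean_rows(rows))))  # 1
```

The interesting editorial questions start once the rules go beyond exact matches — e.g. whether "123 Main St" and "123 Main Street" count as duplicates.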

I am much more conflicted about that process layering new data into existing "authoritative" data from the source. There are definitely some editorial decisions we would need to make. For example, adding zipcodes from ZCTAs doesn't seem like something we should be doing, but merging a state-level and county-level address to create a more complete row for the same address seems like something we should do.

Maybe the answer is to have a conservative default set of modifications we apply and distribute on openaddresses.io, but we could document and make easy other processing steps for those who want to have different editorial direction.

migurski commented 8 years ago

This came up again in another issue: https://github.com/openaddresses/openaddresses/issues/1456

It reminded me that we are still doing bad street name expansion in the collection files, and that we’ve had a few discussions on the future of those files. Bumping this issue in case anyone’s had any more recent thoughts about it.

migurski commented 7 years ago

We had an internal Mapzen conversation about these with OA folks @iandees, @dianashk, and @trescube present. I believe that the consensus on this question might be to remove post-processing from the big download files, leaving the data contents identical to the input. @ingalls, @sbma44, @slibby, @feomike, and others who might be consuming these files: any objections?

sbma44 commented 7 years ago

This sounds wise to me. My own sense is that OA shouldn't mess with abbreviations, but can usefully normalize character encodings, CSV correctness and reproject geometry. Adding ZCTAs doesn't seem like a great idea, to be honest (the Census files are pretty out of date at the moment anyway).

NelsonMinar commented 7 years ago

I've lost the plot here. Is the proposal to stop rewriting street names entirely in all output files? @migurski said "big download files" but why not the small ones too?

Later on I could still see us also publishing nicely cleaned up street names as a separate value-add process. But I think it's important to pass through the original text in the primary files, as part of our story of being authoritative data.

ingalls commented 7 years ago

@migurski @sbma44 Yeah I've definitely switched my opinion here since I wrote the first version of OA that rewrote abbreviations. Happy to see this code implemented by the individual data-user and not our ETL process.

migurski commented 7 years ago

The only place we currently attempt to expand street names is in the big collection files — we've never touched them in the individual per-source files. Some of that logic is here FYI. It sounds like we agree that we should stop doing this everywhere!

NelsonMinar commented 7 years ago

Great! In the long long ago we did do the street name manipulation on all output files; I forgot you changed that a while back to only be the big collections. Yes, let's just get rid of it everywhere.

Presumably MapZen will post-process these files with its own fancier expansion, maybe using libpostal? Any chance of donating that code or output to the OA project so others can do it?

migurski commented 7 years ago

Our attempted expansions are wrong enough to expand "La Guardia" to "Lane Guardia" and "Dr Goodell" to "Drive Goodell", so I have some hopes for Al’s libpostal approach to get this right.

NelsonMinar commented 7 years ago

I did write that code under protest :-)

trescube commented 7 years ago

Some expansions get a little tricky. For instance, St can be expanded to both Saint and Street, and token position isn't a guarantee (that is, you can't just expand St->Street when it's the last token), namely in the case of St of Dreams in Martinsburg, WV, among others.
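The ambiguity is easy to demonstrate. Here's a sketch of the naive last-token rule (the suffix table and function are illustrative, not OA's actual code) and where it breaks down:

```python
# Illustrative suffix table; real expanders carry far more entries.
SUFFIXES = {"st": "Street", "ave": "Avenue", "dr": "Drive", "ln": "Lane"}

def naive_expand(street):
    """Expand only the last token, and only if it looks like a suffix."""
    tokens = street.split()
    last = tokens[-1].rstrip(".").lower()
    if last in SUFFIXES:
        tokens[-1] = SUFFIXES[last]
    return " ".join(tokens)

print(naive_expand("Main St"))       # "Main Street" -- correct
print(naive_expand("St of Dreams"))  # unchanged, but should be "Street of Dreams"
print(naive_expand("St James Ln"))   # "St James Lane" -- leading "St" is Saint, untouched by luck
```

A positional rule gets "St of Dreams" wrong in one direction, while dropping the positional check would mangle "St James" in the other; neither heuristic is safe without more context.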

iandees commented 7 years ago

A couple thoughts:

I'm very much in favor of simply collecting address data directly from the local source and distributing it as-is.

I'd also like to see us apply a set of changes to that data (de-dupe, expand contractions, remove data that is obviously in the wrong place, etc.) that everyone might appreciate and distribute that data. These "munging steps" should be easily downloadable and usable with the above as-is data so that data consumers can mix and match the manipulations if they don't like our default.

Specific to name expansion: we currently concatenate the street names when the source has them split apart. If we expand our output schema to include these separated names (and update our conform objects) then name expansion becomes vastly easier because the expander has the context it needs to reliably make the expansion decision. i.e. "Dr" in "street type" column should expand to "Drive", whereas "Dr" in the "street name" column probably shouldn't be expanded at all.
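A sketch of how a split schema would make this trivial (the `street_name`/`street_type` columns are the hypothetical expanded schema described above, not OA's current single STREET field):

```python
# Illustrative expansion table, keyed on the dedicated street-type column.
TYPE_EXPANSIONS = {"st": "Street", "dr": "Drive", "ave": "Avenue", "ln": "Lane"}

def expand(street_name, street_type):
    """Expand only the type column; never touch the name column."""
    key = street_type.rstrip(".").lower()
    expanded_type = TYPE_EXPANSIONS.get(key, street_type)
    return f"{street_name} {expanded_type}".strip()

print(expand("Dr Goodell", ""))  # "Dr Goodell" -- honorific in the name column, untouched
print(expand("Goodell", "Dr"))   # "Goodell Drive" -- "Dr" in the type column, expanded
```

With the column carrying the semantics, the expander no longer has to guess what a token means from its position.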

migurski commented 7 years ago

To that end, I suppose it should be made easier to iterate over addresses in some arbitrary collection of zip files starting from state.txt.

feomike commented 7 years ago

thanks for including us in the discussion. i like this idea.

migurski commented 7 years ago

There is a new PR: https://github.com/openaddresses/machine/pull/435

migurski commented 7 years ago

Closed in under 11 months!