Project Scope - Githubissues

ingalls commented 6 years ago

Hey Folks!

We've talked a bunch about what the scope of the project is and for a while have informally tracked buildings and parcel data.

We've collected a massive amount of address data over the years, and while we are by no means finished with our core objective, there is a lot more good we could bring to the open data world by tracking related data in our current openaddresses style.

I would like to solicit feedback on how we should go about tracking this data. I'm going to have more dev cycles to spare and would love to put a lot of these to use in furthering this goal.

I think we have a couple options that come to mind:

Option 1 - Mono Repo

The first option (and probably my preference) is to adopt a mono-repo style approach, with each geographic area getting a single file and multiple data types (buildings/addresses/parcels) and/or multiple sources (county E911/county Assessor) being tracked in the same file. This avoids our current awkward city_of_random & city_of_random2 approach.

Potential Source - Just brainstorming

{
  "coverage": {
    "census": "stuff",
    "country": "us",
    "etc": true
  },
  "address": [{
    "data": "http://",
    "website": "here",
    "attribution": "Share me!",
    "conform": { }
  }],
  "building": [

  ],
  "parcel": [

  ]

}
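To make the brainstormed layout a little more concrete, here is a minimal Python sketch of how a consumer might walk such a file and list one download task per source entry. It is not how machine works today; the `iter_layer_sources` helper, the example.com URL, and the assumption that every top-level key other than "coverage" names a layer are all hypothetical.

```python
import json

# Hypothetical source document following the brainstormed layout above.
SOURCE = """
{
  "coverage": {"country": "us"},
  "address": [{"data": "http://example.com/addresses", "conform": {}}],
  "building": [],
  "parcel": []
}
"""


def iter_layer_sources(source):
    """Yield (layer, index, entry) for each source entry in a mono-repo style file.

    Assumption: every top-level key except "coverage" is a layer name
    whose value is a list of source entries.
    """
    for key, value in source.items():
        if key == "coverage":
            continue
        for index, entry in enumerate(value):
            yield key, index, entry


source = json.loads(SOURCE)
tasks = list(iter_layer_sources(source))
for layer, index, entry in tasks:
    print(layer, index, entry["data"])
```

Empty layers such as "building" and "parcel" simply produce no tasks, so a file can declare a layer before any data for it has been found.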

Option 2 - Dual Repo w/ shared machine

openaddresses/openaddresses
openaddresses/openbuildings
openaddresses/openparcels etc.

Machine could then be iterated on without breaking any of our current formatting. This is definitely the easiest path with our current infrastructure, but it means a lot of duplication between projects and involves many extra steps for adding data or searching for a given area. MapServer prefixes for different data types are usually the same.

I'm less excited about this approach as it will provide the most friction for future expansions in scope of data that we may want to track; roads, parks, trails, who knows.

cc/ @openaddresses/administrators @openaddresses/contributers

iandees commented 6 years ago

Thanks for putting this down "on paper" @ingalls! It's something I've had in the back of my mind for quite some time.

My first inclination would be that we keep a single repo and somehow split apart the different data categories into separate source files rather than a single file with multiple sources in it. I like the concept of a single file representing a single download task for machine.

To distinguish between different data types, perhaps we have a different root directory for each data type or a key in the source file that references what sort of data is getting downloaded and how it should be presented on whatever UI we use.
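As a rough illustration of the two options in this comment, the Python sketch below resolves a source file's data type from an explicit key first and falls back to the root directory of its path. The key name `layer`, the directory names in `KNOWN_LAYERS`, and the example paths are assumptions made up for the sketch, not anything the project has agreed on.

```python
from pathlib import PurePosixPath

# Hypothetical root directory names, one per data type.
KNOWN_LAYERS = {"addresses", "buildings", "parcels"}


def layer_for(path, source):
    """Return the data type for one source file.

    An explicit "layer" key (hypothetical name) wins; otherwise the
    first component of the repo-relative path is used.
    """
    explicit = source.get("layer")
    if explicit is not None:
        return explicit
    root = PurePosixPath(path).parts[0]
    if root in KNOWN_LAYERS:
        return root
    raise ValueError(f"cannot determine layer for {path}")


print(layer_for("buildings/us/dc/statewide.json", {}))
print(layer_for("sources/us/dc/statewide.json", {"layer": "parcels"}))
```

Either way, each file still maps to exactly one download task, which is the property this comment wants to keep.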

andrewharvey commented 6 years ago

"I'm less excited about this approach as it will provide the most friction for future expansions in scope of data that we may want to track; roads, parks, trails, who knows."

Agreed, there is huge scope to expand this, so whichever approach is chosen should require minimal effort even when expanding out to 50 or so data layers.

I'll shout out @stevage's https://opencouncildata.org/ (which is made up of the standards at http://standards.opencouncildata.org/ and a map aggregating all the data feeds at https://opencouncildata.github.io/Platform/) as something very similar to an openaddresses with an expanded scope of data layers. AFAIK, opencouncildata is AU only and doesn't transform any non-standards-compliant data, but I see room for something global, just like OA but for more data layers, that:

catalogues open data in a consistent machine readable way (OA sources)
transforms into a common schema (OA conform)
has CI/CD built in (OA machine)
publishes an aggregated dataset in a common schema (OA downloads)

ingalls commented 6 years ago

Spent a couple hours today sketching out what the change would look like on the oa/oa side

https://github.com/openaddresses/openaddresses/pull/3908

Next steps will be to have people take a look and then start working on updating machine

migurski commented 6 years ago

Interesting! There’s probably a way to support both old- and new-style source files in the same codebase, at least during a transitional period. I’d like to avoid a massive repo-wide edit in the sources, and instead support something a bit more graceful.

migurski commented 6 years ago

Summarizing some off-site conversation, it seems like we’ve settled on an iterative approach for this work. We think that a mono-repo is a good initial goal, and we’ll identify a single source such as Washington D.C. to try out new layers. Machine will be modified to support the new input, and that will let us figure out some of the harder questions, including:

What does a source represent? A jurisdiction or a geography?
What do success, failure, input, and output mean for a multi-layer source?
How can we best support old and new syntax simultaneously during a transition period?
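One possible answer to the last question is to normalize every source into the new layout at load time, so the rest of the pipeline only sees one shape. The Python sketch below shows the idea under stated assumptions: old-style files are taken to keep their fields (such as "data" and "conform") at the top level next to "coverage" and to describe addresses only, and the layer keys are the ones from the brainstormed schema. The `normalize` helper is hypothetical and is not machine's actual code.

```python
# Layer keys taken from the brainstormed schema earlier in the thread.
LAYERS = ("address", "building", "parcel")


def normalize(source):
    """Return a source in the multi-layer layout.

    New-style sources are returned unchanged. Old-style sources are
    wrapped as a single "address" entry (assumption: old-style files
    only ever described address data).
    """
    if any(isinstance(source.get(layer), list) for layer in LAYERS):
        return source  # already new-style
    # Everything except "coverage" becomes the one address entry.
    entry = {k: v for k, v in source.items() if k != "coverage"}
    return {"coverage": source.get("coverage", {}), "address": [entry]}


old_style = {
    "coverage": {"country": "us"},
    "data": "http://example.com/addresses",  # hypothetical URL
    "conform": {},
}
print(normalize(old_style))
```

With something like this in place, existing sources would not need a repo-wide edit; they could be migrated file by file while both forms keep working.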

andrewharvey commented 6 years ago

What does https://github.com/openaddresses/openbuildings mean in the context of this? Will it be merged into https://github.com/openaddresses/openaddresses once the v2 schema lands?

Based on the draft v2 schema I've been working on adding building and parcel sources for AU and NZ at https://github.com/openaddresses/openaddresses/tree/v2-au-nz-parcels-buildings but after seeing https://github.com/openaddresses/openbuildings I'm not sure where it should fit now?

iandees commented 6 years ago

I think we had some folks excited to help track down extra data so that's why links started showing up on the openbuildings repo. That's pretty ad-hoc right now and people are adding data as issues or as lines in the readme there.

With @ingalls' work happening on v2 in this repo, I think I'd like to see this extra data eventually show up back in this repo using the new schema.

In the meantime though, the openbuildings repo is a good place for folks to dump the results of their data searches.

andrewharvey commented 6 years ago

Awesome, that makes sense. Very excited to see this work pan out!

ingalls commented 6 years ago

@andrewharvey Exactly what @iandees said!

For some context around timelines and whatnot, the first pass at support for multiple layers can be found here. This is the change that allows machine to support the new V2 format.

The second and slightly trickier change will be to set up the internal queuing system to understand and process multiple layers. I've had to put this work on pause the last couple of days due to some pt2itp improvements that needed to land, but I hope to pick this back up next week.

andrewharvey commented 6 years ago

@iandees What do you think about merging https://github.com/osmlab/centerlines into OAv2? I think it's a natural fit, as you get to reuse all the infrastructure code from OA.

iandees commented 6 years ago

Yep, that's been in my plans for a while. Especially now that @ingalls has started work on v2 stuff. For now I'll move it into the openaddresses organization.

ingalls commented 4 years ago

We are committed to the mono-repo approach. Work is progressing rapidly to support the Schema V2. The production instance can be found here: http://batch.openaddresses.io/data

openaddresses / openaddresses-ops

Project Scope #21