osmlab / labuildings

Los Angeles County building import
BSD 3-Clause "New" or "Revised" License

Need explicit conflict-resolution policy and workflow #65

Open dkogan opened 8 years ago

dkogan commented 8 years ago

Hi. I think we should decide very explicitly how data conflicts (between previous data, import data, aerial imagery) should be resolved. We talked about this a bit at the group session today, but I don't think we came up with a good solution. I'd like to see a workflow where the changesets and their logs are clean. The more machine-generated data we have, the more effective future scripting efforts will be; and thus it'd be good to have a very clear demarcation about what is machine-generated, and what isn't. Currently we're loading machine-generated data into JOSM, fiddling with it, and uploading a changeset that is some uncertain mix of machine- and human-generated data. If possible, I want to get away from that.

Motivating example: Suppose that in some hypothetical future, the authors of the data we're importing want to import the OSM data to THEIR dataset, to benefit from the human knowledge that OSM contains. THEY will have similar conflict-resolution pains. But if our changesets were very clear about what is the original imported data and what is a result of human fiddling, their job will be WAY easier. I.e. they only want to apply the human-generated data, and a clear demarcation helps.

I'd like to propose the following workflow: instead of each imported chunk being uploaded in a single changeset, we make two separate changesets:

  1. The machine-generated data only. This would contain no human intervention, so it would be full of errors. Stuff would overlap, new buildings would be missing, demolished buildings would still be there, etc. The log message would be machine generated to some standard string (LA import blah blah blah)
  2. The human-generated corrections only. The log message would describe what was actually changed; as it should be. These human-generated corrections could comprise several changesets, depending on what's actually in them. I'm not sure if these should be uploaded by the xxx_import user or by the xxx user.

This aligns more closely to how version control is done on code (or at least how it should be done!)
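To make the split concrete, here is a minimal sketch of how the two changesets could be labelled so that scripts can tell them apart later. It only builds the changeset tag bodies and does not talk to the server. The `import:stage` and `import:chunk` tag keys, the function names, and the comment strings are all made up for illustration; they are not an existing convention.

```python
import xml.etree.ElementTree as ET


def changeset_payload(tags):
    """Serialize changeset tags into an <osm><changeset> XML body."""
    osm = ET.Element("osm")
    changeset = ET.SubElement(osm, "changeset")
    for key, value in tags.items():
        ET.SubElement(changeset, "tag", k=key, v=value)
    return ET.tostring(osm, encoding="unicode")


def two_stage_payloads(chunk_name, human_comment):
    """Return (machine, human) changeset bodies for one imported chunk.

    The "import:stage" and "import:chunk" keys are hypothetical markers.
    """
    machine_tags = {
        # Fixed, machine-generated log message for the raw import.
        "comment": f"LA import: unmodified chunk {chunk_name}",
        "import:stage": "machine",
        "import:chunk": chunk_name,
    }
    human_tags = {
        # Free-text log message describing what the person actually changed.
        "comment": human_comment,
        "import:stage": "human",
        "import:chunk": chunk_name,
    }
    return changeset_payload(machine_tags), changeset_payload(human_tags)
```

A later script could then select only the changesets marked `human` for a given chunk, which is the demarcation the motivating example asks for.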

Some tooling changes would be required. I'm not yet intimately familiar with the internals of JOSM and of the OSM protocol, so I haven't looked at the implementation details. We'd want to minimize the period of time between the two changesets being uploaded, since the first changeset puts the map in a potentially-wrong state. Ideally we'd generate the two separate transactions, and send them to the server in some atomic way. Does anybody know if JOSM and/or the server support this? I can look it up, but if one of y'all knows, then I'll defer to the experts.

Comments!

almccon commented 8 years ago

This is an interesting idea! I see the problem: when we import a batch of data, all those features are version 1 as far as OSM is concerned, although some of them may have been manually tweaked by the uploader, and therefore should be conceptually version 2.

But you are also right that we would potentially leave the OSM database in a wrong state for too long... things go wrong, and maybe the second changeset doesn't happen... that would be bad for OSM, and further turn public opinion against imports.

I don't know enough about JOSM or the OSM API to know if you could implement some kind of automatic and atomic two-stage changeset upload.

You might want to ask this question on the imports-us list to see how other imports have dealt with this problem.

almccon commented 8 years ago

The other thing is that so far nobody has completed the circle of bringing OSM-enhanced data back into government databases. Firstly there's the license incompatibility that stops it. Secondly, there are just too many subtle ways that the OSM data could evolve (which is great!), making it extremely hard to synchronize the data back again. This is also the reason why it's not very useful for us to include the lacounty:ain and lacounty:bld_id fields. Even if someone wanted to join OSM back with the city data again, many of those tags will be stripped or wrong.

The change detection tool that NYC is (was?) using is the only system that I can imagine would work moving forward (https://www.mapbox.com/blog/nyc-and-openstreetmap-cooperating-through-open-data/) but it doesn't capture the modifications made at the time of import.

Maybe the only way to see what changed is to look at the unmodified .osm chunks from s3 and compare them with the contents of the changesets that were actually imported?
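A rough sketch of what that comparison could look like, assuming both sides are available as OSM XML (a chunk file, or the downloaded changeset contents) and that buildings can be matched on the `lacounty:bld_id` tag right after import, before later edits strip it. Object ids can't be used for matching because new objects get fresh ids on upload. The function names and the report categories are made up for illustration.

```python
import xml.etree.ElementTree as ET

JOIN_KEY = "lacounty:bld_id"  # assumed to still be present at import time


def index_buildings(osm_xml, key=JOIN_KEY):
    """Map each building's join-key value to (tags, outline coordinates)."""
    root = ET.fromstring(osm_xml)
    # iter() walks all descendants, so this also handles osmChange-style
    # files where nodes and ways are nested in <create>/<modify> blocks.
    coords = {
        n.get("id"): (round(float(n.get("lat")), 7), round(float(n.get("lon")), 7))
        for n in root.iter("node")
    }
    buildings = {}
    for way in root.iter("way"):
        tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
        if key in tags:
            shape = tuple(coords.get(nd.get("ref")) for nd in way.findall("nd"))
            buildings[tags[key]] = (tags, shape)
    return buildings


def diff_buildings(original, uploaded):
    """List which buildings were dropped, added, retagged, or reshaped."""
    report = {"dropped": [], "added": [], "retagged": [], "reshaped": []}
    for bld_id in sorted(original.keys() | uploaded.keys()):
        if bld_id not in uploaded:
            report["dropped"].append(bld_id)
        elif bld_id not in original:
            report["added"].append(bld_id)
        else:
            (tags0, shape0), (tags1, shape1) = original[bld_id], uploaded[bld_id]
            if tags0 != tags1:
                report["retagged"].append(bld_id)
            if shape0 != shape1:
                report["reshaped"].append(bld_id)
    return report
```

Usage would be along the lines of `diff_buildings(index_buildings(chunk_xml), index_buildings(changeset_xml))`. Buildings the uploader drew by hand carry no join key and would not show up at all, so this only approximates the human edits.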