ngageoint / hootenanny

Hootenanny conflates multiple maps into a single seamless map.
GNU General Public License v3.0

Hoot deletes and then re-creates reference features during merging/splitting, rather than modifying them, which results in geographically valid, but heavy-handed changesets #1488

Closed brianhatchl closed 7 years ago

brianhatchl commented 7 years ago

Hoot assumes that anywhere it needs to split or merge features, deleting the old features and creating new ones is acceptable. Geospatially this does no harm, but the end result can be a changeset that contains many removals of reference features. In cases where you want to protect your reference data from unnecessary changes (MapEdit) and maintain good provenance, this behavior is undesirable.

The goal of this task is to retain reference feature IDs so that the resulting changesets always modify a reference feature rather than delete it. New features will still be created as the result of splits/merges, but the original reference feature that was split/merged will no longer be deleted.
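To make the intent concrete, here is a minimal sketch (plain Python with invented element IDs and function names, not hoot's actual API) of how ID preservation changes the changeset derived for a way split:

```python
# Illustrative sketch only: models the desired changeset behavior described
# above using plain dicts; none of these names come from hoot's real code.

def derive_changeset(before, after):
    """Compare two id -> geometry maps and emit changeset operations."""
    ops = []
    for eid in before:
        if eid not in after:
            ops.append(("delete", eid))
        elif after[eid] != before[eid]:
            ops.append(("modify", eid))
    for eid in after:
        if eid not in before:
            ops.append(("create", eid))
    return sorted(ops)

ref = {1: "ABC"}                      # reference way 1 with geometry "ABC"
old_split = {100: "AB", 101: "BC"}    # old behavior: way 1 deleted, 100/101 created
new_split = {1: "AB", 101: "BC"}      # desired: way 1 keeps its ID, only scrap is new

assert derive_changeset(ref, old_split) == [("create", 100), ("create", 101), ("delete", 1)]
assert derive_changeset(ref, new_split) == [("create", 101), ("modify", 1)]
```

Keeping the reference ID for the first piece turns a delete plus two creates into a modify plus one create, which is the provenance-preserving result this task is after.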

bwitham commented 7 years ago

It should be adding in modify changes when necessary... not sure why it's not. List the inputs and the command syntax used, and I'll take a look after I finish the ingest work I'm on right now and a couple of smaller things I need to do for Curran on poi to poly first.

brianhatchl commented 7 years ago

Looking at the osc, it seems like all the features are getting re-mapped IDs, so maybe the conflation command is missing a required parameter. I'll paste it here when I dig it up.

brianhatchl commented 7 years ago
```shell
hoot --conflate -C RemoveReview2Pre.conf \
  -D osm2ogr.ops=hoot::DecomposeBuildingRelationsVisitor \
  -D conflate.add.score.tags=yes \
  -D hootapi.db.writer.overwrite.map=true \
  -D hootapi.db.writer.create.user=true \
  -D api.db.email=test@test.com \
  -D "map.cleaner.transforms=hoot::ReprojectToPlanarOp;hoot::DuplicateWayRemover;hoot::SuperfluousWayRemover;hoot::IntersectionSplitter;hoot::UnlikelyIntersectionRemover;hoot::DualWaySplitter;hoot::ImpliedDividedMarker;hoot::DuplicateNameRemover;hoot::SmallWayMerger;hoot::RemoveEmptyAreasVisitor;hoot::RemoveDuplicateAreaVisitor;hoot::NoInformationElementRemover" \
  -D small.way.merger.threshold=15 \
  -D unify.optimizer.time.limit=60 \
  -D ogr.split.o2s=false \
  -D ogr.tds.add.fcsubtype=true \
  -D ogr.tds.structure=true \
  -D duplicate.name.case.sensitive=true \
  -D element.cache.size.node=2000000 \
  -D element.cache.size.relation=200000 \
  -D element.cache.size.way=200000 \
  -D conflate.match.highway.classifier=hoot::HighwayRfClassifier \
  -D "match.creators=hoot::HighwayMatchCreator;hoot::BuildingMatchCreator;hoot::ScriptMatchCreator,PoiGeneric.js;hoot::ScriptMatchCreator,LinearWaterway.js" \
  -D "merger.creators=hoot::HighwaySnapMergerCreator;hoot::BuildingMergerCreator;hoot::ScriptMergerCreator" \
  -D search.radius.highway=-1 \
  -D highway.matcher.heading.delta=5.0 \
  -D highway.matcher.max.angle=60 \
  -D way.merger.min.split.size=5 \
  -D conflate.enable.old.roads=false \
  -D waterway.subline.matcher=hoot::MaximalSublineMatcher \
  -D waterway.angle.sample.distance=20.0 \
  -D waterway.matcher.heading.delta=150.0 \
  -D waterway.auto.calc.search.radius=true \
  -D search.radius.waterway=-1 \
  -D waterway.rubber.sheet.minimum.ties=5 \
  -D waterway.rubber.sheet.ref=true \
  -D writer.include.debug=false \
  -D convert.bounding.box=34.04725915004996,31.17907522629906,34.05654176863703,31.18329523832873 \
  -D conflate.use.data.source.ids=true \
  -D osm.map.reader.factory.reader=hoot::OsmApiDbAwareHootApiDbReader \
  -D osm.map.writer.factory.writer=hoot::OsmApiDbAwareHootApiDbWriter \
  -D 'osmapidb.id.aware.url=osmapidb://vagrant:vagrant!@localhost:5432/openstreetmap' \
  'osmapidb://vagrant:vagrant!@localhost:5432/openstreetmap' \
  hootapidb://hoot:hoottest@localhost:5432/hoot/293 \
  hootapidb://hoot:hoottest@localhost:5432/hoot/openstreetmap_blank2
```
bwitham commented 7 years ago

Do you have the source data for this? thanks.

brianhatchl commented 7 years ago

sent in email

bwitham commented 7 years ago

Update: In the command line test workflow, at least, I've determined where the IDs were being regenerated for the OSM API database data, and they definitely should not have been getting regenerated. Made the fix in the test script and was able to keep the original node IDs intact unless hoot conflation modified them, which is ok (still need to port the changes over to the services code). Haven't had a chance yet to look at the output changesets, but they should have fewer create/delete change combos and more modify changes. The changes run successfully against Rafah unifying but crash against Rafah network, where the changeset tries to insert a node that already exists into the historical nodes table... so need to look into that next.

bwitham commented 7 years ago

Even after these fixes, the changesets are still made up of mostly create/delete statements vs. modify, due to the fact that when hoot conflates two ways it simply overwrites the information of one with the other. The bookkeeping for this happens completely outside of the changeset derivation, so the changeset knows nothing about the provenance.

Going to have to think of some way to pass that information on from the conflation logic to the changeset derivation logic to transform some of the create/delete combos into modify statements OR rethink the conflation workflow again to allow that to happen somehow.
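One way that information could be passed along, sketched under the assumption that conflation can record a map from each newly created element to the reference (unknown1) element it replaced. All names and data shapes here are invented for illustration; this is not hoot's real changeset code:

```python
# Hypothetical sketch: collapse delete/create pairs into modifies using a
# provenance map recorded during conflation (new element id -> reference id
# it replaced). Data shapes are invented, not hoot's real structures.

def collapse_to_modifies(ops, replaced_ref_id):
    """ops: list of (kind, element_id) changeset operations."""
    deletes = {eid for kind, eid in ops if kind == "delete"}
    out = []
    for kind, eid in ops:
        if kind == "create" and replaced_ref_id.get(eid) in deletes:
            ref = replaced_ref_id[eid]
            deletes.discard(ref)
            out.append(("modify", ref))   # create+delete pair becomes a modify
        elif kind == "delete":
            continue                      # re-emitted below only if unmatched
        else:
            out.append((kind, eid))
    out.extend(("delete", eid) for eid in sorted(deletes))
    return out

ops = [("delete", 1), ("create", 100), ("create", 101), ("delete", 2)]
assert collapse_to_modifies(ops, {100: 1}) == [("modify", 1), ("create", 101), ("delete", 2)]
```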

bwitham commented 7 years ago

Actually, maybe it's as easy as looking for status=3 (conflated) and forcing modifies in those situations...

bwitham commented 7 years ago

Status=3 alone won't be enough, b/c we need to know which osm api db (status=1) feature was replaced with the status=3 feature by either the hoot clean or conflate operation.

Hoot used to keep a uuid tag on all features and concatenate IDs together during conflation, which would hold valuable information for this, but I'm not sure that tag is still around. It may have gone away after the review refactoring. If that tag isn't still around, adding a tag that tracks the unknown1 provenance for changeset purposes will be a massive change that touches almost every part of the conflation and cleaning code.

bwitham commented 7 years ago

New ways seem to be created when ways from each dataset are merged at the end of the conflation, even if the way geometry kept is identical or nearly identical to that in the reference dataset (possibly with additional nodes added during splitting). So, maybe it's possible to force the mergers to use the reference feature ID (osm api db feature ID). That would help keep the provenance of features in the reference dataset and result in modify changes vs. creates/deletes. Not sure if keeping the ID would be desired in every merge case, but I can give it a try and see what happens.
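A minimal sketch of what "force the merger to use the reference feature ID" could look like. The `Way` class, status constants, and merge function are stand-ins for illustration, not hoot's HighwaySnapMerger:

```python
# Stand-in types for illustration only; not hoot's real classes.
from dataclasses import dataclass

UNKNOWN1, UNKNOWN2, CONFLATED = 1, 2, 3  # hoot-style status codes

@dataclass
class Way:
    eid: int
    status: int
    tags: dict

def merge_keep_reference_id(ref, sec):
    """Merge a secondary way into a reference way, retaining the reference
    (unknown1) element ID so changeset derivation sees a modify rather than
    a delete/create pair. The tag-merge policy here (reference tags win) is
    an illustrative assumption."""
    merged_tags = {**sec.tags, **ref.tags}
    return Way(eid=ref.eid, status=CONFLATED, tags=merged_tags)

a = Way(1, UNKNOWN1, {"highway": "primary"})
b = Way(57, UNKNOWN2, {"highway": "primary", "surface": "paved"})
m = merge_keep_reference_id(a, b)
assert m.eid == 1 and m.status == CONFLATED
assert m.tags == {"highway": "primary", "surface": "paved"}
```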

bwitham commented 7 years ago

Was able to hack an option into HighwaySnapMerger to retain the unknown1 feature ID post merge (it caused a problem with the hilbert curve op, but I'll come back to that later). For unifying, that resulted in three new modify statements, all of which make sense so far and seem desirable. The output looks a little different, but I'm not sure that's a bad thing yet. The code definitely needs a ton more testing with other datasets, and similar changes would need to be added to all mergers if it's the right logic.

For Rafah, the next place to look is the intersection splitter. I'm not completely convinced yet that a modify change there makes sense, since it will split the same way many times... but then again, it may make sense to retain the original feature ID the first time the way is split and then create new IDs after that... going to try that logic next.
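The "keep the original ID on the first piece, mint new IDs for the rest" idea might look roughly like this. This is a plain-Python sketch; `split_way` and `next_new_id` are stand-ins, not hoot's IntersectionSplitter API:

```python
# Illustration of the proposed splitter ID policy: the first piece of a split
# way keeps the original (reference) ID; subsequent pieces get new IDs.
# next_new_id is a stand-in for hoot's ID allocator.

def split_way(way_id, node_ids, split_positions, next_new_id):
    """Split node_ids at the given node indices; returns (id, nodes) pieces."""
    pieces = []
    bounds = [0] + sorted(split_positions) + [len(node_ids) - 1]
    for i in range(len(bounds) - 1):
        nodes = node_ids[bounds[i]:bounds[i + 1] + 1]  # pieces share split nodes
        pid = way_id if i == 0 else next_new_id()
        pieces.append((pid, nodes))
    return pieces

counter = iter(range(100, 200))
result = split_way(4, [10, 11, 12, 13, 14], [2], lambda: next(counter))
assert result == [(4, [10, 11, 12]), (100, [12, 13, 14])]
```

Splitting way 4 once yields a modify of way 4 (shortened geometry) plus one create for the scrap, instead of a delete and two creates.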

bwitham commented 7 years ago

Successfully made the ID preservation changes in the intersection splitter.

Worth going over one more time, but now Rafah output seems to show no deletes/creates for osm api db data that wasn't modified by conflation and only modify statements for any osm api db data that was modified by conflation (can still be additional create statements in the case of splitting, which is ok).

The same changes will have to be applied every where hoot modifies features during conflation/cleaning, which is likely a ton. I'm going to run the changeset derivation against a handful of datasets, work through any necessary ID preservation changes found in other sections of the code, and then after that leave all other fixes to be found on a case by case basis to prevent this from taking weeks to finish.

Next:

bwitham commented 7 years ago

Haiti and ADT output looks good, but haven't looked at their changesets yet. DC output doesn't look good though, so will look at that first before looking at all of the changesets.

bwitham commented 7 years ago

DC streets triggers several extra reviews inside the conflate AOI with the ID preservation setting enabled (not good). Also, a handful of other features outside of the conflate AOI don't merge but should, with the setting off or on... some other ID-related change must be causing that (also not good).

bwitham commented 7 years ago

The ID behavior refactoring may actually have nothing to do with the differences...not completely sure. But another change I had forgotten about was using the convert.bounding.box.osm.api.database option with the aoi in place of the convert.bounding.box option. Need to understand again why I made that change.

Backing out the AOI and ID preservation changes brings me back to the correct osm output, as expected... the changeset output I haven't looked at again yet.

bwitham commented 7 years ago

Thinking more about it... convert.bounding.box.osm.api.database just doesn't make much sense to me as an option in this situation. Replacing that back with convert.bounding.box cleans up the DC output but causes a lot of features to drop from the Rafah output. I'm going to try to figure out what's causing that.

bwitham commented 7 years ago

Reverting DC conflation back to applying the same AOI to both data sources, along with the ID preservation, works. Also, I can get it looking good for different AOIs within Rafah. But the original AOI drops features in the output... so some forward progress overall. Going to stay with this cropping strategy and try to find out what's weird with the one cropped section of Rafah. Next, look at the DC changeset...

bwitham commented 7 years ago

The section of code where the unknown1 IDs are retained from the split scraps is what is wreaking havoc on the DC output. Take that code out and everything looks good with DC. Unfortunately, some Rafah ways which should be modified then go back to delete/create combos.

So I'm really dealing with two separate problems here:

Going to focus on fixing the second issue, since that's the point of this task. Then the first issue could be dealt with as part of #1256, if I can't figure out how to fix it here.

bwitham commented 7 years ago

Ok, slight progress in DC. I was generating half a dozen unnecessary reviews, b/c I was leaving a reference to the unknown2 element that I had replaced with the unknown1 ID. That was causing hoot to log missing-ID reviews. After removing that ID from the bookkeeping, I'm now getting 4 "complex conflict" reviews generated instead... slightly better.

bwitham commented 7 years ago

Complex conflict reviews are caused by an invalid match that is returned by the subline matcher...I'm sure I've messed something else up somewhere with these changes to cause them to appear.

bwitham commented 7 years ago

Seem to have found a compromise that works for DC. Moved the reference ID mapping and ID replacement to just after conflation, rather than during the merging. No crashes or strange reviews. Now to see what Rafah looks like again.
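The compromise of recording reference ID mappings during merging but applying the replacement in one pass after conflation finishes might be sketched like this; all names here are invented for illustration and do not exist in hoot:

```python
# Hypothetical sketch of a post-conflation ID replacement pass: during merging
# we only *record* which new element should take over a reference ID, then
# rewrite IDs once conflation has finished, so mergers and the subline matcher
# never see half-renamed elements mid-run.

class IdReplacementRecorder:
    def __init__(self):
        self.pending = {}  # temporary (new) id -> reference id to restore

    def record(self, new_id, ref_id):
        self.pending[new_id] = ref_id

    def apply(self, elements):
        """elements: id -> payload; returns a map with reference IDs restored."""
        return {self.pending.get(eid, eid): data for eid, data in elements.items()}

rec = IdReplacementRecorder()
rec.record(-101, 4)  # merged way -101 should keep reference way 4's ID
elements = {-101: "merged geometry", -102: "new scrap"}
assert rec.apply(elements) == {4: "merged geometry", -102: "new scrap"}
```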

bwitham commented 7 years ago

Rafah changeset looks good with the latest changes except for the unexplained dropped features mentioned before. I'm going to go ahead and see if that can be corrected now vs kicking the can down the road.

bwitham commented 7 years ago

The dropped feature problem is going to be hard. Will deal with it in one of the other issues instead. Next, going to see if any other mergers/cleaning operations besides HighwaySnapMerger and IntersectionSplitter need these changes.

bwitham commented 7 years ago

These are the classes I believe still need to be modified to support the ID preservation option (there may be others, but this should be most of them):

bwitham commented 7 years ago

Back to the original plan of holding off on this change for the other conflation/cleaning types, b/c it's so large in scope. I'd rather see this change proven successful with the current holy grail datasets before making it across the board.

bwitham commented 7 years ago

Seemingly the only service change needed for this was the addition of the new preserve.unknown1.element.id.when.modifying.features setting to ConflateCommand, although I haven't tested it out yet.

Came across a problem, though. A couple of weeks ago, out of frustration, I added a reader.set.status option to force the hoot api db data to be unknown2. Looking at the conflation code that reads the data out of the hoot api db, it makes no sense that this setting should be needed. Unfortunately, removing it does break the whole holy grail conflation scenario. The problem with leaving it in is that there's no corresponding way to use the setting in the services code... it works fine in the command line script, where the data is being prepped just before conflation, but it would make no sense in the real services-based workflow. So, I need to understand why the setting is needed and, if possible, either get rid of it or replicate the logic in the conflate command... another mind-numbing day of debugging for this tomorrow :-)

bwitham commented 7 years ago

There is some logic in OsmApiDbAwareHootApiDbWriter::writePartial, used when initially writing the secondary dataset to the hoot api db, that is causing this problem. That logic needs to be moved to another location, I think, since the hoot api data will have already been written when running the real world changeset derivation scenario.

bwitham commented 7 years ago

I was able to get rid of reader.set.default.status (formerly reader.set.status) from the script and get valid input by restoring some code in OsmApiDbAwareHootApiDbReader that I had cleaved off a couple of weeks ago. The output is valid, but there are gaps in the ID ordering, which must be caused by too many calls to getNextId. So, I need to prevent that, and then this should be fixed.

bmarchant commented 7 years ago

@bwitham DuplicateWayRemover will be taken care of in #1518

bwitham commented 7 years ago

@bmarchant Cool. thanks. I wasn't going to mess with it in this task anyway. Decided to do just the bare minimum to get one test scenario to pass, which so far has only involved modifying HighwaySnapMerger and IntersectionSplitter.

bwitham commented 7 years ago

There is actually only one gap in the ID ordering. An extra ID increment of one for all data types is happening after the initial reference data is read...that's it. I'm willing to live with that for now. The rest of the ID handling looks good. Time to move onto the dropped features in changeset problem.

bwitham commented 7 years ago

This appears to be fixed now. The changesets are kind of useless with the dropped features, so I'm not going to merge this yet; I'm going to continue working in this branch to make the #1256 changes.

bwitham commented 7 years ago

Some additional notes:

I am seeing a strange dangling node in the Rafah output. That looks to be a problem with the conflation, not the changeset derivation, so opened #1558.

Additionally, way 4 from the reference dataset is the only one I'm still seeing with unexplained behavior. It's being completely replaced by way 33 instead of being modified. I've been unable to track down where the ID is disappearing to... will come back to it.

bwitham commented 7 years ago

Changes I just made as part of #1256 seem to have fixed the way 4 problem.

bwitham commented 7 years ago

TODO:

bwitham commented 7 years ago

merged