sisskind commented 8 years ago

POI's are the same as defined in the current hoot POI to POI conflation. Polygons are defined as buildings and areas (e.g. a park).

Point-to-polygon conflation is a desired function for Hootenanny, so that the user has the ability to conflate features that are the same, just in a different geometry. For example, the ability to conflate a building POI with the building footprint.

This is the top level epic for this task.

High level steps:

evaluate the current state of the conflation; getting any existing tests up and running again
determine if the current set of quality tests is adequate or that more need to be added
establish conflation correctness exist goals and and work to achieve them
update docs, if necessary

sisskind commented 8 years ago

https://github.com/DigitalGlobe/VGI-team-repo/issues/1

bwitham commented 8 years ago

During speaking with Jason last month...some work on this has already been done, but much more needs to be done. So, first task is to figure out what the current state of the code is for this.

bwitham commented 8 years ago

Briefly perused the existing code and looked at the model training test in nightly. The test is missing some scripts in source control, so will start tracking those down next week.

bwitham commented 8 years ago

Got the scripts from Jason. Working on getting the poi building model training test running again.

bwitham commented 8 years ago

starting scores published after switching poi to poi over from PLACES to Unifying (poi to poi much improved after switch).

For tests B and C, not a huge difference in conflation quality when running with POI to POI vs without.

For test A, also similar totals between the two, but a much better correct percentage when running without POI to POI.

In test D, much better results when running with POI to POI. Also, since test D uses a much larger dataset than the others, running with POI to POI results in a very noticeable 4x-5x longer runtime compared to without POI to POI.

bwitham commented 8 years ago

EDIT

bwitham commented 8 years ago

Sorting through hoot:wrong in test output, test by test, now to see if there are any low hanging fruit situations where the conflation can be improved.

bwitham commented 8 years ago

After having looked at only a small part of the output in the A and B tests, so far am seeing several instances where a single poi building was matched against a few or more polygon buildings by the manual matcher, but hoot calls it a review. Given the rules layed out for poi building, this is acceptable I think, and obviously, isn't counting against the overall correctness score (correct + reviews). Will start hunting down hoot:wrong misses next after I fix some other bugs found while doing looking at test output.

bwitham commented 8 years ago

Resuming output examination after #854 and #856.

Another thing to start thinking about is why the scores drop when not running RemoveIrrelevants.js, which removes non-poi's and non-buildings from the input. Running that won't be an option in a production environment, so need to think about how it can be removed from the tests.

bwitham commented 8 years ago

EDIT

bwitham commented 8 years ago

Digging into hoot:wrong and looking at the first set of items where a match was expected but hoot classified as miss, I'm seeing a poi that is involved in reviews with the two buildings that it didn't match too. This seems to me like it should fall under expected match, got review instead, which would take it out of the wrong category. Looking more closely at MatchComparator to see if I'm just misunderstanding the output.

bwitham commented 8 years ago

I may have found proof that #167 needs more attention after all....or maybe I've found some other kind of issue altogether.

For the poi and two buildings mentioned in the previous comment, I'm seeing a case where the poi gets marked as needing to be reviewed against each building (as it should) by the poi poly matcher, but MatchComparator has null element id's for both buildings, so it ignores those two reviews and classifies them as wrong. The two buildings in question were merged (presumably by the building merger), and that is why the original building uuid's don't exist (new concatenated uuid's exist in their place).

So, technically applying the wrong score may be correct in this situation b/c the building merge occurred after the poi to building review was created. I wonder if there is some way to order the matching/merging so that the poi to building reviews in this example have the latest building uuid's and, therefore, will no longer fall into the wrong category...

I could also turn off the building matching, but then scores will drop significantly.

bwitham commented 8 years ago

There is a new unique situation here where since both poi poly and building conflation match buildings, the merging done by one can affect the performance of the other. In this case, the building merging is having an adverse affect on the poi to poly conflation. In the current unifying conflation workflow, I'm not sure how this situation can be handled better....need to think about it for awhile.

bwitham commented 8 years ago

Ming recently fixed an issue with the review counts in the scoring, so the scores have change quite a bit...mostly b/c of increased reviews. Will post those here next. Also, need to go back and run the old version of the tests without poi to poi.

bwitham commented 8 years ago

EDIT

866 may or may not be relevant now...won't know until I look at the output closely again.

bwitham commented 8 years ago

EDIT

Test D is crashing in the RemovePoiToPoiRefs.js script with "Found a REF2 that references a non-existing REF1." I'm not sure its worth trying to fix at this point yet.

bwitham commented 8 years ago

EDIT

drew-bower commented 8 years ago

@bwitham would it be helpful to get some GA eye balls on this to work with you?

bwitham commented 8 years ago

Yeah, it might. At this point there are too many reviews being generated, especially in cases where there should be no match happening at all between a poi and a building. I'm trying to pick out the best (easiest) examples from that category to focus on changing hoot to classify them correctly. I'm starting this process with two of the smaller datasets, A and B. The other two datasets, tests C and D, are fairly large and will take more time to sift through.

sisskind commented 8 years ago

@bwitham sounds good -- give a holler if you need extra eyes.

drew-bower commented 8 years ago

Ok let's do it.

@sisskind let's get Berkeley and Walton involved here.

bwitham commented 8 years ago

Just realized I need to redo the starting scores from 6/15/16, given Ming's scoring bug fix. The starting scores are misleading otherwise.

bwitham commented 8 years ago

@sisskind Thanks, will do.

bwitham commented 8 years ago

@sisskind @drew-bower Thought about this some more yesterday... I want to work for the next couple weeks on fixing some immediate issues with the conflation I've seen before having anybody try it out at this point. After I work through those, I'll create some instructions for manually enabling poi to building conflation in the UI and then others can run some datasets through it and provide feedback, which may help me find additional things that can be improved. thanks.

sisskind commented 8 years ago

@bwitham Sounds like a solid plan. Let's keep @curranmapper in the loop as well.

bwitham commented 8 years ago

Summary of what I know (or think I do) at this point after focusing on only test datasets A and B:

Reviews increase when unifying poi to poi is enabled. This makes sense to me, b/c the manual matcher was focusing on poi to polygon building matching, but when poi's are also being compared to each other additional reviews are going to be generated for things the manual matcher simply didn't think were matches at all. Having an option to turn poi to poi off when poi to poly is enabled I guess is one solution to the problem. I'd like to avoid that, b/c how would a user know to turn it off? Or what if a user needed to do poi to poi and poi to poly in the same conflation job? I'm going to focus on the w/o poi to poi test for now, since there's less wrong stuff to sift through. Maybe if I fix some things there, then looking at the w/ poi to poi output won't be as bad.
What was described in #866 (closed it to focus on in this issue instead) is still a problem, even after #882. However, so far I've only found that it accounts for about 2.2% correct score drop in test A only....but still would like to fix it, as it may be more prevalent in the C and D datasets (don't know yet). I'll try to post a more visual example later, but here's one example: * a polygon building in dataset 2 is tagged as a match with another polygon building in dataset1, and a poi in dataset 2 is tagged as a review with both of those buildings. * The building matcher correctly matches the two polygon buildings. * the poi-poly matcher flags the review correctly for the poi against each building * the building merger runs, merging the two buildings into one, creating a new building with a new uuid, and thus invalidates the old building uuid's \ When scoring the poi-poly review, the match comparator sees that the original building uuid's are gone and then scores the poi-poly conflation as wrong, even though it correctly generated both reviews. The only way this could have been kept from happening is if the poi-poly conflation somehow knew about that new merged building uuid, but I'm not sure if that's possible...need to think of what else can be done.
Type matching with poi-poly seems a little loosey goosey, which is causing a lot of reviews. If so, maybe I can tighten it up to cut back on some of these reviews generated for things that would have otherwise been complete misses.
A lot of incorrect scores come in situations where a poi is plopped down in the middle of a group of buildings, where those buildings aren't matched to each other. Since the poi is in the middle of the building group, the match distance to any one building is too high to match. So, we're getting a review where the manual matcher called it a match. Also, if the buildings had been matched to each other, then you'd see the #866 affect anyway. I'm not sure what can be done about this other than generating a review.... BUT...crazy idea, but I wonder if a feature extractor could be created that calculated a "surroundedness" value related to how closely surrounded a poi is by building polygons. The value would be highest when the poi was surrounded by buildings that were within a certain distance...maybe that could help convert some of these reviews into matches...??
None of this data has name attributes. That's making it harder to get enough evidence for a match, but nothing can really be done about that. Possibly, more things can be added to the evidence scoring for poi-poly to make up for this (?)
A lot of the poi-poly review relations end up being invalid due to one or more of their members being dropped in the output. I think this may also be due to #866, but need to look a little more closely. Originally, I had created the RemoveEmptyReviewRelationsVisitor to deal with this, not knowing why it was occuring. Unfortunately, if only one of the relation members was removed, then RemoveEmptyReviewRelationsVisitor doesn't help. I temporarily modified RemoveEmptyReviewRelationsVisitor locally to also remove review relations with only one member so I would have less reviews to look at, however, that wouldn't work as a permanent change, since one member review relations can be valid in certain situations. Also, its strange that removing them isn't affecting the scoring, so that's also a red flag. Possibly, since there wouldn't ever be a poi-poly review relation with one member (I don't think). I could create a RemovePoiPolyOneMemberReviewRelationVisitor (?). That seems a little absurd. I guess solving the #866 problem might also solve this problem and be a better approach.

bwitham commented 8 years ago

EDIT

bwitham commented 8 years ago

I began to fall into the trap of correcting too many things at the expense of giving up more reviews to the point where the wrong count kept going up an up, so I'm changing my strategy a bit. Also, I've made this a little easier by re-doing all the scoring with the building to building and poi to poi removed, so I can just focus on poi to building. Will repost starting and current scores.

bwitham commented 8 years ago

edit

bwitham commented 8 years ago

Side note: was able to fix the crashing D dataset test when removing poi to poi refs by correcting a bad REF2 tag in the dataset

bwitham commented 8 years ago

Assuming some of the changes I've made can be converted from hardcoded ones to schema changes, I'm fairly happy with the A and B results now...although, those datasets are fairly small.

The C and D results aren't great but may be as good as I'm going to be able to get them, as I've run out of ideas on making them conflate better for now. To be fair, they do have fairly poor type and address attribution for the most part.

I'm going to try and run the features through weka and see if anything pops out. After that, clean up the code and get some feedback on it.

bwitham commented 8 years ago

Extracted features for A, B, and C. Nothing great so far, but maybe will help with some small improvements. Extracting features for D results in a seg fault, so will look into that soon.

drew-bower commented 8 years ago

Are we at the point of getting some GA eyeballs on these examples? Specifically to help with the poor type and address information problem?

bwitham commented 8 years ago

The test data has manual matches in it and I'm guessing was specifically selected with poor attribution to see how well hoot could do against real world datasets...so no changes need to be made to the test data.

But, yeah, after I finish messing with weka I can clean this up and push it to a branch where it can be tried out by others on the command line with whatever data they have. I'll do that this week, then will see if Surratt has any ideas for things to try that I didn't think of...after that, I think I this may be as good as its going to get.

drew-bower commented 8 years ago

Ok so the test cases were challenging.. Got it Let us know when eyeballs are appropriate.

bwitham commented 8 years ago

Using factors with Weka yielded no measurable improvement. Now in the process of cleaning up the needed schema changes. After that will push to a branch for review and testing.

bwitham commented 8 years ago

https://github.com/ngageoint/hootenanny/wiki/POI-Building-Conflation-Prototype

@drew-bower Above is a link with instructions on how this conflation can be tried from the command line, if you know anyone interested in it. It may or may not get merged into the production code soon, depending on how the review for it goes. I'm switching off this for awhile to look at the ME conflate workflow related tasks.

sisskind commented 8 years ago

@bwitham I'm back Monday and can give it a whirl.

drew-bower commented 8 years ago

I'll pass this one off to @curranmapper.

bwitham commented 8 years ago

Talked with Rob today...he's trying it out.

bwitham commented 8 years ago

Got some more ideas from Jason on how to clean this up today, so will keep the code in the branch for now and keep working on it in conjunction with the other things I'm working on.

bwitham commented 8 years ago

Notes 8/23/16

reinstate w/ building to building and poi to poi regression test - DONE
arbitrary 100m poi ce - remove it - DONE
poipoly match creator index: needs updates - DONE
- should be indexing only building ways or relations
determine if matchcomparator building merging thing during poi to poly only conflation is a bug or not and fix if so - DONE
clean up invalid reviews - DONE
use custom match and review distances similar to poi to poi
- calc distances for ref1, ref2, review and graph to get ideas of what values to use
test for feature building surroundedness by calculating distance to a convex poly (1.4% in dataset C?)
- calc alpha shape and use very large large alpha value to get the convex poly
make the match dist thresh dynamic based on building density
- use the input map to calculate the density
add test cases; test for matches with poi's in ds 1 and 2

correct any questionable manual matches as you go along

note: generic implementation - no need for it due to we've reduced the scope of the types of features we're conflating

bwitham commented 8 years ago

For the record, cleaned up ~dozen manual matches in dataset C that seemed erroneous to me. Versioned the new data as v3.

bwitham commented 8 years ago

Changed four records in dataset D where the poi was manually matched to a building way node with address info instead of the building poly containing the node.

Changed one record that didn't seem to be a reasonable manual match.

Versioned the new data as v2.

bwitham commented 8 years ago

I'm going to explore address matching one more time before working more on #1001. I saw a fairly small increase in performance with dataset C by using it. I saw a decrease in performance in dataset D when using it, but looking at the increased wrong, I'm strongly disagreeing with some of the manual matches involved, so am going to go through and do some more cleanup.

bwitham commented 8 years ago

So, address matching does seem to end up being significant in dataset D (need to get the final scores to verify exactly how significant, though). The reasons it hasn't been up until now:

Many addresses were embedded in building way nodes, rather than the building polys themselves. The matcher was missing them before and has now been corrected to handle this situation.
Going through the matches, there are several manual matches that had both matching type or name and matching address and were pretty clearly matches when adding the imagery, but they were recorded as misses. Those have been updated. Similarly, there were instances where the poi was close to the poly and they had a matching address and nothing else in common. In my mind, that qualifies for at least a review. Those have also been corrected.

bwitham commented 8 years ago

Figured out as I was working on custom match/review distances that there are several manual matches I was getting wrong b/c a POI was being matched to an area surrounding a group of buildings rather than the building themselves. I wasn't catching it b/c the pre-clean script being used was removing all poly areas (non-buildings). Keeping the areas, causes hoot to get those type of matches correct but does result in an overall drop in scores, so need to figure out why.

bwitham commented 8 years ago

So, several of the poi to area poly's mismatches are valid issues with the conflation. I have some ideas on how to match them better and am working on it now. Some of the wrong were due to area to area (way to way) manual matches, which this matcher does not handle so added script logic to remove those matches from the input.

bwitham commented 8 years ago

forgot to mention that ongoing commits for this have switched from the 205 branch to the 205-test branch to allow some stability for anyone using the prototype code off of the 205 branch.

curranMapper commented 8 years ago

As far as creating additional training data, have reached out for nominations for user datasets to DG Tampa and VGI team to create additional training data with. Will push forward with creating training set with the best we have available after schedule discussion with @bwitham @kweint on Monday.

ngageoint / hootenanny

POI to polygon conflation initial implementation (CLOSED) #205

866 may or may not be relevant now...won't know until I look at the output closely again.