ngageoint / hootenanny

Hootenanny conflates multiple maps into a single seamless map.
GNU General Public License v3.0

POI to polygon conflation initial implementation (CLOSED) #205

Closed · sisskind closed this issue 7 years ago

sisskind commented 8 years ago

POIs are defined the same as in the current hoot POI to POI conflation. Polygons are defined as buildings and areas (e.g., a park).

Point-to-polygon conflation is a desired function for Hootenanny, giving the user the ability to conflate features that represent the same entity but have different geometries: for example, conflating a building POI with the building footprint.
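To make the idea concrete, here is a minimal sketch of the basic geometric test behind such a match (illustrative only, not Hootenanny's actual matcher; the shapely usage and all names here are assumptions for the example):

```python
from shapely.geometry import Point, Polygon

def is_match_candidate(poi, footprint, buffer_dist=0.0):
    """True if the POI falls inside the (optionally buffered) footprint."""
    return footprint.buffer(buffer_dist).contains(poi)

building = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])
poi = Point(5, 5)  # e.g., a building POI digitized inside the footprint
print(is_match_candidate(poi, building))  # True
```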

This is the top level epic for this task.

High-level steps:

sisskind commented 8 years ago

https://github.com/DigitalGlobe/VGI-team-repo/issues/1

bwitham commented 8 years ago

While speaking with Jason last month...some work on this has already been done, but much more needs to be done. So, the first task is to figure out the current state of the code.

bwitham commented 8 years ago

Briefly perused the existing code and looked at the model training test in nightly. The test is missing some scripts in source control, so will start tracking those down next week.

bwitham commented 8 years ago

Got the scripts from Jason. Working on getting the poi building model training test running again.

bwitham commented 8 years ago

Starting scores published after switching POI to POI conflation over from PLACES to Unifying (POI to POI much improved after the switch).

For tests B and C, not a huge difference in conflation quality when running with POI to POI vs without.

For test A, also similar totals between the two, but a much better correct percentage when running without POI to POI.

In test D, much better results when running with POI to POI. Also, since test D uses a much larger dataset than the others, running with POI to POI results in a very noticeable 4x-5x longer runtime compared to without POI to POI.

bwitham commented 8 years ago

EDIT

bwitham commented 8 years ago

Sorting through hoot:wrong in the test output now, test by test, to see if there are any low-hanging-fruit situations where the conflation can be improved.

bwitham commented 8 years ago

After having looked at only a small part of the output in the A and B tests, so far I am seeing several instances where a single POI building was matched against a few or more polygon buildings by the manual matcher, but hoot calls it a review. Given the rules laid out for POI building conflation, this is acceptable I think, and obviously, isn't counting against the overall correctness score (correct + reviews). Will start hunting down hoot:wrong misses next after I fix some other bugs found while looking at the test output.
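For reference, the correctness score described here seems to work roughly like this (a sketch under the assumption stated in this comment that reviews count toward correctness rather than against it; the names are hypothetical, not MatchComparator's actual API):

```python
# Reviews count toward correctness rather than against it (per this comment).
def overall_correctness(correct, wrong, review):
    total = correct + wrong + review
    return (correct + review) / total if total else 0.0

# A manual match that hoot turns into a review still counts toward the score:
print(overall_correctness(correct=80, wrong=5, review=15))  # 0.95
```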

bwitham commented 8 years ago

Resuming output examination after #854 and #856.

Another thing to start thinking about is why the scores drop when not running RemoveIrrelevants.js, which removes non-POIs and non-buildings from the input. Running that won't be an option in a production environment, so I need to think about how it can be removed from the tests.
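A guess at what a RemoveIrrelevants-style pre-filter does (the real RemoveIrrelevants.js isn't shown here, and the tag tests below are simplified assumptions): keep only POIs and building polys, and drop everything else before conflation.

```python
def is_relevant(element):
    tags = element.get("tags", {})
    is_building = "building" in tags                       # building poly
    is_poi = element["type"] == "node" and "name" in tags  # rough POI test
    return is_building or is_poi

elements = [
    {"type": "node", "tags": {"name": "Cafe Luna", "amenity": "cafe"}},  # kept
    {"type": "way", "tags": {"building": "yes"}},                        # kept
    {"type": "way", "tags": {"highway": "residential"}},                 # dropped
]
print([e for e in elements if is_relevant(e)])
```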

bwitham commented 8 years ago

EDIT

bwitham commented 8 years ago

Digging into hoot:wrong and looking at the first set of items where a match was expected but hoot classified it as a miss, I'm seeing a POI that is involved in reviews with the two buildings that it didn't match to. This seems to me like it should fall under "expected match, got review instead", which would take it out of the wrong category. Looking more closely at MatchComparator to see if I'm just misunderstanding the output.

bwitham commented 8 years ago

I may have found proof that #167 needs more attention after all...or maybe I've found some other kind of issue altogether.

For the POI and two buildings mentioned in the previous comment, I'm seeing a case where the POI gets marked as needing to be reviewed against each building (as it should) by the POI poly matcher, but MatchComparator has null element ids for both buildings, so it ignores those two reviews and classifies them as wrong. The two buildings in question were merged (presumably by the building merger), and that is why the original building uuids don't exist (new concatenated uuids exist in their place).

So, technically applying the wrong score may be correct in this situation b/c the building merge occurred after the POI to building review was created. I wonder if there is some way to order the matching/merging so that the POI to building reviews in this example have the latest building uuids and, therefore, will no longer fall into the wrong category...
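A sketch of the ordering problem using hypothetical structures (not hoot's actual data model): a review references buildings by uuid, the building merger then replaces the two buildings with one whose uuid is the concatenation, and the review's lookups come back null, so the review gets scored as wrong.

```python
buildings = {"uuid-a": {}, "uuid-b": {}}
review = {"poi": "poi-1", "buildings": ["uuid-a", "uuid-b"]}

# The building merger runs after the review was created:
merged = {"uuid-a;uuid-b": {}}  # concatenated uuid replaces the originals

for uuid in review["buildings"]:
    print(uuid, merged.get(uuid))  # uuid-a None / uuid-b None -> scored wrong
```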

I could also turn off the building matching, but then scores will drop significantly.

bwitham commented 8 years ago

There is a new, unique situation here: since both POI poly and building conflation match buildings, the merging done by one can affect the performance of the other. In this case, the building merging is having an adverse effect on the POI to poly conflation. In the current unifying conflation workflow, I'm not sure how this situation can be handled better...need to think about it for a while.

bwitham commented 8 years ago

Ming recently fixed an issue with the review counts in the scoring, so the scores have changed quite a bit...mostly b/c of increased reviews. Will post those here next. Also, I need to go back and run the old version of the tests without POI to POI.

bwitham commented 8 years ago

EDIT

#866 may or may not be relevant now...won't know until I look at the output closely again.

bwitham commented 8 years ago

EDIT

Test D is crashing in the RemovePoiToPoiRefs.js script with "Found a REF2 that references a non-existing REF1." I'm not sure it's worth trying to fix at this point yet.
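A minimal sketch of the consistency check that appears to be failing: in manually matched test data, reference features carry REF1 ids and secondary features carry REF2 values pointing at them, so a REF2 value with no corresponding REF1 is a broken manual match. The separator and special values used below are assumptions for illustration.

```python
def check_refs(ref_features, sec_features):
    ref1_ids = {f["tags"]["REF1"] for f in ref_features if "REF1" in f["tags"]}
    for f in sec_features:
        for ref in f["tags"].get("REF2", "").split(";"):
            if ref and ref not in ref1_ids and ref not in ("none", "todo"):
                raise ValueError(
                    f"Found a REF2 that references a non-existing REF1: {ref}")

check_refs([{"tags": {"REF1": "001f4b"}}],
           [{"tags": {"REF2": "001f4b"}}])  # passes; a bad id would raise
```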

bwitham commented 8 years ago

EDIT

drew-bower commented 8 years ago

@bwitham would it be helpful to get some GA eyeballs on this to work with you?

bwitham commented 8 years ago

Yeah, it might. At this point there are too many reviews being generated, especially in cases where there should be no match happening at all between a poi and a building. I'm trying to pick out the best (easiest) examples from that category to focus on changing hoot to classify them correctly. I'm starting this process with two of the smaller datasets, A and B. The other two datasets, tests C and D, are fairly large and will take more time to sift through.

sisskind commented 8 years ago

@bwitham sounds good -- give a holler if you need extra eyes.

drew-bower commented 8 years ago

Ok let's do it.

@sisskind let's get Berkeley and Walton involved here.

bwitham commented 8 years ago

Just realized I need to redo the starting scores from 6/15/16, given Ming's scoring bug fix. The starting scores are misleading otherwise.

bwitham commented 8 years ago

@sisskind Thanks, will do.

bwitham commented 8 years ago

@sisskind @drew-bower Thought about this some more yesterday... I want to work for the next couple of weeks on fixing some immediate issues with the conflation that I've seen, before having anybody try it out. After I work through those, I'll create some instructions for manually enabling POI to building conflation in the UI, and then others can run some datasets through it and provide feedback, which may help me find additional things that can be improved. Thanks.

sisskind commented 8 years ago

@bwitham Sounds like a solid plan. Let's keep @curranmapper in the loop as well.

bwitham commented 8 years ago

Summary of what I know (or think I do) at this point after focusing on only test datasets A and B:

bwitham commented 8 years ago

EDIT

bwitham commented 8 years ago

I began to fall into the trap of correcting too many things at the expense of giving up more reviews, to the point where the wrong count kept going up and up, so I'm changing my strategy a bit. Also, I've made this a little easier by re-doing all the scoring with building to building and POI to POI removed, so I can just focus on POI to building. Will repost starting and current scores.

bwitham commented 8 years ago

edit

bwitham commented 8 years ago

Side note: was able to fix the crashing D dataset test when removing POI to POI refs by correcting a bad REF2 tag in the dataset.

bwitham commented 8 years ago

Assuming some of the changes I've made can be converted from hardcoded ones to schema changes, I'm fairly happy with the A and B results now...although those datasets are fairly small.

The C and D results aren't great but may be as good as I'm going to be able to get them, as I've run out of ideas on making them conflate better for now. To be fair, they do have fairly poor type and address attribution for the most part.

I'm going to try to run the features through Weka and see if anything pops out. After that, I'll clean up the code and get some feedback on it.
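Not hoot's actual export path, just a sketch of getting extracted match features into Weka's ARFF format for exploration; the feature names and values here are made up.

```python
rows = [
    # (distance_score, name_score, address_score, class)
    (0.92, 0.88, 1.0, "match"),
    (0.10, 0.05, 0.0, "miss"),
]
with open("poi_poly_features.arff", "w") as f:
    f.write("@relation poi_poly_matches\n")
    for attr in ("distance_score", "name_score", "address_score"):
        f.write(f"@attribute {attr} numeric\n")
    f.write("@attribute class {match,miss,review}\n")
    f.write("@data\n")
    for *feats, label in rows:
        f.write(",".join(str(v) for v in feats) + f",{label}\n")
```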

bwitham commented 8 years ago

Extracted features for A, B, and C. Nothing great so far, but maybe will help with some small improvements. Extracting features for D results in a seg fault, so will look into that soon.

drew-bower commented 8 years ago

Are we at the point of getting some GA eyeballs on these examples? Specifically to help with the poor type and address information problem?

bwitham commented 8 years ago

The test data has manual matches in it and, I'm guessing, was specifically selected with poor attribution to see how well hoot could do against real-world datasets...so no changes need to be made to the test data.

But, yeah, after I finish messing with Weka I can clean this up and push it to a branch where it can be tried out by others on the command line with whatever data they have. I'll do that this week, then will see if Surratt has any ideas for things to try that I didn't think of...after that, I think this may be as good as it's going to get.

drew-bower commented 8 years ago

Ok, so the test cases were challenging...got it. Let us know when eyeballs are appropriate.

bwitham commented 8 years ago

Using factors with Weka yielded no measurable improvement. Now in the process of cleaning up the needed schema changes. After that will push to a branch for review and testing.

bwitham commented 8 years ago

https://github.com/ngageoint/hootenanny/wiki/POI-Building-Conflation-Prototype

@drew-bower Above is a link with instructions on how this conflation can be tried from the command line, if you know anyone interested in it. It may or may not get merged into the production code soon, depending on how the review for it goes. I'm switching off this for a while to look at the ME conflate workflow-related tasks.

sisskind commented 8 years ago

@bwitham I'm back Monday and can give it a whirl.

drew-bower commented 8 years ago

I'll pass this one off to @curranmapper.

bwitham commented 8 years ago

Talked with Rob today...he's trying it out.

bwitham commented 8 years ago

Got some more ideas from Jason on how to clean this up today, so will keep the code in the branch for now and keep working on it in conjunction with the other things I'm working on.

bwitham commented 8 years ago

Notes 8/23/16

correct any questionable manual matches as you go along

note: generic implementation - no need for it, since we've reduced the scope of the types of features we're conflating

bwitham commented 8 years ago

For the record, cleaned up about a dozen manual matches in dataset C that seemed erroneous to me. Versioned the new data as v3.

bwitham commented 8 years ago

Changed four records in dataset D where the poi was manually matched to a building way node with address info instead of the building poly containing the node.

Changed one record that didn't seem to be a reasonable manual match.

Versioned the new data as v2.

bwitham commented 8 years ago

I'm going to explore address matching one more time before working more on #1001. I saw a fairly small increase in performance with dataset C by using it. I saw a decrease in performance in dataset D when using it, but looking at the increased wrong count, I strongly disagree with some of the manual matches involved, so I'm going to go through and do some more cleanup.
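A sketch of the kind of address comparison being explored. The OSM addr:* keys are standard, but the normalization and the exact matching rule below are assumptions, not hoot's implementation.

```python
def normalize(s):
    return " ".join(s.lower().replace(".", "").split())

def addresses_match(tags1, tags2):
    keys = ("addr:housenumber", "addr:street")
    a1 = [normalize(tags1.get(k, "")) for k in keys]
    a2 = [normalize(tags2.get(k, "")) for k in keys]
    return all(a1) and all(a2) and a1 == a2  # complete and equal on both sides

poi_tags = {"addr:housenumber": "123", "addr:street": "Main St."}
bldg_tags = {"addr:housenumber": "123", "addr:street": "main st"}
print(addresses_match(poi_tags, bldg_tags))  # True
```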

bwitham commented 8 years ago

So, address matching does seem to end up being significant in dataset D (I need to get the final scores to verify exactly how significant, though). The reasons it hasn't been significant up until now:

bwitham commented 8 years ago

Figured out, as I was working on custom match/review distances, that there are several manual matches I was getting wrong b/c a POI was being matched to an area surrounding a group of buildings rather than the buildings themselves. I wasn't catching it b/c the pre-clean script being used was removing all poly areas (non-buildings). Keeping the areas causes hoot to get those types of matches correct, but does result in an overall drop in scores, so I need to figure out why.

bwitham commented 8 years ago

So, several of the POI to area poly mismatches are valid issues with the conflation. I have some ideas on how to match them better and am working on it now. Some of the wrong matches were due to area-to-area (way-to-way) manual matches, which this matcher does not handle, so I added script logic to remove those matches from the input.
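A sketch of the script logic just described, using hypothetical structures: drop manual matches where both sides are ways, since this matcher only scores POI (node) to polygon (way) pairs.

```python
def strip_way_to_way_matches(matches, elements_by_id):
    kept = []
    for m in matches:
        ref, sec = elements_by_id[m["ref"]], elements_by_id[m["sec"]]
        if ref["type"] == "way" and sec["type"] == "way":
            continue  # area-to-area manual match; can't be scored here
        kept.append(m)
    return kept

elements = {1: {"type": "node"}, 2: {"type": "way"}, 3: {"type": "way"}}
matches = [{"ref": 1, "sec": 2}, {"ref": 2, "sec": 3}]
print(strip_way_to_way_matches(matches, elements))  # keeps only the node-way pair
```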

bwitham commented 8 years ago

Forgot to mention that ongoing commits for this have switched from the 205 branch to the 205-test branch, to allow some stability for anyone using the prototype code off of the 205 branch.

curranMapper commented 8 years ago

As far as creating additional training data goes, I have reached out to DG Tampa and the VGI team for nominations of user datasets to create additional training data with. Will push forward with creating a training set from the best we have available after the schedule discussion with @bwitham @kweint on Monday.