ngageoint / hootenanny

Hootenanny conflates multiple maps into a single seamless map.
GNU General Public License v3.0
352 stars 74 forks source link

Convert changeset derivation to be non-memory bound #1759

Closed bwitham closed 6 years ago

bwitham commented 7 years ago

DeriveChangesetCmd is memory bound both for element sorting and reading/writing. Is it not memory bound for changeset derivation, though. Rather than swap in osmosis for the sorting to solve the problem, like was done with multiary ingest, will attempt to first write a real external memory sorter as part of #2596.

ConflateCmd also has some derive changeset logic that needs include the changes from this task, so will try to tackle that as well.During conflation all data has to be in memory anyway, so we don't gain anything by wiring up ExternalMergeElementSorter to ConflateCmd.

See MultiaryIngester for some streaming examples with changeset derivation.

bwitham commented 6 years ago

@mattjdnv This will be necessary for the the workflows discussed yesterday. Without this, large changesets can't be generated from hoot.

bwitham commented 6 years ago

@mattjdnv @mschicker Disregard what I said earlier today. We don't need this task for our current conflation effort. Instead, we can wire up diff conflation to use the ChangesetProvider interface to stream the changes similar to how its done here:

https://github.com/ngageoint/hootenanny/blob/develop/hoot-rnd/src/main/cpp/hoot/rnd/io/MultiaryIngester.cpp

bwitham commented 6 years ago

I think this has been done in recent work.

bwitham commented 6 years ago

Looking at how the grail resource uses changeset-derive, I think this task still needs to be done.

bwitham commented 6 years ago

@mattjdnv Actually, I should verify with you before I spend time on this. I think 2519 is the branch you are working out of for grail (is that right?). I see a new Java class, DeriveChangesetCommand.java. Is that class being used in the workflow? If so, I should complete this issue...if not, I won't worry about it. thanks.

mattjdnv commented 6 years ago

Yes, 2519 has my changes.

I looked at the existing derive changeset command (here) and ended up taking 90% of it for my command. I should have refactored the existing command so that it was not hard coded for using the OSMAPI_DB_URL as input1. I needed to be able to specify input1 and input2. Both of these are files but they shouldn't need to be.

I used my class in the workflow since I was just trying to get it to work. I think that there are a few commands that could be refactored so they can be used by a number of different workflows. e.g. my deriveChangeset, applyChangeset etc

bwitham commented 6 years ago

Ok, got it. So, at some point your workflow is calling 'hoot changeset-derive'. Right now changeset-derive reads all the data into memory and then spits out the changeset. I'll change it work like the convert command does (for most formats), where it streams features in one at a time and doesn't have a chance to blow out the memory for large datasets...been meaning to do that for awhile for this command... probably will be very similar to how it was done here: https://github.com/ngageoint/hootenanny/blob/master/hoot-rnd/src/main/cpp/hoot/rnd/io/MultiaryIngester.cpp#L342 thanks.

mattjdnv commented 6 years ago

Yes. In the grailResource.java file I have this:

 try {
            // Run changeset-derive
            params.setInput1(HOOTAPI_DB_URL + "/" + input1);
            params.setInput2(HOOTAPI_DB_URL + "/" + input2);

            File changeSet = new File(workDir,"diff.osc");
            if (changeSet.exists()) changeSet.delete();
            params.setOutput(changeSet.getAbsolutePath());
            ExternalCommand makeChangeset = grailCommandFactory.build(mainJobId,params,debugLevel,DeriveChangesetCommand.class,this.getClass());
            workflow.add(makeChangeset);

            // Apply changeset
            params.setPushUrl(RAILSPORT_PUSH_URL);
            ExternalCommand applyChange = grailCommandFactory.build(mainJobId,params,debugLevel,ApplyChangesetCommand.class,this.getClass());
            workflow.add(applyChange);

            // Now roll the dice and run everything.....
            jobProcessor.submitAsync(new Job(mainJobId, workflow.toArray(new Command[workflow.size()])));
        }

and the changeset derive function calls:

        String command = "hoot changeset-derive --${DEBUG_LEVEL} ${HOOT_OPTIONS} ${INPUT1} ${INPUT2} ${OSC_FILE} --stats";
bwitham commented 6 years ago

work for this will be done in the 2596 branch

bwitham commented 6 years ago

Have been doing this at the same time as 2596, so its basically done now. Just fixing the alg in 2596.

bwitham commented 6 years ago

Running into an issue here where when using pbf internally for the external sorting, the node coords sent back are slightly different than what's passed in, probably due to compression. They're still well within the accuracy specs for hoot, but we use a straight file diff during the test changeset comparison rather than a hoot map comparison, so the test fails. May need a changeset comparator in addition to a map comparator...

bwitham commented 6 years ago

Decided that since we're generally creating a changeset to be applied back to an OSM API db (not a hoot api db), that creating the changeset comparator is a waste of time. Everything is working now with changeset-derive....will move onto the conflate command.

Am noticing that results coming back from db partial readers aren't correctly sorted by ID as I expected them to be....going to look into that for a bit, although since they're currently being sorted anyway in changeset-derive, can always tackle the problem later.

bwitham commented 6 years ago

merged