Closed bwitham closed 6 years ago
@mattjdnv This will be necessary for the the workflows discussed yesterday. Without this, large changesets can't be generated from hoot.
@mattjdnv @mschicker Disregard what I said earlier today. We don't need this task for our current conflation effort. Instead, we can wire up diff conflation to use the ChangesetProvider interface to stream the changes similar to how its done here:
I think this has been done in recent work.
Looking at how the grail resource uses changeset-derive, I think this task still needs to be done.
@mattjdnv Actually, I should verify with you before I spend time on this. I think 2519 is the branch you are working out of for grail (is that right?). I see a new Java class, DeriveChangesetCommand.java. Is that class being used in the workflow? If so, I should complete this issue...if not, I won't worry about it. thanks.
Yes, 2519 has my changes.
I looked at the existing derive changeset command (here) and ended up taking 90% of it for my command. I should have refactored the existing command so that it was not hard coded for using the OSMAPI_DB_URL as input1. I needed to be able to specify input1 and input2. Both of these are files but they shouldn't need to be.
I used my class in the workflow since I was just trying to get it to work. I think that there are a few commands that could be refactored so they can be used by a number of different workflows. e.g. my deriveChangeset, applyChangeset etc
Ok, got it. So, at some point your workflow is calling 'hoot changeset-derive'. Right now changeset-derive reads all the data into memory and then spits out the changeset. I'll change it work like the convert command does (for most formats), where it streams features in one at a time and doesn't have a chance to blow out the memory for large datasets...been meaning to do that for awhile for this command... probably will be very similar to how it was done here: https://github.com/ngageoint/hootenanny/blob/master/hoot-rnd/src/main/cpp/hoot/rnd/io/MultiaryIngester.cpp#L342 thanks.
Yes. In the grailResource.java file I have this:
try {
// Run changeset-derive
params.setInput1(HOOTAPI_DB_URL + "/" + input1);
params.setInput2(HOOTAPI_DB_URL + "/" + input2);
File changeSet = new File(workDir,"diff.osc");
if (changeSet.exists()) changeSet.delete();
params.setOutput(changeSet.getAbsolutePath());
ExternalCommand makeChangeset = grailCommandFactory.build(mainJobId,params,debugLevel,DeriveChangesetCommand.class,this.getClass());
workflow.add(makeChangeset);
// Apply changeset
params.setPushUrl(RAILSPORT_PUSH_URL);
ExternalCommand applyChange = grailCommandFactory.build(mainJobId,params,debugLevel,ApplyChangesetCommand.class,this.getClass());
workflow.add(applyChange);
// Now roll the dice and run everything.....
jobProcessor.submitAsync(new Job(mainJobId, workflow.toArray(new Command[workflow.size()])));
}
and the changeset derive function calls:
String command = "hoot changeset-derive --${DEBUG_LEVEL} ${HOOT_OPTIONS} ${INPUT1} ${INPUT2} ${OSC_FILE} --stats";
work for this will be done in the 2596 branch
Have been doing this at the same time as 2596, so its basically done now. Just fixing the alg in 2596.
Running into an issue here where when using pbf internally for the external sorting, the node coords sent back are slightly different than what's passed in, probably due to compression. They're still well within the accuracy specs for hoot, but we use a straight file diff during the test changeset comparison rather than a hoot map comparison, so the test fails. May need a changeset comparator in addition to a map comparator...
Decided that since we're generally creating a changeset to be applied back to an OSM API db (not a hoot api db), that creating the changeset comparator is a waste of time. Everything is working now with changeset-derive....will move onto the conflate command.
Am noticing that results coming back from db partial readers aren't correctly sorted by ID as I expected them to be....going to look into that for a bit, although since they're currently being sorted anyway in changeset-derive, can always tackle the problem later.
merged
DeriveChangesetCmd is memory bound both for element sorting and reading/writing. Is it not memory bound for changeset derivation, though. Rather than swap in osmosis for the sorting to solve the problem, like was done with multiary ingest, will attempt to first write a real external memory sorter as part of #2596.
ConflateCmd also has some derive changeset logic that needs include the changes from this task, so will try to tackle that as well.During conflation all data has to be in memory anyway, so we don't gain anything by wiring up ExternalMergeElementSorter to ConflateCmd.See MultiaryIngester for some streaming examples with changeset derivation.