ngageoint / hootenanny

Hootenanny conflates multiple maps into a single seamless map.
GNU General Public License v3.0

Use mapreduce to cut down on the SQL statement file generation and SQL statement execution time for an OSM API database ingest #1473

Closed bwitham closed 7 years ago

bwitham commented 7 years ago

It currently takes days to generate the SQL file when converting a planet file to an executable SQL file. The planet file has 21.5 billion records in it (yes, billion with a "b") as of 3/24/17. It also takes an unknown amount of time to exec the SQL, but I'm sure it's fairly long.

We should be able to use some of the hadoop code already in hoot to help speed this process up. It goes without saying that all of this assumes an offline database that has been pre-prepped for bulk load and has no database constraints activated. IMPORTANT: this workflow will also have to give up the OSM data validation (missing way nodes, etc.) that the current writer does, so it should only ever be used against data that previously existed within a database and has already been validated (planet data).

bwitham commented 7 years ago

@Cellington1 I might be able to ease your pain on the SQL file generation part by doing it with Hadoop, assuming you can spin up a handful of machines for the ingest. Will see...

mattjdnv commented 7 years ago

I'm interested in the on-demand Hadoop options (for a different project) and I have been thinking about how to spin up multiple hadoop nodes on Dockson as part of a Jenkins job.

bwitham commented 7 years ago

You may want to research Apache Spark as well, which is what MrGeo uses. I've not used Spark personally, but there do appear to be use cases for both, depending on what you're trying to do.

bwitham commented 7 years ago

I wanted to go ahead and do this despite some recent conversations, since it would end up being a huge time saver for the planet load and didn't look too difficult to implement. I have the element sql writing part working.

The problem I'm running into now is how to partition the changesets. The way it needs to be done, I think, is to use a hadoop counter to globally track the number of changesets used and then have each mapper increment the changeset id. Unfortunately hadoop pipes, unlike Java hadoop, will only let you increment a counter and won't let you retrieve its value. I'm going to look into what it would take to implement counter value retrieval in pipes, but that code is kind of hard to follow for me.
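For reference, a minimal sketch of the counter API that Hadoop Pipes exposes to a C++ task (this is illustrative only, not hoot's writer code; the class, counter group, and counter names are made up):

```cpp
// A Pipes task can create and increment a counter, but the TaskContext interface
// has no call for reading a counter's aggregated value back, which is the
// limitation described above.
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"

class ChangesetMapper : public HadoopPipes::Mapper
{
public:
  ChangesetMapper(HadoopPipes::TaskContext& context) : _elementCount(0)
  {
    // Counters can be created and incremented from a task...
    _changesetCounter = context.getCounter("osm-apidb-writer", "changesets-used");
  }

  void map(HadoopPipes::MapContext& context)
  {
    // Pretend each input record is one OSM element; close out a changeset
    // every 50K elements and bump the global counter.
    if (++_elementCount % 50000 == 0)
      context.incrementCounter(_changesetCounter, 1);
    // ...but there is no corresponding "get counter value" call on TaskContext,
    // so a mapper can't ask how many changesets have been used globally so far.
  }

private:
  HadoopPipes::TaskContext::Counter* _changesetCounter;
  long _elementCount;
};

class NoopReducer : public HadoopPipes::Reducer
{
public:
  NoopReducer(HadoopPipes::TaskContext&) {}
  void reduce(HadoopPipes::ReduceContext&) {}
};

int main()
{
  return HadoopPipes::runTask(HadoopPipes::TemplateFactory<ChangesetMapper, NoopReducer>());
}
```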

One kludgy option could be to pick a random changeset ID in a range. Since the planet file goes through ~84K changesets at 50K elements each that number range would have to be huge to avoid ID conflicts. An ID conflict wouldn't be the end of the world...you'd just end up with some changesets larger than 50K elements. But if that happens, what's the point of trying to break things up into multiple changesets in the first place?
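To put a rough number on "huge": a back-of-the-envelope birthday-problem estimate, assuming ~84K changeset ids drawn uniformly at random from a range of N candidate ids, gives

```
P(conflict) ≈ 1 - e^(-k²/(2N)),  with k ≈ 84,000 changesets

keeping P(conflict) under ~1% requires N ≳ k²/(2 · 0.01) ≈ 3.5 × 10^11 candidate ids
```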

That brings me to the only other option if updating pipes doesn't work: just putting everything in one massive changeset. The main problem I can see here is if somewhere in the rails port there is a feature that lets you retrieve things by changeset ID (is there?). But even at 50K elements (they recently updated the limit to 10K, I guess) that would still be prohibitive for some operations, so I don't know that it matters that much.

bwitham commented 7 years ago

The only changeset operation that would really be affected by a single huge changeset would be changeset download. Well, just don't download that one :-). Changeset query by bounds doesn't work right now for the non-mapreduce writer anyway (it could if we kept a global bounds, I think)...so I'm not worried about that either.

bwitham commented 7 years ago

I'm also assuming that changeset max size checks all occur in the rails port and there are no database restrictions on that... I think that's right.

bwitham commented 7 years ago

Have a version working now that writes out an unordered SQL file with a single changeset. Need to verify the SQL by dropping constraints on a db, executing it, and then bringing the constraints back up.

bwitham commented 7 years ago

Have successfully written data to a db using the local job runner with constraints disabled and then brought them back up. Left to do:

will hold off on porting the #1473 mapreduce changes to #1446 until after #1473 has been merged into develop

bwitham commented 7 years ago

Made some progress on the SQL exec part of this today. Did find out that you can't exec SQL COPY statements programmatically as a string from either Qt or psql, since COPY only works from a file input. I don't want to be wasteful and write all the records to a file before writing to the database, so I found a library, libpqxx, which lets you do a streaming SQL COPY programmatically. Will test it out tomorrow, and if it works, then submit it for software approval if it's not already approved.

Right now, the parallelism of this is hindered by the unbalanced nature of records across tables (usually a lot more nodes than relations in a dataset, for instance). Given the planet file, my estimate is that with the current design we may see up to a ~5x speed-up of the SQL file write, but the performance gains will drop off dramatically after that, since the parallelism works at the table level. I have an idea to make the parallelism work at the record level and still be able to write the setval statements (didn't think I could before). I'm going to try to implement that after I get the SQL exec with libpqxx working.
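For anyone unfamiliar with the setval part, a hypothetical illustration of what those statements do: after a bulk load that assigns element ids itself, each id sequence has to be advanced past the highest id that was written so later writes don't collide. The sequence names below assume the stock osm api db schema; the connection string and max ids are placeholders that would really come from the job's output.

```cpp
#include <libpq-fe.h>
#include <cstdio>

int main()
{
  PGconn* conn = PQconnectdb("postgresql://hoot:hoottest@localhost:5432/osmapi_test");
  if (PQstatus(conn) != CONNECTION_OK)
  {
    fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
    return 1;
  }

  // Placeholder max ids; in practice these come from the ids handed out during the load.
  const char* statements[] = {
    "SELECT setval('current_nodes_id_seq', 4000000000);",
    "SELECT setval('current_ways_id_seq', 450000000);",
    "SELECT setval('current_relations_id_seq', 6500000);",
    "SELECT setval('changesets_id_seq', 1);"  // the single changeset case
  };
  for (const char* sql : statements)
  {
    PGresult* res = PQexec(conn, sql);
    if (PQresultStatus(res) != PGRES_TUPLES_OK)  // setval returns a single row
      fprintf(stderr, "'%s' failed: %s", sql, PQerrorMessage(conn));
    PQclear(res);
  }

  PQfinish(conn);
  return 0;
}
```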

Finishing those two things and then verifying performance on a real cluster should close this out. If the hoot cluster spinup part on EC2 is going to be time-consuming to write, then I may shelve that task and finish some more pressing tasks before coming back to it...or see if someone else wants to do it.

bwitham commented 7 years ago

Slight change of plans... just learned that the libpqxx table writer is deprecated, but apparently it uses an underlying libpq function. So, will try to use that instead. I don't think libpq is part of postgres (is it?), in which case it still needs approval.
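For the curious, a minimal standalone sketch of what streaming a SQL COPY through libpq looks like. This is illustrative only, not the hoot implementation; the connection string is the local test db, and the example row assumes the stock current_nodes layout.

```cpp
// Rows are pushed to the server as they're produced instead of being written to
// an intermediate file first.
#include <libpq-fe.h>
#include <cstdio>
#include <cstdlib>
#include <string>

int main()
{
  PGconn* conn = PQconnectdb("postgresql://hoot:hoottest@localhost:5432/osmapi_test");
  if (PQstatus(conn) != CONNECTION_OK)
  {
    fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
    return EXIT_FAILURE;
  }

  // Put the connection into COPY IN mode.
  PGresult* res = PQexec(conn, "COPY current_nodes FROM STDIN");
  if (PQresultStatus(res) != PGRES_COPY_IN)
  {
    fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
    PQclear(res);
    PQfinish(conn);
    return EXIT_FAILURE;
  }
  PQclear(res);

  // Stream rows as they're generated (tab-delimited text COPY format); the
  // column values here are made-up placeholders.
  for (long id = 1; id <= 3; id++)
  {
    std::string row =
      std::to_string(id) + "\t384000000\t-776000000\t1\tt\t2017-03-24 00:00:00\t1704347713\t1\n";
    PQputCopyData(conn, row.c_str(), (int)row.size());
  }

  // Signal end of the COPY and pick up the final result.
  PQputCopyEnd(conn, NULL);
  res = PQgetResult(conn);
  if (PQresultStatus(res) != PGRES_COMMAND_OK)
    fprintf(stderr, "COPY did not complete: %s", PQerrorMessage(conn));
  PQclear(res);

  PQfinish(conn);
  return EXIT_SUCCESS;
}
```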

bwitham commented 7 years ago

libpq is actually part of postgres

bwitham commented 7 years ago

libpq is working

bwitham commented 7 years ago

There are some discrepancies when running the mapreduce SQL exec in pseudo-distributed mode...will look into that tomorrow.

bwitham commented 7 years ago

parallelism rework finished

drew-bower commented 7 years ago

Faster?

bwitham commented 7 years ago

Not sure yet, but it should be. I'm going to do a quick test today with the two local machines I have here at home just to make sure it's trending faster.

Going to save testing it on EC2 until after I work on some more important stuff that needs to be done now...poi/poly and holy grail related tasks...will get to verifying it later this month, or someone else can try it out in the meantime if they want. The end result should be bringing the planet ingest down from several days to a few hours, hopefully.

bwitham commented 7 years ago

@Cellington1 Wrapping this up for now. It's been tested in a pseudo-distributed cluster only so far. I opened #1502 to deal with testing it on an actual cluster.

It should be much more performant than the current ingest process, because all the SQL writes are streamed to the database as elements are read in from the pbf in the mapreduce job, rather than writing out an entire SQL file in a single process and then executing it against the database from one machine like we do now. I also added code to auto-disable/enable the database constraints before/after the job. Possibly, we could also add some of the other optimizations that are in those other SQL scripts Terry wrote (index related, other stuff?).
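As a rough illustration of the kind of constraint toggling meant here (not the exact statements hoot runs; the connection string is the local test db and the table list is truncated to a couple of examples):

```cpp
// DISABLE TRIGGER ALL turns off foreign key enforcement on a table (superuser
// required); it gets re-enabled after the bulk load finishes.
#include <libpq-fe.h>
#include <cstdio>

static void execOrWarn(PGconn* conn, const char* sql)
{
  PGresult* res = PQexec(conn, sql);
  if (PQresultStatus(res) != PGRES_COMMAND_OK)
    fprintf(stderr, "'%s' failed: %s", sql, PQerrorMessage(conn));
  PQclear(res);
}

int main()
{
  PGconn* conn = PQconnectdb("postgresql://hoot:hoottest@localhost:5432/osmapi_test");
  if (PQstatus(conn) != CONNECTION_OK)
  {
    fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
    return 1;
  }

  // Before the job: drop constraint/trigger enforcement for the bulk load.
  execOrWarn(conn, "ALTER TABLE current_way_nodes DISABLE TRIGGER ALL;");
  execOrWarn(conn, "ALTER TABLE current_relation_members DISABLE TRIGGER ALL;");

  // ... the mapreduce job streams its COPYs here ...

  // After the job: bring enforcement back up.
  execOrWarn(conn, "ALTER TABLE current_way_nodes ENABLE TRIGGER ALL;");
  execOrWarn(conn, "ALTER TABLE current_relation_members ENABLE TRIGGER ALL;");

  PQfinish(conn);
  return 0;
}
```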

From what I can tell, spinning up a hoot cluster on EC2 isn't going to be as straightforward as doing it on normal EMR, due to the fact that hoot runs on hadoop pipes, but I still think it's doable given enough effort (wish we still had our old biggeo cluster to test on right now). See my comments so far in #1502.

I'm going to redirect to work on some other things I need to get done but may come back to #1502 at some point in the future.

mschicker commented 7 years ago

Merged #1504. Nicely done!

bwitham commented 7 years ago

To run:

'hoot big-convert input.osm.pbf output.sql'

OR

'hoot big-convert input.osm.pbf osmapidb://hoot:hoottest@localhost:5432/osmapi_test'

Type 'hoot big-convert' for more details.