openstreetmap / osmosis

Osmosis is a command line Java application for processing OSM data.
http://wiki.openstreetmap.org/wiki/Osmosis
679 stars 207 forks source link

org.openstreetmap.osmosis.apidb.v0_6.ApidbWriter issues SQL commit even when not needed #82

Open cvonsee opened 4 years ago

cvonsee commented 4 years ago

It appears that the "flushxxxxx()" methods (flushNodes(), flushNodeTags(), flushWays(), etc.) all issue a SQL commit() at the end of the method, even when the logic to actually do SQL inserts of data is not executed.

For example, in flushNodes() the while (nodeBuffer.size() >= INSERT_BULK_ROW_COUNT_NODE) loop (starting around line 586) that inserts data into the DB is only executed every 100 times the method is called (100 being the value of INSERT_BULK_ROW_COUNT_NODE), and yet the dbctx.commit() call is always executed at the end of the method. In addition, the 'complete' flag is false far more often than not and so the cleanup logic in that block is not executed and a dbctx.commit() call is not needed.

In addition, there is a commit near line 607 inside the while (nodeBuffer.size() ... loop that is executed every 100K records (with 100,000 being the value of TRANSACTION_SIZE:

if (transactionSizeCount % TRANSACTION_SIZE == 0) {
    dbCtx.commit();
}

dbCtx.commit() simply checks to see if a DB connection is present (which appears to be always the case) and then issues the SQL commit() call.

Redundant dbctx.commit() calls are also made when sub-functions such as addNodeTags() (which calls flushNodeTags()) are called, such as near line 616. Committing of node tag inserts (or way tag inserts, or whatever) can be done in the same commit that handles the node or way as long as it's done on the same database connection.

I would propose that we either conditionally execute dbctx.commit() only if inserts have actually been executed, or just move dbctx.commit() calls to just outside the logic that actually does inserts. When the DB connection is created 'autoCommit' is set to 'false', so since we're doing batch inserts it's conceivably an option to just turn autocommit on and let the database handle this.

Updating these methods to restructure the logic and only issue dbctx.commit() requests when actually needed would likely speed up database insert processing by a non-trivial amount.

brettch commented 3 years ago

Hmm, the logic does seem flawed. I think this was some of the first code written before I even decided to create Osmosis. I just wanted something faster than the existing Perl script that existed at the time. As a result it's pretty hacky, a class with over 1300 lines 😱

The most efficient approach would be to do a single commit at the end but for very large imports this will fail (I think, it's been a while). So there is a need to do the occasional interim commit.

Setting autoCommit to true will cause the database to do a commit after every single insert which is also very inefficient (okay, perhaps it's doing this anyway but it shouldn't be).

It looks to me like the dtCtx.commit() call at the end of each flushXXX method is in the wrong place. I think it should be inside the if(complete) clause instead of at the end of the method. If that was the case a commit would only occur after every 100,000 records as originally intended instead of after processing every single entity.

brettch commented 3 years ago

If you really want to improve the performance of this class the data should be inserted using the COPY command instead of individual inserts. The performance improvement is dramatic. It only works for PostgreSQL but I'd be surprised if anybody uses MySQL any more (or if it even still works).

See https://github.com/openstreetmap/osmosis/blob/master/osmosis-pgsnapshot/src/main/java/org/openstreetmap/osmosis/pgsnapshot/v0_6/PostgreSqlCopyWriter.java for an example of how this is done.

cvonsee commented 3 years ago

Moving to commits at 100K record intervals is fairly easy - and would not require the two-pass approach that COPY would - so that might be a good first step pending a complete restructuring of the code. If memory is not an issue then you could even go to a larger interval (500K recs?) between commits, but that would probably require specifying a max heap size at execution time.

si-the-pie commented 3 years ago

At CycleStreets we do still use MySQL with Osmosis (it feels a bit lonely).

Towards the end of April 2021, regular imports into MySQL 5.7 stopped working. The error was: Unable to establish a database connection.. The cause of that problem remains undiagnosed, but must have been due to some routine package upgrade.

However imports into MySQL 8.0 were working on some small test servers, and so this triggered a switch to doing the main builds with the new version.

They are now working again, but running very slowly compared to when they were working in 5.7. The key problem is that there is no concurrency during the --read-pbf tasks despite using --merge --buffer --write-apidb. There are lots of threads created according to htop but they are all held up by one thread that does 97% of the processing.

Here's the osmosis command:

/usr/local/bin/osmosis
 --read-pbf europe/austria-latest.osm.pbf  --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath
 --read-pbf europe/belgium-latest.osm.pbf  --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath --merge
 --read-pbf europe/croatia-latest.osm.pbf  --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath --merge
 --read-pbf europe/czech-republic-latest.osm.pbf  --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath --merge
 --read-pbf europe/denmark-latest.osm.pbf  --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath --merge
 --read-pbf europe/estonia-latest.osm.pbf  --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath --merge
 --read-pbf europe/france-latest.osm.pbf  --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath --merge
 --read-pbf europe/germany-latest.osm.pbf  --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath --merge
 --read-pbf africa/ghana-latest.osm.pbf  --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath --merge
 --read-pbf europe/great-britain-latest.osm.pbf  --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath --bounding-polygon file="europe/great-britain.cs.poly" --merge
... (about a dozen more countries)
 --merge  --buffer --write-apidb dbType=mysql populateCurrentTables=no host=localhost database= user= password= validateSchemaVersion=no
si-the-pie commented 3 years ago

The changes introduced by the merging of https://github.com/openstreetmap/osmosis/pull/44 have a disastrous effect on performance, at least in MySQL.

That is because they produce over 100 COMMIT statements per INSERT.

Undoing the changes from https://github.com/openstreetmap/osmosis/commit/998e6fded00c2c5f78c98ae9468b3599ee57d745 restores fast performance as I have verified in a custom Osmosis build.

These other changes also help reduce warnings when running the custom osmosis, but are not strictly necessary for it to work:

EXCEPTION STACK TRACE:

BEGIN NESTED EXCEPTION

javax.net.ssl.SSLException MESSAGE: closing inbound before receiving peer's close_notify

STACKTRACE:

javax.net.ssl.SSLException: closing inbound before receiving peer's close_notify at java.base/sun.security.ssl.SSLSocketImpl.shutdownInput(SSLSocketImpl.java:751) at java.base/sun.security.ssl.SSLSocketImpl.shutdownInput(SSLSocketImpl.java:730) at com.mysql.cj.protocol.a.NativeProtocol.quit(NativeProtocol.java:1313) at com.mysql.cj.NativeSession.quit(NativeSession.java:182) at com.mysql.cj.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:1743) at com.mysql.cj.jdbc.ConnectionImpl.close(ConnectionImpl.java:717) at org.openstreetmap.osmosis.apidb.common.DatabaseContext.close(DatabaseContext.java:478) at org.openstreetmap.osmosis.apidb.v0_6.ApidbWriter.close(ApidbWriter.java:1158) at org.openstreetmap.osmosis.core.buffer.v0_6.EntityBuffer.run(EntityBuffer.java:91) at java.base/java.lang.Thread.run(Thread.java:829)

END NESTED EXCEPTION