Open cvonsee opened 4 years ago
Hmm, the logic does seem flawed. I think this was some of the first code written before I even decided to create Osmosis. I just wanted something faster than the existing Perl script that existed at the time. As a result it's pretty hacky, a class with over 1300 lines 😱
The most efficient approach would be to do a single commit at the end but for very large imports this will fail (I think, it's been a while). So there is a need to do the occasional interim commit.
Setting autoCommit to true will cause the database to do a commit after every single insert which is also very inefficient (okay, perhaps it's doing this anyway but it shouldn't be).
It looks to me like the dtCtx.commit()
call at the end of each flushXXX
method is in the wrong place. I think it should be inside the if(complete)
clause instead of at the end of the method. If that was the case a commit would only occur after every 100,000 records as originally intended instead of after processing every single entity.
If you really want to improve the performance of this class the data should be inserted using the COPY command instead of individual inserts. The performance improvement is dramatic. It only works for PostgreSQL but I'd be surprised if anybody uses MySQL any more (or if it even still works).
See https://github.com/openstreetmap/osmosis/blob/master/osmosis-pgsnapshot/src/main/java/org/openstreetmap/osmosis/pgsnapshot/v0_6/PostgreSqlCopyWriter.java for an example of how this is done.
Moving to commits at 100K record intervals is fairly easy - and would not require the two-pass approach that COPY would - so that might be a good first step pending a complete restructuring of the code. If memory is not an issue then you could even go to a larger interval (500K recs?) between commits, but that would probably require specifying a max heap size at execution time.
At CycleStreets we do still use MySQL with Osmosis (it feels a bit lonely).
Towards the end of April 2021, regular imports into MySQL 5.7 stopped working. The error was: Unable to establish a database connection.
. The cause of that problem remains undiagnosed, but must have been due to some routine package upgrade.
However imports into MySQL 8.0 were working on some small test servers, and so this triggered a switch to doing the main builds with the new version.
They are now working again, but running very slowly compared to when they were working in 5.7. The key problem is that there is no concurrency during the --read-pbf
tasks despite using --merge --buffer --write-apidb
.
There are lots of threads created according to htop
but they are all held up by one thread that does 97% of the processing.
Here's the osmosis command:
/usr/local/bin/osmosis
--read-pbf europe/austria-latest.osm.pbf --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath
--read-pbf europe/belgium-latest.osm.pbf --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath --merge
--read-pbf europe/croatia-latest.osm.pbf --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath --merge
--read-pbf europe/czech-republic-latest.osm.pbf --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath --merge
--read-pbf europe/denmark-latest.osm.pbf --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath --merge
--read-pbf europe/estonia-latest.osm.pbf --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath --merge
--read-pbf europe/france-latest.osm.pbf --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath --merge
--read-pbf europe/germany-latest.osm.pbf --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath --merge
--read-pbf africa/ghana-latest.osm.pbf --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath --merge
--read-pbf europe/great-britain-latest.osm.pbf --tag-filter accept-ways highway=* amenity=parking,bicycle_parking access=* aerialway=* route=ferry man_made=pier,jetty leisure=* landuse=* --tag-filter reject-ways highway=motorway,bus_guideway,construction bicycle=use_sidepath --bounding-polygon file="europe/great-britain.cs.poly" --merge
... (about a dozen more countries)
--merge --buffer --write-apidb dbType=mysql populateCurrentTables=no host=localhost database= user= password= validateSchemaVersion=no
The changes introduced by the merging of https://github.com/openstreetmap/osmosis/pull/44 have a disastrous effect on performance, at least in MySQL.
That is because they produce over 100 COMMIT
statements per INSERT
.
Undoing the changes from https://github.com/openstreetmap/osmosis/commit/998e6fded00c2c5f78c98ae9468b3599ee57d745 restores fast performance as I have verified in a custom Osmosis build.
These other changes also help reduce warnings when running the custom osmosis, but are not strictly necessary for it to work:
Loading class 'com.mysql.jdbc.Driver'. This is deprecated. The new driver class is 'com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
+ "&useSSL=false"
to the url that connects to 'jdbc:mysql://' at https://github.com/openstreetmap/osmosis/blob/4b1af52a5d9a5eca76e0550b464ea4ce11260e71/osmosis-apidb/src/main/java/org/openstreetmap/osmosis/apidb/common/DatabaseContext.java#L129
This gets rid of the warning: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
and this exception:
EXCEPTION STACK TRACE:
BEGIN NESTED EXCEPTION
javax.net.ssl.SSLException MESSAGE: closing inbound before receiving peer's close_notify
STACKTRACE:
javax.net.ssl.SSLException: closing inbound before receiving peer's close_notify at java.base/sun.security.ssl.SSLSocketImpl.shutdownInput(SSLSocketImpl.java:751) at java.base/sun.security.ssl.SSLSocketImpl.shutdownInput(SSLSocketImpl.java:730) at com.mysql.cj.protocol.a.NativeProtocol.quit(NativeProtocol.java:1313) at com.mysql.cj.NativeSession.quit(NativeSession.java:182) at com.mysql.cj.jdbc.ConnectionImpl.realClose(ConnectionImpl.java:1743) at com.mysql.cj.jdbc.ConnectionImpl.close(ConnectionImpl.java:717) at org.openstreetmap.osmosis.apidb.common.DatabaseContext.close(DatabaseContext.java:478) at org.openstreetmap.osmosis.apidb.v0_6.ApidbWriter.close(ApidbWriter.java:1158) at org.openstreetmap.osmosis.core.buffer.v0_6.EntityBuffer.run(EntityBuffer.java:91) at java.base/java.lang.Thread.run(Thread.java:829)
END NESTED EXCEPTION
It appears that the "flushxxxxx()" methods (
flushNodes()
,flushNodeTags()
,flushWays()
, etc.) all issue a SQLcommit()
at the end of the method, even when the logic to actually do SQL inserts of data is not executed.For example, in flushNodes() the
while (nodeBuffer.size() >= INSERT_BULK_ROW_COUNT_NODE)
loop (starting around line 586) that inserts data into the DB is only executed every 100 times the method is called (100 being the value ofINSERT_BULK_ROW_COUNT_NODE
), and yet thedbctx.commit()
call is always executed at the end of the method. In addition, the 'complete' flag is false far more often than not and so the cleanup logic in that block is not executed and adbctx.commit()
call is not needed.In addition, there is a commit near line 607 inside the
while (nodeBuffer.size() ...
loop that is executed every 100K records (with 100,000 being the value ofTRANSACTION_SIZE
:dbCtx.commit()
simply checks to see if a DB connection is present (which appears to be always the case) and then issues the SQL commit() call.Redundant
dbctx.commit()
calls are also made when sub-functions such asaddNodeTags()
(which callsflushNodeTags()
) are called, such as near line 616. Committing of node tag inserts (or way tag inserts, or whatever) can be done in the same commit that handles the node or way as long as it's done on the same database connection.I would propose that we either conditionally execute
dbctx.commit()
only if inserts have actually been executed, or just movedbctx.commit()
calls to just outside the logic that actually does inserts. When the DB connection is created 'autoCommit' is set to 'false', so since we're doing batch inserts it's conceivably an option to just turn autocommit on and let the database handle this.Updating these methods to restructure the logic and only issue
dbctx.commit()
requests when actually needed would likely speed up database insert processing by a non-trivial amount.