openstreetmap / osmdbt

OSM Database Replication Tools
GNU General Public License v3.0

Use far fewer SQL queries #28

Closed joto closed 4 years ago

joto commented 4 years ago

See #24.

@mmd-osm Want to review this PR?

I thought about doing only n queries at a time, but since this is for minutely diffs there shouldn't be too many things to query anyway. But maybe we have to do this to be on the safe side?

mmd-osm commented 4 years ago

I did a quick retest of my previous performance test case with 0 nodes, 0 ways and 500k relations, and got the following error:

```
[ 0:01] Got 0 nodes, 0 ways, 500000 relations.
[ 0:01] Populating changeset cache...
[ 0:01] Got 1 changesets.
[ 0:01] Opening output file '/tmp/new-change.osc.gz'...
[ 0:01] Processing 0 nodes, 0 ways, 500000 relations...
ERROR: Syntax error at or near »)«
LINE 1: WITH wanted(id, version) AS (VALUES) SELECT w.id, w.version,...
```

I think all process_{nodes,ways,relations} methods need an extra check for objs being empty. Also, for some reason the assert didn't work here; it turned out to be a no-op.
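To illustrate the failure mode: the batched lookup is built as `WITH wanted(id, version) AS (VALUES ...)`, and an empty batch produces `VALUES)` with no tuples, which is invalid SQL. A minimal sketch of the guard (the `osm_obj` struct, `build_query` name, and the `SELECT` tail are hypothetical, not osmdbt's actual code):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical id/version pair, standing in for what osmdbt batches per type.
struct osm_obj {
    std::uint64_t id;
    std::uint32_t version;
};

// Build one batched query for many objects. An empty batch must be handled
// explicitly: "AS (VALUES)" with zero tuples is the syntax error shown above.
std::string build_query(const std::vector<osm_obj>& objs) {
    if (objs.empty()) {
        return {};  // caller skips the database round-trip entirely
    }
    std::string sql = "WITH wanted(id, version) AS (VALUES ";
    for (std::size_t i = 0; i < objs.size(); ++i) {
        if (i > 0) {
            sql += ',';
        }
        sql += '(' + std::to_string(objs[i].id) + ','
                   + std::to_string(objs[i].version) + ')';
    }
    sql += ") SELECT w.id, w.version FROM ways w JOIN wanted USING (id, version)";
    return sql;
}
```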

> But maybe we have to do this to be on the safe side?

Originally, I had some sort of tunable parameter in mind, which could be set according to performance measurements (minimum runtime while keeping reasonable overall memory consumption). Given that we're looking at roughly 500k object id/version pairs in the worst case, we could check whether a single batch still works fine, or whether there's any reason to implement additional logic that uses a smaller batch size.
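If a smaller batch size ever turns out to be necessary, the tunable-parameter idea could look roughly like this sketch (the `for_each_chunk` helper and its signature are assumptions for illustration, not part of osmdbt; the batch size would be the measured tuning knob):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Invoke fn once per chunk of at most max_batch items, i.e. one SQL query
// per chunk instead of one giant VALUES list for everything at once.
template <typename T, typename Func>
void for_each_chunk(const std::vector<T>& items, std::size_t max_batch, Func&& fn) {
    for (std::size_t begin = 0; begin < items.size(); begin += max_batch) {
        const std::size_t end = std::min(begin + max_batch, items.size());
        fn(items.data() + begin, end - begin);  // pointer + length of this chunk
    }
}
```

With ~500k pairs in the worst case, the trade-off is a handful of queries versus peak memory for the query string; measurements like the ones below would decide whether the extra logic is worth it.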

joto commented 4 years ago

Okay, I fixed the problem you had. The assert works for me. If it didn't for you, you most likely didn't compile in debug mode.
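For context on the "no-op" assert: `assert()` expands to nothing when `NDEBUG` is defined, which release builds typically set (e.g. CMake's `Release` build type adds `-DNDEBUG`). A tiny self-contained illustration (the `checked_size` function is made up for this example):

```cpp
#include <cassert>

// In a release build (NDEBUG defined) the assert below is compiled out, so a
// failing condition is silently ignored -- the "no op" behavior seen earlier.
int checked_size(int n) {
    assert(n > 0 && "batch must not be empty");  // active only in debug builds
    return n;
}
```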

mmd-osm commented 4 years ago

With your fix in place, I'm able to run my 500k relations test case again. This time it took 11s in total. Another test run with 100k relations took about 2s, and another test with 250k relations took 9s.

In general, this is the major improvement I was hoping to see. We should keep an eye on those runtimes in a more production-like environment when processing log files of different sizes. At some point the runtimes seem to degrade a bit for larger files, but that may be entirely related to my local setup.

Besides performance, it probably makes sense to spend a bit more time on #19 now, with a number of different use cases and test files. I don't see anything obvious in the code at the moment; nevertheless, more functional testing is needed, as that is how I typically find more bugs.

```
[ 0:00] Got 0 nodes, 0 ways, 500000 relations.
[ 0:00] Populating changeset cache...
[ 0:00] Got 1 changesets.
[ 0:00] Opening output file '/tmp/new-change.osc.gz'...
[ 0:00] Processing 0 nodes, 0 ways, 500000 relations...
[ 0:11] Wrote and synced output file.
[ 0:11] Writing state file '/tmp/new-state.txt'...
[ 0:11] Wrote and synced state file.
[ 0:11] Current memory used: 339 MBytes
[ 0:11] Peak memory used: 461 MBytes
[ 0:11] Creating directories...
[ 0:11] Moving files into their final locations...
[ 0:11] Renaming log file 'osm-repl-2020-08-04T19:28:50Z-lsn-6D-5882B450.log'...
[ 0:11] All done.
[ 0:11] Done.
```

joto commented 4 years ago

I have squashed the commits and merged. Closing here.