
Possible memory leak with 1.5.0 #1539

Closed by verfriemelt-dot-org 1 year ago

verfriemelt-dot-org commented 3 years ago

It seems there is some sort of memory leak with the current release, installed from the Debian experimental repository.

$ osm2pgsql --log-level=debug --hstore-all --slim --drop --cache=16000 -G --number-processes=4 --flat-nodes=tmp -H 192.168.2.103 -U osm -d osm planet-latest.osm.pbf
2021-07-16 21:05:06  osm2pgsql version 1.5.0
2021-07-16 21:05:06  [0] Database version: 13.3 (Debian 13.3-1)
2021-07-16 21:05:06  [0] PostGIS version: 3.1
2021-07-16 21:05:06  [0] Reading file: planet-latest.osm.pbf
2021-07-16 21:05:06  [0] Started pool with 4 threads.
2021-07-16 21:05:06  [0] Loading persistent node cache from 'tmp'.
2021-07-16 21:05:06  [0] Mid: pgsql, cache=16000
2021-07-16 21:05:06  [0] Setting up table 'planet_osm_nodes'
2021-07-16 21:05:06  [0] Setting up table 'planet_osm_ways'
2021-07-16 21:05:07  [0] Setting up table 'planet_osm_rels'
2021-07-16 21:05:08  [0] Using projection SRS 3857 (Spherical Mercator)
2021-07-16 21:05:08  [0] Using built-in tag transformations
2021-07-16 21:05:08  [0] Setting up table 'planet_osm_point'
2021-07-16 21:05:10  [0] Setting up table 'planet_osm_line'
2021-07-16 21:05:11  [0] Setting up table 'planet_osm_polygon'
2021-07-16 21:05:11  [0] Setting up table 'planet_osm_roads'
Processing: Node(2111400k 5571.0k/s) Way(0k 0.00k/s) Relation(0 0.0/s)[1]    181885 killed

The system has 32 GB of memory and I was able to observe the crash with htop. The process honored the memory limit set by the cache parameter until it reached the last node seen in the output; then it stopped updating and processing the file, but its memory use kept growing until system memory was full and it was killed by the OOM killer.
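
If it helps with diagnosis, the kernel's OOM-killer report can usually be confirmed after the fact, roughly like this (the exact wording of the log message varies by kernel version):

$ sudo dmesg | grep -i -B1 -A5 'killed process'
$ journalctl -k | grep -i 'out of memory'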

I restarted the process with osm2pgsql 1.4.1 and it seems to run just fine beyond this point.

The planet file in question (from https://planet.osm.org/pbf/planet-latest.osm.pbf) is this:

$ osmium fileinfo -e planet-latest.osm.pbf
File:
  Name: planet-latest.osm.pbf
  Format: PBF
  Compression: none
  Size: 62962428417
Header:
  Bounding boxes:
    (-180,-90,180,90)
  With history: no
  Options:
    generator=planet-dump-ng 1.2.0
    osmosis_replication_timestamp=2021-06-27T23:59:58Z
    pbf_dense_nodes=true
    pbf_optional_feature_0=Has_Metadata
    pbf_optional_feature_1=Sort.Type_then_ID
    sorting=Type_then_ID
    timestamp=2021-06-27T23:59:58Z
[======================================================================] 100%
Data:
  Bounding box: (-180,-90,180,90)
  Timestamps:
    First: 2005-05-03T13:27:18Z
    Last: 2021-06-27T23:59:58Z
  Objects ordered (by type and id): yes
  Multiple versions of same object: no
  CRC32: not calculated (use --crc/-c to enable)
  Number of changesets: 0
  Number of nodes: 7041839672
  Number of ways: 782228827
  Number of relations: 9054640
  Smallest changeset ID: 0
  Smallest node ID: 1
  Smallest way ID: 37
  Smallest relation ID: 11
  Largest changeset ID: 0
  Largest node ID: 8870746351
  Largest way ID: 958713294
  Largest relation ID: 12893861
  Number of buffers: 10680839 (avg 733 objects per buffer)
  Sum of buffer sizes: 698105454040 (698.105 GB)
  Sum of buffer capacities: 701029613568 (701.029 GB, 100% full)
Metadata:
  All objects have following metadata attributes: version+timestamp+changeset
  Some objects have following metadata attributes: all
joto commented 3 years ago

The --cache 16000 doesn't mean it is supposed to only use 16GB of RAM, it just means that one part of osm2pgsql (the cache) will use a maximum of 16GB. There are other parts that need RAM, so it is totally normal that you get beyond the 16GB. And yes, this is a little bit confusing. I am not sure why this changed in 1.5.0; that's something I will need to look into.

When using the flat node store, we recommend setting the cache to 0; the cache doesn't make much sense in that case. With those 16 GB freed up, there is a chance this will run through, though importing the whole planet with anything less than 64 GB of RAM is going to be a stretch and will take a long time.
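
Concretely, that would be your invocation from above with only the cache value changed (just a sketch; connection settings and paths as in your original command):

$ osm2pgsql --log-level=debug --hstore-all --slim --drop --cache=0 -G \
    --number-processes=4 --flat-nodes=tmp -H 192.168.2.103 -U osm -d osm \
    planet-latest.osm.pbf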

verfriemelt-dot-org commented 3 years ago

The --cache 16000 doesn't mean it is supposed to only use 16GB of RAM, it just means that one part of osm2pgsql (the cache) will use a maximum of 16GB. There are other parts that need RAM, so it is totally normal that you get beyond the 16GB

Yes, I know; it crashed with approximately 30 GB used. I need to check whether this will finish in a reasonable time for my home project.

I could try starting this with the cache set to 0 or without a flat-nodes file, if that would help pinpoint the issue. Or let me know how else I can help debug this further.

joto commented 3 years ago

I tried this with the two versions 1.4.1 and 1.5.0 on an older planet and on the planet you are using (to see whether this is related to some recent change in the planet) and I can see no real difference between those cases. I tried this on a 128 GB machine, though. Eventually osm2pgsql will use more than 64 GB of RAM in all those cases. So you'll need a lot of swap anyway, and it will then probably become too slow to be usable.

There are changes between versions 1.4.1 and 1.5.0 that affect memory use. And it might well be that those changes just push you beyond what your machine can handle, but I don't see a memory leak or something like that.

So for your case: try with --cache=0, keep the flat-nodes file (it will be too slow otherwise), and make sure you have at least 64 GB of swap. I have no idea how long it will take, though.
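
For reference, a 64 GB swap file can be added on most Linux systems roughly like this (/swapfile is just an example path):

$ sudo fallocate -l 64G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile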

verfriemelt-dot-org commented 3 years ago

It's currently creating the indexes for the tables; it's not that slow overall :)

21:14:28 » easteregg@t480 ~/osm-import $ osm2pgsql --log-level=debug --hstore-all --slim --drop --cache=20000 -G --number-processes=4 --flat-nodes=tmp -H 192.168.2.103 -U osm -d osm planet-latest.osm.pbf
2021-07-16 21:14:29  osm2pgsql version 1.4.1
2021-07-16 21:14:29  Database version: 13.3 (Debian 13.3-1)
2021-07-16 21:14:29  PostGIS version: 3.1
2021-07-16 21:14:29  Reading file: planet-latest.osm.pbf
2021-07-16 21:14:29  Allocating memory for dense node cache
2021-07-16 21:14:29  Allocating dense node cache in one big chunk
2021-07-16 21:14:29  Allocating memory for sparse node cache
2021-07-16 21:14:29  Sharing dense sparse
2021-07-16 21:14:29  Node-cache: cache=20000MB, maxblocks=320000*65536, allocation method=11
2021-07-16 21:14:29  Mid: loading persistent node cache from tmp
2021-07-16 21:14:29  Mid: pgsql, cache=20000
2021-07-16 21:14:29  Setting up table 'planet_osm_nodes'
2021-07-16 21:14:29  Setting up table 'planet_osm_ways'
2021-07-16 21:14:30  Setting up table 'planet_osm_rels'
2021-07-16 21:14:30  Using projection SRS 3857 (Spherical Mercator)
2021-07-16 21:14:30  Using built-in tag transformations
2021-07-16 21:14:30  Setting up table 'planet_osm_point'
2021-07-16 21:14:30  Setting up table 'planet_osm_line'
2021-07-16 21:14:31  Setting up table 'planet_osm_polygon'
2021-07-16 21:14:31  Setting up table 'planet_osm_roads'
2021-07-17 18:36:59  Reading input files done in 76948s (21h 22m 28s).
2021-07-17 18:36:59    Processed 7041839672 nodes in 1231s (20m 31s) - 5720k/s
2021-07-17 18:36:59    Processed 782228827 ways in 52584s (14h 36m 24s) - 15k/s
2021-07-17 18:36:59    Processed 9054640 relations in 23133s (6h 25m 33s) - 391/s
2021-07-17 18:36:59  Starting pool with 4 threads.
2021-07-17 18:36:59  Dropping table 'planet_osm_nodes'
2021-07-17 18:36:59  Done postprocessing on table 'planet_osm_nodes' in 0s
2021-07-17 18:36:59  Dropping table 'planet_osm_ways'
2021-07-17 18:37:02  Done postprocessing on table 'planet_osm_ways' in 3s
2021-07-17 18:37:02  Dropping table 'planet_osm_rels'
2021-07-17 18:37:02  Done postprocessing on table 'planet_osm_rels' in 0s
2021-07-17 18:37:02  Clustering table 'planet_osm_point' by geometry...
2021-07-17 18:37:02  Clustering table 'planet_osm_line' by geometry...
2021-07-17 18:37:02  Clustering table 'planet_osm_roads' by geometry...
2021-07-17 18:37:02  Clustering table 'planet_osm_polygon' by geometry...
2021-07-17 18:37:02  Using native order for clustering table 'planet_osm_line'
2021-07-17 18:37:02  Using native order for clustering table 'planet_osm_point'
2021-07-17 18:37:02  Using native order for clustering table 'planet_osm_polygon'
2021-07-17 18:37:02  Using native order for clustering table 'planet_osm_roads'
2021-07-17 18:52:50  Creating geometry index on table 'planet_osm_roads'...
2021-07-17 18:58:21  Analyzing table 'planet_osm_roads'...
2021-07-17 18:58:35  All postprocessing on table 'planet_osm_roads' done in 1293s (21m 33s).
2021-07-17 19:18:06  Creating geometry index on table 'planet_osm_point'...
2021-07-17 19:57:31  Analyzing table 'planet_osm_point'...
2021-07-17 19:57:47  All postprocessing on table 'planet_osm_point' done in 4845s (1h 20m 45s).

When this is done, I will retry with different options to pin down which one causes the issue.

verfriemelt-dot-org commented 3 years ago

OK, it is done now, but there must be an error in the memory reporting:

2021-07-18 00:25:08  node cache: stored: 2423544246(34.42%), storage efficiency: 92.45% (dense blocks: 292219, sparse nodes: 113793373), hit rate: 33.45%
2021-07-18 00:25:09  Overall memory usage: peak=90362MByte current=1751MByte
2021-07-18 00:25:09  osm2pgsql took 97840s (27h 10m 40s) overall.

This system has only 32 GB and no swap at all. How could it possibly have used 90 GB at peak?

joto commented 3 years ago

Those values report the virtual memory used, not the actual memory used. The flat-nodes file contributes about 65 GB to that value; with the 32 GB of RAM fully used, a value of about 90 GB sounds about right. On modern systems "memory" is a rather complex topic...
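
For reference, the two numbers can be compared while an import is running, for example like this (VmSize is the virtual address space, which includes file-backed mappings such as the flat-nodes file; VmRSS is the RAM actually resident):

$ grep -E 'VmSize|VmRSS' /proc/$(pgrep -n osm2pgsql)/status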

verfriemelt-dot-org commented 3 years ago

With the cache set to 0 as you suggested, it runs just fine...

osm2pgsql --log-level=debug --hstore-all --slim --drop --cache=0 -G --number-processes=4 --flat-nodes=tmp -H 192.168.2.103 -U osm -d osm2 planet-latest.osm.pbf
2021-07-18 10:50:56  osm2pgsql version 1.5.0
2021-07-18 10:50:57  [0] Database version: 13.3 (Debian 13.3-1)
2021-07-18 10:50:57  [0] PostGIS version: 3.1
2021-07-18 10:50:57  [0] Reading file: planet-latest.osm.pbf
2021-07-18 10:50:57  [0] Started pool with 4 threads.
2021-07-18 10:50:57  [0] Loading persistent node cache from 'tmp'.
2021-07-18 10:50:57  [0] Mid: pgsql, cache=0
2021-07-18 10:50:57  [0] Setting up table 'planet_osm_nodes'
2021-07-18 10:50:57  [0] Setting up table 'planet_osm_ways'
2021-07-18 10:50:57  [0] Setting up table 'planet_osm_rels'
2021-07-18 10:50:57  [0] Using projection SRS 3857 (Spherical Mercator)
2021-07-18 10:50:57  [0] Using built-in tag transformations
2021-07-18 10:50:57  [0] Setting up table 'planet_osm_point'
2021-07-18 10:50:57  [0] Setting up table 'planet_osm_line'
2021-07-18 10:50:57  [0] Setting up table 'planet_osm_polygon'
2021-07-18 10:50:57  [0] Setting up table 'planet_osm_roads'
Processing: Node(6701520k 6131.3k/s) Way(0k 0.00k/s) Relation(0 0.0/s)^C

With the cache set to 16000, it fails:

osm2pgsql --log-level=debug --hstore-all --slim --drop --cache=16000 -G --number-processes=4 --flat-nodes=tmp -H 192.168.2.103 -U osm -d osm2 planet-latest.osm.pbf
2021-07-18 11:09:15  osm2pgsql version 1.5.0
2021-07-18 11:09:15  [0] Database version: 13.3 (Debian 13.3-1)
2021-07-18 11:09:15  [0] PostGIS version: 3.1
2021-07-18 11:09:15  [0] Reading file: planet-latest.osm.pbf
2021-07-18 11:09:15  [0] Started pool with 4 threads.
2021-07-18 11:09:15  [0] Loading persistent node cache from 'tmp'.
2021-07-18 11:09:15  [0] Mid: pgsql, cache=16000
2021-07-18 11:09:15  [0] Setting up table 'planet_osm_nodes'
2021-07-18 11:09:15  [0] Setting up table 'planet_osm_ways'
2021-07-18 11:09:20  [0] Setting up table 'planet_osm_rels'
2021-07-18 11:09:21  [0] Using projection SRS 3857 (Spherical Mercator)
2021-07-18 11:09:21  [0] Using built-in tag transformations
2021-07-18 11:09:21  [0] Setting up table 'planet_osm_point'
2021-07-18 11:09:22  [0] Setting up table 'planet_osm_line'
2021-07-18 11:09:22  [0] Setting up table 'planet_osm_polygon'
2021-07-18 11:09:22  [0] Setting up table 'planet_osm_roads'
Processing: Node(2111760k 5373.4k/s) Way(0k 0.00k/s) Relation(0 0.0/s)[1]    3622294 killed osm2pgsql --log-level=debug --hstore-all --slim --drop

That sudden spike in memory consumption halts progress for about 4-5 seconds until the process gets killed.

With the cache set to its default (800), it does not fail:

11:20:17 » easteregg@t480 ~/osm-import $ osm2pgsql --log-level=debug --hstore-all --slim --drop --cache=800 -G --number-processes=4 --flat-nodes=tmp -H 192.168.2.103 -U osm -d osm2 planet-latest.osm.pbf
2021-07-18 11:20:27  osm2pgsql version 1.5.0
2021-07-18 11:20:27  [0] Database version: 13.3 (Debian 13.3-1)
2021-07-18 11:20:27  [0] PostGIS version: 3.1
2021-07-18 11:20:27  [0] Reading file: planet-latest.osm.pbf
2021-07-18 11:20:27  [0] Started pool with 4 threads.
2021-07-18 11:20:27  [0] Loading persistent node cache from 'tmp'.
2021-07-18 11:20:27  [0] Mid: pgsql, cache=800
2021-07-18 11:20:27  [0] Setting up table 'planet_osm_nodes'
2021-07-18 11:20:27  [0] Setting up table 'planet_osm_ways'
2021-07-18 11:20:28  [0] Setting up table 'planet_osm_rels'
2021-07-18 11:20:28  [0] Using projection SRS 3857 (Spherical Mercator)
2021-07-18 11:20:28  [0] Using built-in tag transformations
2021-07-18 11:20:28  [0] Setting up table 'planet_osm_point'
2021-07-18 11:20:28  [0] Setting up table 'planet_osm_line'
2021-07-18 11:20:28  [0] Setting up table 'planet_osm_polygon'
2021-07-18 11:20:28  [0] Setting up table 'planet_osm_roads'
Processing: Node(3302180k 5185.1k/s) Way(0k 0.00k/s) Relation(0 0.0/s)^C

Thank you for your effort :) I really appreciate the work you guys put into the OSM project!

lonvia commented 3 years ago

What is your shared_buffers setting in PostgreSQL?

verfriemelt-dot-org commented 3 years ago

PostgreSQL 13.3 is running on a different machine with 16 GB of memory and is configured quite conservatively:

synchronous_commit = off
random_page_cost = 1
max_locks_per_transaction = 1124
shared_buffers = 2GB
maintenance_work_mem = 256MB
max_parallel_maintenance_workers = 8
checkpoint_completion_target = 0.9
wal_buffers = 16MB
work_mem = 128MB
min_wal_size = 110MB
max_wal_size = 1GB
max_worker_processes = 20
max_logical_replication_workers = 10
max_parallel_workers_per_gather = 12
max_parallel_workers = 12
effective_cache_size = 12GB
checkpoint_timeout = 5min
wal_level = logical
max_sync_workers_per_subscription = 10
wal_compression = on
max_wal_size = 16GB
effective_io_concurrency = 200
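
For reference, the values actually in effect on the server can be double-checked with psql (host, user and database taken from the osm2pgsql commands above):

$ psql -h 192.168.2.103 -U osm -d osm -c "SHOW shared_buffers;"
$ psql -h 192.168.2.103 -U osm -d osm -c "SELECT name, setting, unit FROM pg_settings WHERE name LIKE '%work_mem%';"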
joto commented 1 year ago

Nothing to do here any more, so closing. For the issue of cache settings, see #1501.