I've made some changes to reduce memory fragmentation which helps a little. The
main user of memory is the lookup table to convert node ids to lat/lon when
reading the ways.
Currently all the nodes in the osm file are read into memory (a std::vector of
{int id; long x, y;}). This takes 12 bytes per node; France has 122 million
nodes, which works out at c. 1.5 GB. Reallocating this array is where
preprocessing first runs out of memory. I can improve things a little by
reserve()-ing the space for these nodes up front, and France then compiles on
my 2GB machine.
Other space optimisations could be:
1. Read the file in two passes: work out which nodes are needed in the first pass, then load only those nodes in the second. The current drawing code uses about 80% of the nodes, though, so savings would be modest.
2. As above, but allow processing in batches: pass 1 reads the first 1M ways; pass 2 loads the nodes for these ways and writes the ways to a temp file. Repeat until all ways are read, then load all ways from the temp file, sort, index, and write the final db.
I think the second of these is the only way to read an arbitrarily large osm
file (e.g. Europe or the planet), but it would be slow traversing all nodes
multiple times.
Any thoughts?
Original comment by jamesh...@gmail.com
on 14 Feb 2011 at 3:49
This seems to help: the preprocessing gets further than before, but does not
finish yet.
[11:05:43][0] 15800000 ways with 133634762 nodes
is as far as it goes.
Besides implementing a multi-pass approach you could try:
- Store only the IDs of the nodes for each way instead of their coordinates, and do the coordinate lookup on-the-fly. When splitting ways you have to insert new nodes.
- Use a vector that uses realloc for PODs instead of allocate + copy (e.g. QVector, if I am not mistaken). STL vectors have some bad design choices (resize not using realloc, growth doubling the vector size, no shrink, clear() not freeing memory, etc.). Manual resizing might also squeeze out some more memory (my dynamic graph in the Contraction Hierarchies plugin does this: it checks whether enough space is reserved and, if not, enlarges only by a small factor, e.g. 1.2, instead of 2).
The multi-pass approach might be nice to have in the future; however, since
you have to load all ways into memory at some point, your memory savings are
limited.
Original comment by veaac.fd...@gmail.com
on 15 Feb 2011 at 11:05
What do you mean by doing "coordinate lookup on-the-fly"? The only efficient
way I can see of going from node id --> coordinate is to look it up in an
in-memory std::map or equivalent. It could be looked up from disk, but that
would be really slow (worst case, polling the osm.bz2 file; a bit faster from
a custom generated file). Am I missing something obvious?
As far as the multi-pass approach goes, it wouldn't ever be necessary to load
all ways into memory. They can be processed one by one and bucket-sorted into
one of a number of temporary files, each of which will fit into memory for
sorting and indexing. It would be slow, though.
Original comment by jamesh...@gmail.com
on 15 Feb 2011 at 2:24
With on-the-fly lookup I mean the following: right now, for each way, you
store its list of coordinates in the global node array shared by all ways.
Instead you could store just the nodes' IDs and, later, when you want a
node's coordinates, perform the node lookup with a binary search.
Of course, this slows down the whole process if you have to access a way's
coordinates more than once.
By the way, the binary search is far more efficient than a std::map, as
std::map is internally a fully-fledged self-balancing binary tree. This
usually means an overhead of at least 8 bytes per entry.
Original comment by veaac.fd...@gmail.com
on 15 Feb 2011 at 2:31
This issue was closed by revision f4c62ed6fc75.
Original comment by jamesh...@gmail.com
on 21 May 2011 at 2:12
It turns out that none of the more minor changes you suggested allowed a large
enough increase in file size, so I've implemented a multi-pass preprocessor.
It scales poorly with increasing file size, but it can handle France easily,
and even the whole of Europe in under 2h with 2GB of RAM, which should be OK.
N.B. Processing the whole of Europe isn't actually useful at the moment as the
resulting database file size (>4GB) breaks some assumptions in the client code,
and would need a new DB format to fix.
Original comment by jamesh...@gmail.com
on 21 May 2011 at 2:17
Ah, I just hit this issue while processing Europe. The last comment shows that
it's a known issue. Maybe make a new bug out of it?
Original comment by mailsan...@gmx.net
on 18 Jul 2011 at 12:51
Yeah, just go ahead and file a new bug report for that, I'll assign it to James.
Original comment by veaac.fd...@gmail.com
on 18 Jul 2011 at 1:01
Original issue reported on code.google.com by
veaac.fd...@gmail.com
on 10 Feb 2011 at 4:53