Closed karussell closed 11 years ago
Hi Peter, that's right, S/C works only in RAM. Kind regards, Philip
Thanks!
New issue #146.
How much RAM will I need for 1 mio nodes + edges? Assume a simple road graph with lat,lon point connecting by an edge (road with speed value)?
The default implementation will use a lot of memory, because each edge is represented as an object. This class here (https://github.com/uzh/signal-collect/blob/master/src/main/scala/com/signalcollect/examples/EfficientPageRank.scala) uses a more efficient representation (stores only edge id, you would additionally need the speed value, but that should be easy to compress, as there are very few different speed values). The ids are sorted, delta encoded and stored as variable length integers. With this encoding schema we were able to load 1.4 billion vertices and 6.6 billion edges on 4 machines with 66GB each (we did not use all of that). That works out to less than 200 bytes per vertex for this use case. If we assume that your use case has a similar vertex/edge ratio and requires twice as much memory per vertex, then that means you would need approximately 400 MB of RAM. That is with efficient encoding though, the default implementation would surely use a lot more than that.
I just realised that the lat/long coordinates are probably floats. Given that, I think that this encoding scheme would be a lot less efficient and the edges would require more memory (2 floats => 64 bits) than I assumed above.
Thanks for the insights! When you speak about
The ids are sorted, delta encoded and stored as variable length integers.
are you able to randomly access edges? Or is it only possible for nodes where you lookup all associated edges? Or do you have another index which maps edge ids to edge-store file position?
Outgoing edges are not usually manually accessed in the default implementation. They are stored and accessible in the outgoingEdges
map though. When using the special compressed encoding one can only iterate over all edges. It might make sense to start with the default implementation and optimise later, if memory usage is a concern.
I didn't find information about this. Except that the signal-collect is using kryo as serialization library. So will signal-collect work only if I have enough RAM?