uzh / signal-collect

A framework for scalable graph computing.
www.signalcollect.com
Apache License 2.0
148 stars 32 forks source link

in-memory or on-disc or both #145

Closed karussell closed 11 years ago

karussell commented 11 years ago

I didn't find information about this. Except that the signal-collect is using kryo as serialization library. So will signal-collect work only if I have enough RAM?

pstutz commented 11 years ago

Hi Peter, that's right, S/C works only in RAM. Kind regards, Philip

karussell commented 11 years ago

Thanks!

pstutz commented 11 years ago

New issue #146.

karussell commented 11 years ago

How much RAM will I need for 1 mio nodes + edges? Assume a simple road graph with lat,lon point connecting by an edge (road with speed value)?

pstutz commented 11 years ago

The default implementation will use a lot of memory, because each edge is represented as an object. This class here (https://github.com/uzh/signal-collect/blob/master/src/main/scala/com/signalcollect/examples/EfficientPageRank.scala) uses a more efficient representation (stores only edge id, you would additionally need the speed value, but that should be easy to compress, as there are very few different speed values). The ids are sorted, delta encoded and stored as variable length integers. With this encoding schema we were able to load 1.4 billion vertices and 6.6 billion edges on 4 machines with 66GB each (we did not use all of that). That works out to less than 200 bytes per vertex for this use case. If we assume that your use case has a similar vertex/edge ratio and requires twice as much memory per vertex, then that means you would need approximately 400 MB of RAM. That is with efficient encoding though, the default implementation would surely use a lot more than that.

pstutz commented 11 years ago

I just realised that the lat/long coordinates are probably floats. Given that, I think that this encoding scheme would be a lot less efficient and the edges would require more memory (2 floats => 64 bits) than I assumed above.

karussell commented 11 years ago

Thanks for the insights! When you speak about

The ids are sorted, delta encoded and stored as variable length integers.

are you able to randomly access edges? Or is it only possible for nodes where you lookup all associated edges? Or do you have another index which maps edge ids to edge-store file position?

pstutz commented 11 years ago

Outgoing edges are not usually manually accessed in the default implementation. They are stored and accessible in the outgoingEdges map though. When using the special compressed encoding one can only iterate over all edges. It might make sense to start with the default implementation and optimise later, if memory usage is a concern.