twitter / cassovary

Cassovary is a simple big graph processing library for the JVM
http://twitter.com/cassovary
Apache License 2.0
1.05k stars 150 forks source link

Allow node ids to be Long #66

Open pankajgupta opened 10 years ago

pankajgupta commented 10 years ago

This is related to https://github.com/twitter/cassovary/issues/36 but different. We want the ability of node ids to be Long. The use case is when storing a portion of a huge graph in memory on one machine. The nodes can be too many to be able to map to int. Even though the number of nodes on a single machine are small enough, they can refer to arbitrary remote nodes in their adjacency list.

This can be done by making Node and Graph to be of type[@specialized(Int, Long) V], making 'id' to be of type V and doing suitable refactoring. When the node is of type Long, a hash can be supplied to map node id to Node as required by getNodeById()

szymonm commented 10 years ago

I think, I don't fully understand this issue.

If we want to store a part of a larger graph that uses Long ids, why not to use NodeRenumbere[Long <-> Int] and store an Int graph with ids renumbered? You write that sometimes you want to store a part of a larger graph with edges to other parts of the graph, not stored on this machine and thus you need Long ids of nodes that are on different machine. Obviously you cannot use NodeRenumberer for them, because there can be too much of them. However, if we store edges, that connect nodes that are not in this graph, it's not a graph anymore. What's more, we never actually use this edges in our computations inside one machine, so maybe we should keep them outside the graph implementation? For example by having GraphWithOutsideEdges that is actually a triple of (Graph, NodeRenumberer[Long], outsideEdges: Array[Long]), we can denote something that behaves like a subgraph of Graph[Long], that performs computations only inside the subgraph, but stores also some outside edges.

Second thing, if we do #36, we gonna have a logic of a NodeRenumberer[Long <-> Int] in 2 places: node renumberer and hashes to use with getNodeById(). Or am I wrong?

pankajgupta commented 10 years ago

On May 16, 2014, at 2:24 AM, Szymon notifications@github.com wrote:

I think, I don't understand correctly this issue.

If we want to store a part of a larger graph that uses Long ids, why not to use NodeRenumbere[Long <-> Int] and store an Int graph with ids renumbered? You write that sometimes you want to store a part of a larger graph with edges to other parts of the graph, not stored on this machine and thus you need Long ids of nodes that are on different machine. Obviously you cannot use NodeRenumberer for them, because there can be too much of them. However, if we store edges, that connect nodes that are not in this graph, it's not a graph anymore. What's more, we never actually use this edges in our computations inside one machine, so maybe we should keep them outside the graph implementation? For example by having GraphWithOutsideEdges that is actually a triple of (Graph, NodeRenumberer[Long], outsideEdges: Array[Long]), we can denote something that behaves like a subgraph of Graph[Long], that performs computations only inside the subgraph, but stores also some outside edges.

Well, it is still a graph. Just that some nodes either do not have any edges, or have edges outside this machine (this is decided by some higher level logic, presumably)

The triple is an interesting idea. But it is less straightforward than just having long node ids. So I wonder if there is a big benefit here compared to just keeping Long node ids for all nodes in this case?

Second thing, if we do #36, we gonna have a logic of a NodeRenumberer[Long <-> Int] in 2 places: node renumberer and hashes to use with getNodeById(). Or am I wrong?

I am not sure I understand.

— Reply to this email directly or view it on GitHub.

szymonm commented 10 years ago

I assume that a graph will be still an array of Node (+ some additional info like nodeCount...). We want to store all nodes with ids of type Long in the array, so we have to use something like NodeRenumberer inside the graph to map Long ids to Int. This is what you mean hashing? If not, how do you expect getNodeById to work?

pankajgupta commented 10 years ago

We could do it that way with similar logic to Renumberer that maps Long to Int which is then looked up in the array. If you do that way, yes, it would make sense to share the main logic with Renumberer.

You could also have the graph to directly be the hashtable of Nodes (not array) where key is Long id, and value is the Node of each hash entry. Something other than a hash (such as a trie) might be more performant instead, so it would be nice to structure the code to allow easy variations like that.

szymonm commented 10 years ago

What about space consumptions and overhead problems when we use HashTable? They are not specialized. Using Renumberer does not solve that problem, as we use HashTable in it.

pankajgupta commented 10 years ago

yeah, HashTable in Renumberer is not that great. But we will need benchmarks to really know the performance/space penalty.

pankajgupta commented 10 years ago

After email discussion, it was decided to not change the current interface to make Long ids, and instead introduce a completely new Graph interface. The issue remains open to make this new interface :)

szymonm commented 9 years ago

The priority of this change is to allow generic ids with Int, Long specialization while maintaining as much of the current performance as possible.

While it is relatively easy to generalize ids, it is not that easy utilize specialization and keep the current performance due to the following reasons: 1) Scala's collections are not specialized, 2) fastutil collections are hardly interoperable with Scala's.

There are 3 possible ways I can think of:

1) Sacriface speed and memory for readability and use only Scala's collections. Use specialization, where it is possible and try to use http://scala-miniboxing.org/ to do automatic specialization.

2) To overcome 1) we could use https://github.com/non/debox/. The library is, however, not stable yet and depends on Spire (that is quite big itself). And still is probably slower and uses more space than fastutils (could not find reliable coparison).

3) To overcome both 1) and 2) we could introduce our own specialized interfaces for basic datatypes we use (seq, map) and use wrapers over fastutils implementations. Cons of this solution are code duplication (with Scala's interfaces) and need to convert our collections to Scala's when returning them to the user.

Any thoughts, @pankajgupta ??

szymonm commented 9 years ago

For now, I'm moving towards 1) while trying to keep current usage patterns of fastutils where possible.

pankajgupta commented 9 years ago

Yes, 1) seems best for now.

On Thu, Sep 17, 2015 at 5:04 AM, Szymon notifications@github.com wrote:

For now, I'm moving towards 1) while trying to keep current usage patterns of fastutils where possible.

— Reply to this email directly or view it on GitHub https://github.com/twitter/cassovary/issues/66#issuecomment-141056715.

szymonm commented 9 years ago

Turns out generalization will be complicated even if we don't use fastutils. In few places we use int ordering and utils from java.util.Arrays (see for instance DirectedPath, DirectedGraph). We need to have Id <: Ordered. Ordered is again not specialized in Scala...

What do you think of using Spire, @pankajgupta ? We get specialized Order and solve problems with arrays (https://github.com/non/spire/blob/master/core/shared/src/main/scala/spire/std/array.scala)

pankajgupta commented 9 years ago

I don't think we want dependency on Spire which seems too big and written for different purpose.

Would be great to supply our own implementations where needed. Didn't find java.util.Arrays used in DirectedGraph, in DirectedPath we can easily write our own code. Regarding Ordering/Ordered, what's the problem exactly -- that it will use autoboxing when used?

On Fri, Sep 18, 2015 at 9:47 AM, Szymon notifications@github.com wrote:

Turns out generalization will be complicated even if we don't use fastutils. In few places we use int ordering and utils from java.util.Arrays (see for instance DirectedPath, DirectedGraph). We need to have Id <: Ordered. Ordered is again not specialized in Scala...

What do you think of using Spire, @pankajgupta https://github.com/pankajgupta ? We get specialized Order and solve problems with arrays ( https://github.com/non/spire/blob/master/core/shared/src/main/scala/spire/std/array.scala )

— Reply to this email directly or view it on GitHub https://github.com/twitter/cassovary/issues/66#issuecomment-141504930.

szymonm commented 9 years ago

1) It's not that easy to supply our own implementations in many places. Consider Arrays.equals. We have id type @specialized(Int, Long) Id, that we don't want to box for Int, Long and don't care about the performance for other types. Currently, we use Arrays.equals(int[], int[]) that is basically: check for all elements equality using ==, which is non boxed equality for Int. The same is with Long. For Id (generic type) we need another approach. Here we need to use method equals for elements equality. We could not use something like:

def ourEquals[T: TypeTag](a: Array[T], b: Array[T]) { T match {
   case Int => ....
   case Long => ....
   case AnyRef => ....
   }
}

because checking type using reflection (TypeTag) will be much more expensive than boxing. The only solution I found to solve this problem is specializing ourEquals using implicit conversions as described here: http://stackoverflow.com/questions/29468341/how-can-one-provide-manually-specialized-implementations-with-scala-specializati This looks a bit complex and out of scope for cassovary. And this problem is already solved in spire: they introduce their own specialized trait Eq https://github.com/non/spire/blob/master/core/shared/src/main/scala/spire/algebra/Eq.scala that we could use for Ids and use this equality.

2) We have similar problem with Ordered - we need Id <: Ordered[Id]. However, Ordered trait is not specialized in Scala. Then, every time compareTo is called Ints will be boxed. Because of this, we needed to write manually compare methods in VisitsComparator and PrevNbrComparator and not just use Ordering[Int, Int].on(infoMap.get(_)). With generic Ids, we need ordering in many more places. For this reasons spire uses it's own trait Order (https://github.com/non/spire/blob/master/core/shared/src/main/scala/spire/algebra/Order.scala) that is specialized.

Similar concepts are used in few other places: https://github.com/storm-enroute/reactive-collections -- they want specialized collections, so they use own order type https://github.com/storm-enroute/reactive-collections/blob/master/reactive-collections-core/src/main/scala/scala/reactive/Order.scala https://github.com/rklaehn/abc -- also specialized collections but use spire for that. https://github.com/non/cats -- use spire for specialized collections.

pankajgupta commented 9 years ago

A radical thought which I haven't thought through fully -- perhaps switch to using Long ids everywhere, however work the code so that it is optimized for ints when node ids do fit in an int. So:

  1. Give up on non-numeric ids. I think this is okay. Everything else can be (perhaps should be) managed by labels.
  2. All interfaces e.g., getNodeById etc. use Long ids.
  3. The fact that internally ints are being used can be reduced to be an "implementation detail". e.g., the internal representation could be whatever but what is returned by an interface is a Long when node id is desired, and Seq[Long](or Array[Long]) when edges are desired.
  4. To start with we can even ask in the graph constructor what the max id is, and if it fits in an int -- use int internally. This could perhaps be easily deduced as well.

On Mon, Sep 21, 2015 at 12:12 AM, Szymon notifications@github.com wrote:

1) It's not that easy to supply our own implementations in many places. Consider Arrays.equals. We have id type @specialized https://github.com/specialized(Int, Long) Id, that we don't want to box for Int, Long and don't care about the performance for other types. Currently, we use Arrays.equals(int[], int[]) that is basically: check for all elements equality using ==, which is non boxed equality for Int. The same is with Long. For Id (generic type) we need another approach. Here we need to use method equals for elements equality. We could not use something like: def ourEqualsT: TypeTag { T match { case int => .... case long => .... case AnyRef => .... } } because checking type using reflection (TypeTag) will be much more expensive than boxing. The only solution I found to solve this problem is specializing ourEquals using implicit conversions as described here: http://stackoverflow.com/questions/29468341/how-can-one-provide-manually-specialized-implementations-with-scala-specializati This looks a bit complex and out of scope for cassovary. And this problem is already solved in spire: they introduce their own specialized trait Eq https://github.com/non/spire/blob/master/core/shared/src/main/scala/spire/algebra/Eq.scala that we could use for Ids and use this equality.

2) We have similar problem with Ordered - we need Id <: Ordered[Id]. However, Ordered trait is not specialized in Scala. Then, every time compareTo is called Ints will be boxed. Because of this, we needed to write manually compare methods in VisitsComparator and PrevNbrComparator and not just use Ordering[Int, Int].on(infoMap.get(_)). With generic Ids, we need ordering in many more places. For this reasons spire uses it's own trait Order ( https://github.com/non/spire/blob/master/core/shared/src/main/scala/spire/algebra/Order.scala) that is specialized.

Similar concepts are used in few other places: https://github.com/storm-enroute/reactive-collections -- they want specialized collections, so they use own order type https://github.com/storm-enroute/reactive-collections/blob/master/reactive-collections-core/src/main/scala/scala/reactive/Order.scala https://github.com/rklaehn/abc -- also specialized collections but use spire for that. https://github.com/non/cats -- use spire for specialized collections.

— Reply to this email directly or view it on GitHub https://github.com/twitter/cassovary/issues/66#issuecomment-141896042.

szymonm commented 9 years ago
  1. We have boxing problems also with only-numeric ids. If we have @specialized(Int, Long) Id <: Numeric[Id] , we have Numeric not specialized and again the same problems. The only easy way to go is using either Int or Long as ids.

Your idea with using Long ids everywhere outside and Ints only for storing seems nice. There still needs to be a place when we convert ints to long, however. I don't see a clear way to handle that.

I'm starting to believe that fighting with boxing is generally hard (even using hacks) and separate from Cassovary and Spire is the best way for now. For the future we should observe improvements to scala-miniboxing and wait for some good implementation of specialized collections.

szymonm commented 9 years ago

I did some prototype here: https://github.com/szymonm/scala-fastutils/blob/master/benchmark/src/main/scala/Map.scala

Basically, I defined my own specialized Map implementation - FastMap. It is generic but provides basic specialized operations; for ints is backed by fastutils.Int2IntHashMap and can be converted to scala.mutable.Map (then operations are boxed). You can create it using FastMap[Int, Int]().

In the test below I obtained similar performance to fastutils map and 10x improvement comparing to scala.mutable.Map.

It wasn't that hard. Maybe imlementing our own Equality, Ordering, Option?, Seq, Map in the similar way is the way to go.

pankajgupta commented 9 years ago

On Tue, Sep 22, 2015 at 6:39 AM, Szymon notifications@github.com wrote:

  1. We have boxing problems also with only-numeric ids. If we have @specialized https://github.com/specialized(Int, Long) Id <: Numeric[Id] , we have Numeric not specialized and again the same problems. The only easy way to go is using either Int or Long as ids.

Your idea with using Long ids everywhere outside and Ints only for storing seems nice. There still needs to be a place when we convert ints to long, however. I don't see a clear way to handle that.

  • when returning ArrayLong http://for%20example%20by%20Node.neighbors() that is backed by an Array[Int] we will have to convert Array[Int] to Array[Long] that will be O(degree)... Alternatively we could return OurArray with API same as Array[Long] that will wrap the array of integers and convert every int when retrieving it. The problem here is that we will return to the user our own type not conforming with the Scala's collections. This can be confusing I'm afraid (because we cannot subclass Array[Long]).

Agreed that this is an issue. To both this and next issue, I'm wondering if the best might be to make our own class similar to this https://github.com/scala/scala/blob/v2.11.7/src/library/scala/collection/mutable/WrappedArray.scala#L1

which supports seamless Int <-> Long conversion and feels like a Seq so that the usual collection operators are supported. This will imply boxing/unboxing in the API but I am OK with it as it is what we have even right now because we use Seq.

  • This will also be confusing for our algorithms. We don't want to convert neighbors list on the fly when using in traversers. Thus, we will need to pass either Arrray[Int] or Array[Int] to traversers and have a similar switches as in Graph (i.e. holding either Queue of Long or Queue of Int) in many places in our code.

We will generally pass Longs everywhere and only invoke the switch to using Int's when we really need to. So in traverser, we would hold a Queue of Long.

I'm starting to believe that fighting with boxing is generally hard (even using hacks) and separate from Cassovary and Spire is the best way for now. For the future we should observe improvements to scala-miniboxing and wait for some good implementation of specialized collections.

Agreed, the above is a non-trivial code refactoring undertaking so pls take it on only if you feel up to it. I do think that it will be really cool to be general and still support efficient storage.

— Reply to this email directly or view it on GitHub https://github.com/twitter/cassovary/issues/66#issuecomment-142291975.

pankajgupta commented 9 years ago

Nice!

btw, if you write your own collection, be sure to take at least one read of http://docs.scala-lang.org/overviews/core/architecture-of-scala-collections.html

Some design patterns mentioned here might be useful

On Tue, Sep 22, 2015 at 9:30 AM, Szymon notifications@github.com wrote:

I did some prototype here:

https://github.com/szymonm/scala-fastutils/blob/master/benchmark/src/main/scala/Map.scala

Basically, I defined my own specialized Map implementation - FastMap. It is generic but provides basic specialized operations; for ints is backed by fastutils.Int2IntHashMap and can be converted to scala.mutable.Map (then operations are boxed). You can create it using FastMapInt, Int.

In the test below I obtained similar performance to fastutils map and 10x improvement comparing to scala.mutable.Map.

It wasn't that hard. Maybe imlementing our own Equality, Ordering, Option?, Seq, Map in the similar way is the way to go.

— Reply to this email directly or view it on GitHub https://github.com/twitter/cassovary/issues/66#issuecomment-142341315.