graphframes / graphframes

http://graphframes.github.io/graphframes
Apache License 2.0

Make edgeStorageLevel and vertexStorageLevel configurable #219

Open estebandonato opened 7 years ago

estebandonato commented 7 years ago

Most GraphFrames algorithms are wrappers around GraphX implementations. That's the case for PageRank, LabelPropagation, StronglyConnectedComponents, etc. All of these algorithms use the Pregel API implemented in GraphX. Internally, Pregel caches the vertex and edge RDDs on each iteration. The storage levels Pregel uses for that caching are the ones passed to the Graph.apply method when the GraphX instance is created. The default storage level for both vertex and edge RDDs is MEMORY_ONLY, as we can see in the Spark code:

def apply[VD: ClassTag, ED: ClassTag](
      vertices: RDD[(VertexId, VD)],
      edges: RDD[Edge[ED]],
      defaultVertexAttr: VD = null.asInstanceOf[VD],
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED]

GraphFrames algorithms convert a GraphFrame instance to a GraphX instance before calling the GraphX algorithms, but they do so using the default storage level. This might not be a good choice in environments with limited RAM.

My proposal is to make both edgeStorageLevel and vertexStorageLevel configurable per algorithm. Internally, this would create a GraphX instance with the configured storage levels before passing it to Pregel. This is somewhat similar to feature #213, but for all the algorithms that are wrappers around a GraphX implementation.
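As a rough sketch of the idea at the GraphX level only (this is not GraphFrames code, and the object name `StorageLevelSketch`, the helper `buildGraph`, and the toy data are made up for illustration), the storage levels can be passed to Graph.apply through the parameters shown in the signature above:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object, for illustration only.
object StorageLevelSketch {

  // Builds a tiny GraphX graph whose vertex and edge RDDs target the given
  // storage levels instead of the MEMORY_ONLY default.
  def buildGraph(
      sc: SparkContext,
      edgeLevel: StorageLevel,
      vertexLevel: StorageLevel): Graph[String, Int] = {
    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges: RDD[Edge[Int]] =
      sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))

    Graph(
      vertices,
      edges,
      defaultVertexAttr = "",
      edgeStorageLevel = edgeLevel,
      vertexStorageLevel = vertexLevel)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("storage-level-sketch")
      .getOrCreate()

    // The two levels could differ; here both spill to disk when RAM is short.
    val graph = buildGraph(
      spark.sparkContext,
      edgeLevel = StorageLevel.MEMORY_AND_DISK,
      vertexLevel = StorageLevel.MEMORY_AND_DISK)

    // Caching the graph persists its RDDs at the levels configured above.
    graph.cache()
    println(s"vertices: ${graph.vertices.getStorageLevel}")
    println(s"edges: ${graph.edges.getStorageLevel}")

    spark.stop()
  }
}
```

A per-algorithm setting in GraphFrames would presumably forward two such values when it builds the GraphX instance. How that should look in the GraphFrames API is still open.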

We need this feature, so if you give me the green light I can work on it.

thunterdb commented 7 years ago

@estebandonato that sounds like a great idea. Feel free to submit a PR along the lines of what was done in #213.

felixcheung commented 7 years ago

do you have a specific need for vertex and edge at different storage level?

estebandonato commented 7 years ago

@felixcheung no, I usually use the same storage level for both vertices and edges. However, I would like to leave the API open to allow any combination. Thoughts?

estebandonato commented 7 years ago

I have finally finished PR #225, which addresses this issue. Please review it and let me know your feedback.

estebandonato commented 7 years ago

@thunterdb @felixcheung I was wondering if any of you guys could review PR #225 that addresses this issue and provide feedback. Thanks!

felixcheung commented 7 years ago

yes! will do shortly. thanks for the patience.