Open estebandonato opened 7 years ago
@estebandonato that sounds like a great idea. Feel free to submit a PR along the lines of what was done in #213.
Do you have a specific need for vertices and edges at different storage levels?
@felixcheung no, I usually use the same storage level for both vertices and edges. However, I would like to leave the API open to allow any combination. Thoughts?
I have finally finished PR #225, which addresses this issue. Please review it and let me know your feedback.
@thunterdb @felixcheung I was wondering if any of you guys could review PR #225 that addresses this issue and provide feedback. Thanks!
yes! will do shortly. thanks for the patience.
Most of the GraphFrames algorithms are wrappers around GraphX implementations. This is the case for PageRank, LabelPropagation, StronglyConnectedComponents, etc. All of these algorithms use the Pregel API implemented in GraphX. Internally, Pregel caches the vertex and edge RDDs on each iteration. The storage levels Pregel uses to cache the vertices and edges are the ones passed to the Graph.apply method when the GraphX graph is created. The default storage level for both vertex and edge RDDs is MEMORY_ONLY, as the signature of Graph.apply in the Spark source shows.
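For reference, the declaration of `Graph.apply` in GraphX looks roughly like the excerpt below. It is reproduced from memory of the Spark source rather than copied from a specific release, so please check it against the Spark version you are using. It is a signature only (body omitted), not a standalone program.

```scala
// Excerpt of the GraphX `Graph` companion object (org.apache.spark.graphx.Graph).
// Both storage-level parameters default to MEMORY_ONLY.
def apply[VD: ClassTag, ED: ClassTag](
    vertices: RDD[(VertexId, VD)],
    edges: RDD[Edge[ED]],
    defaultVertexAttr: VD = null.asInstanceOf[VD],
    edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
    vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED]
```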
GraphFrames algorithms convert a GraphFrame instance to a GraphX graph before calling the GraphX algorithms, but they always use the default storage level. This might not be a good choice in environments with limited RAM.
My proposal is to make both edgeStorageLevel and vertexStorageLevel configurable per algorithm. Internally this would create a GraphX graph with the configured storage levels before passing it to Pregel. This is somewhat similar to feature #213, but for all the algorithms that are wrappers around a GraphX implementation.
We need this feature, so if you give me the green light I can work on it.
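To make the proposal a bit more concrete, here is a minimal sketch of the internal step, assuming Spark GraphX is on the classpath. `StorageLevelSketch` and `buildGraph` are hypothetical names used only for illustration; they are not part of GraphFrames or GraphX, and the real change might look quite different.

```scala
import scala.reflect.ClassTag

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper (illustration only, not an existing GraphFrames API).
object StorageLevelSketch {

  // Builds a GraphX graph whose vertex and edge RDDs will be cached at the
  // requested storage levels instead of the MEMORY_ONLY defaults.
  def buildGraph[VD: ClassTag, ED: ClassTag](
      vertices: RDD[(VertexId, VD)],
      edges: RDD[Edge[ED]],
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] =
    Graph(
      vertices,
      edges,
      defaultVertexAttr = null.asInstanceOf[VD],
      edgeStorageLevel = edgeStorageLevel,
      vertexStorageLevel = vertexStorageLevel)
}
```

An algorithm wrapper could then call something like `StorageLevelSketch.buildGraph(vertices, edges, StorageLevel.MEMORY_AND_DISK, StorageLevel.MEMORY_AND_DISK)` and hand the resulting graph to the GraphX algorithm, so that the caching done by Pregel can spill to disk when memory is tight.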