Open mengxr opened 8 years ago
Is anyone currently working on this issue?
I have started writing Java examples that load text files into RDDs and convert them into DataFrames. This functionality could also serve as the basis for a GFR/GFW module.
I hope it is not a problem to implement such a component in Java to begin with. Scala and Python versions would follow later anyway.
My initial approach is based on a "GraphFrame-Descriptor". Such a descriptor holds all the metadata needed to load either the edges alone or both nodes and edges from arbitrary sources. The first implementation can simply work on Hive tables, which abstract away low-level storage details such as the Avro or Parquet format and partitioning.
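To make the descriptor idea more concrete, here is a rough sketch of what I have in mind. It is plain Java without Spark dependencies, and every name in it (`GraphFrameDescriptor`, `edgesOnly`, `verticesAndEdges`, `edgeQuery`, `vertexQuery`) is hypothetical and not part of GraphFrames. It assumes that the metadata consists only of Hive table names.

```java
import java.util.Optional;

// Hypothetical sketch: none of these names exist in GraphFrames today.
// The descriptor only records WHERE the graph lives; it does not load anything.
final class GraphFrameDescriptor {
    private final String edgeTable;   // Hive table holding the edges (required)
    private final String vertexTable; // Hive table holding the vertices, or null for edges-only

    private GraphFrameDescriptor(String edgeTable, String vertexTable) {
        if (edgeTable == null || edgeTable.isEmpty()) {
            throw new IllegalArgumentException("an edge table is required");
        }
        this.edgeTable = edgeTable;
        this.vertexTable = vertexTable;
    }

    // Graph described by its edges alone.
    static GraphFrameDescriptor edgesOnly(String edgeTable) {
        return new GraphFrameDescriptor(edgeTable, null);
    }

    // Graph described by separate vertex and edge tables.
    static GraphFrameDescriptor verticesAndEdges(String vertexTable, String edgeTable) {
        if (vertexTable == null || vertexTable.isEmpty()) {
            throw new IllegalArgumentException("a vertex table is required");
        }
        return new GraphFrameDescriptor(edgeTable, vertexTable);
    }

    // Query text a loader could hand to a Hive-enabled Spark session.
    String edgeQuery() {
        return "SELECT * FROM " + edgeTable;
    }

    // Empty when the descriptor covers edges only.
    Optional<String> vertexQuery() {
        return Optional.ofNullable(vertexTable).map(t -> "SELECT * FROM " + t);
    }
}
```

A loader could then turn `edgeQuery()` and `vertexQuery()` into DataFrames, so the storage format and partitioning stay hidden behind the Hive tables.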
I am not sure what should be posted first: a quick hack as a pull request, or a design document? For now, I will start with a case study and implement the loader and writer in Java as a starting point for deeper discussion.
They should have similar APIs to DataFrameReader/Writer, supporting different data sources with Parquet being the default.
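As a rough illustration of such an API shape, the following plain-Java sketch mirrors the chained `format(...)` / `option(...)` style of DataFrameReader and defaults to Parquet. The class name `GraphFrameReaderSketch` and its accessor methods are hypothetical; the sketch only records settings and does not read any data.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a reader with a DataFrameReader-like fluent API.
// It only collects configuration; no Spark classes are involved.
final class GraphFrameReaderSketch {
    private String format = "parquet"; // default source, as proposed in the issue
    private final Map<String, String> options = new LinkedHashMap<>();

    // Select the data source, e.g. "parquet", "json", "csv".
    GraphFrameReaderSketch format(String source) {
        if (source == null || source.isEmpty()) {
            throw new IllegalArgumentException("source must not be empty");
        }
        this.format = source;
        return this;
    }

    // Add a source-specific option; later values overwrite earlier ones.
    GraphFrameReaderSketch option(String key, String value) {
        options.put(key, value);
        return this;
    }

    String currentFormat() {
        return format;
    }

    Map<String, String> currentOptions() {
        return Collections.unmodifiableMap(options);
    }
}
```

A real implementation would presumably pass the collected format and options on to `spark.read().format(...).options(...).load(path)`, once for the vertices and once for the edges, and a writer would mirror this against DataFrameWriter.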