marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

Add Support for Parquet File Storage Backend #50

Open JasonMoho opened 3 years ago

JasonMoho commented 3 years ago

Is your feature request related to a problem? Please describe. Marius currently supports a set of file-based backends (such as the FlatFile format) for storing parameters and training data.

Parquet files are commonly used for handling large amounts of data. Currently, if a user has a large amount of training data (edges) that is stored in a parquet file, they will have to convert the file into the flat file format. This conversion process is handled as a preprocessing step and will likely require the data to be copied.

Describe the solution you'd like To avoid unnecessary copies of large amounts of data and expensive preprocessing, we should support a Parquet file backend directly, using https://github.com/apache/parquet-cpp / https://github.com/apache/arrow.

Describe alternatives you've considered A preprocessor step can be written which converts the input Parquet file into the file format required by the FlatFile backend.

Additional context This will add an additional dependency on parquet-cpp/Arrow to the system (this could be a heavy dependency). We should make this dependency optional, as not all users will be operating with Parquet files.

Parquet-cpp has been merged into https://github.com/apache/arrow, so we can use that instead.

JasonMoho commented 3 years ago

Arrow is a heavy dependency. There is no clear way to get it working as a git submodule or a cmake external project without introducing a bunch of other dependencies. It requires boost (even though they claim it doesn't https://arrow.apache.org/blog/2020/07/29/cpp-build-simplification/), which we purposefully removed as a dependency in #37.

The brew install for arrow also fails...

We either bite the bullet and introduce a bunch of dependencies, or we explore alternative methods for supporting Parquet files, such as adding a preprocessing step which converts the Parquet files into the FlatFile format currently supported by Marius. I'm leaning towards the latter; even though it incurs a copy of the data, it is far simpler to implement.

This conversion can be done using pyarrow, for which the pip install works just fine.

shivaram commented 3 years ago

I think it's fine to add a pre-processor for now and revisit this later once Arrow no longer requires Boost, etc.

Mistobaan commented 2 years ago

You could also impose a specific Parquet format and write a very simple parser for that specific format. I don't think you need the full power of Arrow in this project, or a full Python library to exercise the API (e.g. https://github.com/dask/fastparquet/).
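For context on what such a parser would start from: Parquet's outermost container layout is simple. The file begins and ends with the magic bytes `PAR1`, and the 4 bytes before the trailing magic hold the little-endian length of the Thrift-encoded footer metadata. A minimal footer extractor in Python (the hard part a real "simple parser" would still face is decoding that Thrift metadata and the column chunks):

```python
import os
import struct

MAGIC = b"PAR1"

def parquet_footer_metadata(path):
    """Return the raw (Thrift-encoded) footer metadata bytes of a
    Parquet file, validating the header and trailer magic on the way."""
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            raise ValueError("not a Parquet file (bad header magic)")
        # The last 8 bytes are: <uint32 metadata length><MAGIC>.
        f.seek(-8, os.SEEK_END)
        meta_len, magic = struct.unpack("<I4s", f.read(8))
        if magic != MAGIC:
            raise ValueError("not a Parquet file (bad trailer magic)")
        f.seek(-(8 + meta_len), os.SEEK_END)
        return f.read(meta_len)
```

This only locates the metadata; interpreting it still requires a Thrift decoder, which is where the complexity of a hand-rolled parser would live.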

thodrek commented 2 years ago

@Mistobaan indeed this is a great suggestion!