This PR removes pyarrow as a required dependency and the option to use pyspark for preprocessing of the built-in datasets. This fixes some breaking changes introduced in #109 and #106.
Pyarrow:
Currently, pyarrow breaks `pip install` on some Linux machines due to a Cython requirement.
It is only needed when exporting embeddings to Parquet files via Pandas. Pandas itself does not require pyarrow; instead, it lets users decide whether to use pyarrow or fastparquet when working with Parquet files. We can follow this same behavior.
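To illustrate the behavior we are adopting: a sketch of how pandas defers the Parquet backend choice to the user (the DataFrame contents and file name here are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"src": [0, 1], "dst": [1, 2]})

# pandas does not hard-depend on pyarrow. With engine="auto" (the
# default), to_parquet tries pyarrow first, falls back to fastparquet,
# and raises ImportError only if neither is installed.
try:
    df.to_parquet("edges.parquet", engine="auto")
except ImportError:
    # Neither backend is installed; Parquet export is simply
    # unavailable, but nothing else in pandas breaks.
    pass
```

Dropping pyarrow from our required dependencies gives users the same choice: install pyarrow or fastparquet only if they actually export embeddings to Parquet.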
Pyspark:
Pyspark was previously changed to an optional dependency; however, that change broke preprocessing of our built-in datasets under the default pip install.
Importing a built-in dataset without pyspark installed raises an ImportError.
We don't need to use pyspark for our built-in datasets, as they can be assumed to fit on a single machine.
If users want to perform preprocessing with Spark, they can call the SparkEdgeListConverter from a Python script, or use the command-line preprocessor for custom datasets.