marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

Pyspark preprocessor outputs to s3 #124

Open basavaraj29 opened 1 year ago

basavaraj29 commented 1 year ago

The preprocessor now writes the processed edge and node data to S3, but the output is split across many part files, which need to be combined into a single file.

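The following call fails when the part files are small:

s3_obj.merge(output_filename, files_list)

It throws EntityTooSmall, presumably because the merge is done as an S3 multipart upload, where every part except the last must be at least 5 MB.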
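One possible workaround, sketched below, is to stream the part files through the client into a single output object instead of merging them server-side. This assumes `s3_obj` is an fsspec-style filesystem (such as `s3fs.S3FileSystem`) whose `open(path, mode)` returns a file-like object, and that the part files have no header row and each end with a newline. The name `concat_files` and the `chunk_size` default are hypothetical, not part of the preprocessor.

```python
import shutil


def concat_files(fs, output_path, input_paths, chunk_size=16 * 1024 * 1024):
    """Stream every input file, in order, into one output file.

    `fs` is assumed to be an fsspec-style filesystem exposing
    open(path, mode). The data passes through the client, so the size of
    the individual source files does not matter, at the cost of
    downloading and re-uploading everything.
    """
    with fs.open(output_path, "wb") as out:
        for path in input_paths:
            with fs.open(path, "rb") as src:
                # Copy in fixed-size chunks to keep memory usage bounded.
                shutil.copyfileobj(src, out, chunk_size)
```

Usage would look like `concat_files(s3_obj, output_filename, files_list)`. The trade-off is bandwidth: unlike a server-side merge, every byte is read and written by the machine running the preprocessor.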

Once we have a single file, we can look into converting it to binary. Alternatively, we could define a custom writer that outputs the binary format directly, without the intermediate CSV files.
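As a rough illustration of the conversion step, the sketch below packs CSV rows of integer IDs into fixed-width binary. The layout used here (row-major, little-endian 32-bit signed integers, no header) is an assumption for illustration only and may not match the format Marius expects; the name `csv_edges_to_binary` is hypothetical.

```python
import csv
import io
import struct


def csv_edges_to_binary(csv_text, delimiter=","):
    """Pack each row of integer IDs as consecutive little-endian int32 values.

    Assumed layout (not confirmed against Marius): rows are written back
    to back with no header or separator.
    """
    out = bytearray()
    for row in csv.reader(io.StringIO(csv_text), delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        out += struct.pack(f"<{len(row)}i", *(int(v) for v in row))
    return bytes(out)
```

A custom writer would apply the same packing per partition, though the partitions would still need to be combined afterwards.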