utdemir / distributed-dataset

A distributed data processing framework in Haskell.
BSD 3-Clause "New" or "Revised" License
114 stars 5 forks

Introduce Distributed Parquet Reader #26

Open yigitozkavci opened 4 years ago

yigitozkavci commented 4 years ago

This is a WIP PR. For now, https://github.com/yigitozkavci/parquet-hs has to be cloned locally, because parquet-hs hasn't been uploaded to Hackage yet.

Running the example using Nix:

# While inside distributed-dataset directory
$ git clone https://github.com/yigitozkavci/parquet-hs ../parquet-hs

# Start the nix shell
$ nix-shell

# This example reads parquet data from https://yigitozkavci-dd-test-bucket.s3.amazonaws.com/test.parquet. Data is being streamed remotely.
$ cabal new-run example-parquet

...<metadata here>
[Info] Stages: SInit @ParquetValue 1
[Info] Running: SInit @ParquetValue 1
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 0),("nested",ParquetNull),("some_str",ParquetString "zero")]))
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 1),("nested",ParquetObject (MkParquetObject (fromList [("another_levelll",ParquetNull)]))),("some_str",ParquetString "one")]))
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 2),("nested",ParquetObject (MkParquetObject (fromList [("another_levelll",ParquetObject (MkParquetObject (fromList [("j",ParquetInt 16)])))]))),("some_str",ParquetString "two")]))
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 3),("nested",ParquetObject (MkParquetObject (fromList [("another_levelll",ParquetObject (MkParquetObject (fromList [("j",ParquetInt 16)])))]))),("some_str",ParquetString "three")]))
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 4),("nested",ParquetObject (MkParquetObject (fromList [("another_levelll",ParquetObject (MkParquetObject (fromList [("j",ParquetInt 4)])))]))),("some_str",ParquetString "four")]))
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 5),("nested",ParquetObject (MkParquetObject (fromList [("another_levelll",ParquetObject (MkParquetObject (fromList [("j",ParquetInt 4)])))]))),("some_str",ParquetString "five")]))
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 6),("nested",ParquetObject (MkParquetObject (fromList [("another_levelll",ParquetObject (MkParquetObject (fromList [("j",ParquetInt 16)])))]))),("some_str",ParquetString "six")]))
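From the printed values, the `ParquetValue` type exposed by parquet-hs appears to be a sum over objects, strings, ints, and nulls. A minimal sketch reconstructing that shape from the output above (the names are inferred from the `Show` output; the actual definitions in parquet-hs may differ):

```haskell
{-# LANGUAGE OverloadedStrings #-}
module ParquetValueSketch where

import Data.HashMap.Strict (HashMap, fromList)
import Data.Text (Text)

-- Inferred from the example output; the real parquet-hs definitions may differ.
newtype ParquetObject = MkParquetObject (HashMap Text ParquetValue)
  deriving (Show, Eq)

data ParquetValue
  = ParquetObject ParquetObject
  | ParquetString Text
  | ParquetInt Int
  | ParquetNull
  deriving (Show, Eq)

-- The first record printed above, reconstructed as a value:
row0 :: ParquetValue
row0 = ParquetObject (MkParquetObject (fromList
  [ ("some_num", ParquetInt 0)
  , ("nested",   ParquetNull)
  , ("some_str", ParquetString "zero")
  ]))
```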
utdemir commented 4 years ago

This looks great! I'll try to find a dataset online which uses Parquet, so we can have a use case and a good example.

I guess most Parquet files in the wild will be compressed. How hard do you think it'd be to implement at least one compression algorithm? We could start with either Snappy or gzip. I'm happy to give you a hand with the parquet-hs library if you don't want to work on it. By the way, I'm also happy to merge this without compression support.

There are some formatting changes we might want to make here (the main codebase uses Ormolu, so it'd be good to use it here too), and we could have Nix download parquet-hs from GitHub instead of depending on a local clone. But after that, I don't see a reason to keep this as a WIP PR; the whole project is a WIP right now anyway :).
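For reference, a sketch of how parquet-hs could be pinned from GitHub inside the Nix setup instead of relying on a sibling checkout. The `rev` and `sha256` are placeholders, and the overlay wiring is hypothetical; the real integration depends on how this repo's Nix files are structured:

```nix
# Hypothetical overlay fragment; rev/sha256 are placeholders.
haskellOverlay = self: super: {
  parquet-hs = self.callCabal2nix "parquet-hs"
    (pkgs.fetchFromGitHub {
      owner  = "yigitozkavci";
      repo   = "parquet-hs";
      rev    = "<commit-sha>";   # pin a specific commit
      sha256 = "<nix-hash>";     # nix-prefetch-git reports this
    }) { };
};
```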

yigitozkavci commented 4 years ago

> I'll try to find a dataset online which uses Parquet, so we can have a use case and a good example.

That would be really nice. I'm currently generating Parquet files for testing using this script, and I believe that might be fairly limited in terms of variety.

> I guess most parquet files in the wild will be compressed. How hard do you think it'd be to implement at least one compression algorithm?

This is one of the reasons I haven't uploaded parquet-hs to Hackage yet, really. It shouldn't be too hard; it's my next priority when I get back to working on parquet-hs.
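For what it's worth, the gzip case should be nearly free via the `zlib` package, while Snappy would need a binding such as the `snappy` package. A hedged sketch of what a codec dispatch might look like (the `CompressionCodec` constructors mirror the Parquet Thrift spec; the function name and its place in parquet-hs are hypothetical):

```haskell
module CodecSketch where

import qualified Codec.Compression.GZip as GZip
import qualified Data.ByteString.Lazy as BL

-- Codec names as defined in the Parquet Thrift specification.
data CompressionCodec = Uncompressed | Snappy | Gzip
  deriving (Show, Eq)

-- Hypothetical helper: decompress one page's bytes according to the
-- codec recorded in the column chunk metadata.
decompressPage :: CompressionCodec -> BL.ByteString -> Either String BL.ByteString
decompressPage Uncompressed bs = Right bs
decompressPage Gzip         bs = Right (GZip.decompress bs)
decompressPage Snappy       _  = Left "SNAPPY not implemented yet"
```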

> There's some formatting changes we might want to do here (main codebase uses Ormolu, it'd be good to use it here).

Sure! I will update the PR using Ormolu. I was using Brittany, I think, when formatting this.

utdemir commented 4 years ago

I found Amazon Customer Reviews Dataset, which is provided as partitioned, snappy compressed parquet files on S3. It's around 50GB in total (compressed).

I created https://github.com/utdemir/distributed-dataset/issues/27 to gather public datasets we can use.