oceanprotocol / pdr-backend

Instructions & code to run predictoors, traders, more.

[Lake][I/O] Techspike - Middleware for parquet appending/mutating/other #572

Closed by idiom-bytes 4 months ago

idiom-bytes commented 5 months ago

Motivation

Parquet files work w/ partitioning, but you need a deliberate approach to it. As we work towards building an incremental pipeline, we need a way to append records rather than what we have today: we simply append data and write out a whole new parquet file. As data grows, this I/O will become more expensive.

Towards a solution

Rather than rewriting the whole file for every new batch of records, parquet/polars supports partitions. This works very similarly to a regular .parquet file, except you have a file_partitions.parquet folder containing many files, which polars can read natively by pointing at the folder name and treating it like a regular file. Just like any other db, your partitioning can impact performance.

To build the partitions effectively, perhaps we should look at some thin middleware like fastparquet or duckdb to handle the partitioning/bucketing.

Potential solutions

As we look to support enough tables + size on-disk, we'll need a solution sooner rather than later to avoid long disk I/O times.

First, Apache Arrow, which is the standard polars is built on, supports a broad approach to partitioning as described below, by using pyarrow.parquet.write_to_dataset.


https://arrow.apache.org/cookbook/py/io.html#writing-partitioned-datasets
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_to_dataset.html
https://stackoverflow.com/questions/76121937/how-to-append-new-data-to-an-existing-parquet-file
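A minimal sketch of that pyarrow route, assuming an illustrative lake folder (pdr_lake/predictions) and columns (pair, timestamp, close) rather than the real lake schema: each write_to_dataset call adds new files under the partition folders, and polars reads the whole folder back as one table.

```python
# Sketch only: append a new batch of rows into a hive-partitioned parquet
# dataset with pyarrow, then read the whole folder back with polars.
# Paths and column names are illustrative.
import polars as pl
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

new_rows = pa.table({
    "pair": ["ETH/USDT", "BTC/USDT"],
    "timestamp": [1700000000, 1700000300],
    "close": [2000.5, 37000.1],
})

# Adds new files under pdr_lake/predictions/pair=.../ without rewriting
# the files that are already there.
pq.write_to_dataset(new_rows, root_path="pdr_lake/predictions",
                    partition_cols=["pair"])

# Point polars at the folder and treat it like one table.
dataset = ds.dataset("pdr_lake/predictions", format="parquet", partitioning="hive")
df = pl.scan_pyarrow_dataset(dataset).collect()
```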

Second, Fastparquet append/write w/ partitions: https://fastparquet.readthedocs.io/en/latest/quickstart.html#writing
Polars read parquet w/ partitions: https://docs.pola.rs/py-polars/html/reference/api/polars.read_parquet.html
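For comparison, a sketch of the fastparquet route under the same illustrative schema; file_scheme="hive" plus append=True is what gives the incremental behaviour (fastparquet works on pandas frames rather than polars ones).

```python
# Sketch only: fastparquet writes/appends a hive-partitioned dataset from
# pandas DataFrames. Paths and columns are illustrative.
import pandas as pd
from fastparquet import write

batch_1 = pd.DataFrame({
    "pair": ["ETH/USDT"], "timestamp": [1700000000], "close": [2000.5],
})
batch_2 = pd.DataFrame({
    "pair": ["BTC/USDT"], "timestamp": [1700000300], "close": [37000.1],
})

# First write creates the partitioned dataset...
write("pdr_lake/predictions_fp", batch_1,
      file_scheme="hive", partition_on=["pair"])
# ...later batches are appended as extra row groups/files instead of a rewrite.
write("pdr_lake/predictions_fp", batch_2,
      file_scheme="hive", partition_on=["pair"], append=True)
# The folder can then be read with polars the same way as in the pyarrow sketch.
```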

Another potential solution would be to use DuckDB. However, as far as I understand it, we might need to first save the data to DuckDB (db in-memory, persistent storage in an analytics.db file), such that we append to DuckDB and then write new parquet files. DuckDB looks very promising, but leaves gaps as to whether it will let us append/grow incrementally.

DuckDB partitioned writes also seem to implement append in the same way as pyarrow.parquet (using overwrite rules). Please techspike to test/verify.

https://duckdb.org/docs/data/partitioning/hive_partitioning
https://duckdb.org/docs/data/partitioning/partitioned_writes#overwriting
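A sketch of that DuckDB flow, to be verified by the techspike: append into a persistent analytics.db table, then export hive-partitioned parquet from it. Table/column names are illustrative; OVERWRITE_OR_IGNORE is the overwrite rule mentioned above.

```python
# Sketch only: append rows to a persistent DuckDB table, then write
# hive-partitioned parquet files from it. Names are illustrative.
import duckdb

con = duckdb.connect("analytics.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS predictions (
        pair VARCHAR, timestamp BIGINT, close DOUBLE
    )
""")
con.execute("INSERT INTO predictions VALUES ('ETH/USDT', 1700000000, 2000.5)")

# Partitioned export; OVERWRITE_OR_IGNORE is the overwrite rule to verify.
con.execute("""
    COPY predictions TO 'pdr_lake/predictions'
    (FORMAT PARQUET, PARTITION_BY (pair), OVERWRITE_OR_IGNORE 1)
""")
```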

Near-parquet

Polars supports other data-formats and processes (like Delta Lake) that may be able to help us across both stages of the pipeline.

  1. Delta Lake - Upstream/Fetching, where we need to join + mutate records (i.e. database-like) to yield bronze tables (see the sketch after this list).
  2. Arrow/Duck - Downstream/Computing, where we just need to process records incrementally and can append-to-end/rebuild, to yield silver (aggregate) and gold (summary) tables.
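A minimal sketch of the Delta Lake option for the bronze stage, assuming the deltalake package is installed and an illustrative table path/schema: each incremental fetch appends a transaction, and polars reads the table back directly.

```python
# Sketch only: append an incremental fetch to a bronze Delta table and read
# it back with polars. Requires the `deltalake` package; names are illustrative.
import polars as pl

batch = pl.DataFrame({
    "pair": ["ETH/USDT"], "timestamp": [1700000000], "close": [2000.5],
})

# Each fetch appends a new transaction rather than rewriting the table.
batch.write_delta("pdr_lake/bronze/predictions", mode="append")

bronze = pl.read_delta("pdr_lake/bronze/predictions")
```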

DoD

idiom-bytes commented 4 months ago

I found a really interesting approach: a vector db written in Rust that uses the Arrow format as the base layer. Please prioritize tech spiking with that rather than with Iceberg.

Motivation

Benefits

LanceDB as ETL

LanceDB gives you the ability to work on a columnar file format, with everything on-disk and in-process, that handles appending/upserts/etc... and because it's all arrow at the base layer, you get all the serialization benefits and integration w/ the data ecosystem.

It does this by using an inverted index, which is different from how everyone else is doing things... This specific architecture unlocks what we're looking for here. https://lancedb.github.io/lancedb/concepts/index_ivfpq/

LanceDB looks promising as it's all in-process, so there are no extra servers or infra to run (unlike competing Spark solutions); in that respect it's like duckdb, etc... As an example, rather than our db being saved in ".db" or some other db-format, everything is in .arrow files.

https://lancedb.github.io/lancedb/#open-source-and-cloud-solutions

Please look at the following blogpost to see how to query a lance db table w/ polars lazyframe.

https://blog.lancedb.com/lancedb-polars-2d5eb32a8aa3
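Roughly what that blog post describes, sketched with an illustrative table name/schema; to_lance() exposes the underlying arrow-compatible dataset, which polars can scan lazily (exact method names may differ across lancedb versions).

```python
# Sketch only: keep rows in an on-disk LanceDB table, then query it through a
# polars LazyFrame as in the linked post. Names are illustrative.
import lancedb
import polars as pl

db = lancedb.connect("pdr_lake/lance")
tbl = db.create_table(
    "predictions",
    data=[{"pair": "ETH/USDT", "timestamp": 1700000000, "close": 2000.5}],
    mode="overwrite",
)
tbl.add([{"pair": "BTC/USDT", "timestamp": 1700000300, "close": 37000.1}])

# The underlying lance dataset is arrow-compatible, so polars can scan it lazily.
ldf = pl.scan_pyarrow_dataset(tbl.to_lance())
print(ldf.filter(pl.col("pair") == "ETH/USDT").collect())
```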

DuckDB as ETL

DuckDB (and other columnar dbs, whether in-memory or in-process) is still looking like a strong contender. However, it will require a mixture of SQL w/ Python, and it's unlikely to provide clean polars connectors. This means that all pre-filtering of records should likely be done through a DuckDB SQL query, to reduce in-memory requirements for compute/data jobs.
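A sketch of that split, reusing the illustrative analytics.db table from the earlier DuckDB sketch: the heavy filtering happens in SQL inside DuckDB, and only the reduced result is handed to polars.

```python
# Sketch only: pre-filter in DuckDB SQL, then do the compute step in polars on
# the smaller result. Table/column names are illustrative.
import duckdb
import polars as pl

con = duckdb.connect("analytics.db")
filtered = con.sql(
    "SELECT pair, timestamp, close FROM predictions WHERE timestamp >= 1700000000"
).pl()  # convert the DuckDB result to a polars DataFrame

# Downstream polars work only ever sees the filtered rows.
out = filtered.group_by("pair").agg(pl.col("close").mean())
```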

Its storage is fairly compressed and will continue to lead towards effective storage/compute. https://duckdb.org/internals/storage.html

Loading 100 GB of uncompressed CSV files into a DuckDB database file will require 25 GB of disk space, while loading 100 GB of Parquet files will require 120 GB of disk space.

Clickhouse as ETL

Beyond DuckDB, if an OSS, clustered, self-hosted db solution to do ELT is needed, ClickHouse (and others) can serve this purpose.

Other solutions

We know clustered MPPs like Spark/Hadoop or columnar DBs like Redshift will solve large scale problems. This isn't the goal of this research.

idiom-bytes commented 4 months ago

We may need to consider some of the limitations in how Lance exposes polars functionality, such that we can optimize for max performance.

TLDR; "We use the scan_pyarrow_dataset to convert to LazyFrame. Polars is able to push-down some filters to a pyarrow dataset, but the pyarrow dataset expects pyarrow compute expressions while Lance expects SQL strings. This means that we’ve had to disable the filter push-downs, meaning that Polars won’t be able to take advantage of the fast filtering and indexing features of Lance."

To use the features of lance we could create a library of filters:
- filter_by_column
- filter_by_timestamp
- filter_by_datetime
- filter_by_daterange

And then follow a set of steps every time:
1. before doing the polars work, you fast filter/index w/ lance-friendly SQL
2. we take a combination of the filters we created above to generate a lance-SQL statement
3. we submit lance-friendly filters to leverage their engine
4. we get a lazy_frame where we can do_fancy_polars_stuff()
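A sketch of that pattern with hypothetical helper names (filter_by_pair, filter_by_timerange, combine are not existing code): build a lance-friendly SQL string, let Lance do the fast filtering, then hand the already-filtered arrow data to polars.

```python
# Sketch only: hypothetical filter library that emits lance-friendly SQL, so
# the fast filtering happens in Lance before polars touches the data.
import lancedb
import polars as pl

def filter_by_pair(pair: str) -> str:
    return f"pair = '{pair}'"

def filter_by_timerange(start: int, end: int) -> str:
    return f"timestamp >= {start} AND timestamp <= {end}"

def combine(*clauses: str) -> str:
    return " AND ".join(clauses)

db = lancedb.connect("pdr_lake/lance")
tbl = db.open_table("predictions")

# (1)+(2): build the lance-friendly SQL from the filter library
where = combine(filter_by_pair("ETH/USDT"),
                filter_by_timerange(1700000000, 1700600000))

# (3): push the predicate into Lance's engine (filter= takes a SQL-like string)
arrow_tbl = tbl.to_lance().to_table(filter=where)

# (4): wrap the already-filtered data in a LazyFrame and do_fancy_polars_stuff()
ldf = pl.from_arrow(arrow_tbl).lazy()
```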

kdetry commented 4 months ago

I have spent some time looking around at technologies, and my considerations are below.

Database Technologies

1. DuckDB

2. ClickHouse (I think this one is a fit for us)

3. LanceDB

4. CouchDB

5. MariaDB ColumnStore

6. Apache Kudu

File Formats

1. Parquet

2. ORC

3. CSV or JSON

Considerations for Scalability

idiom-bytes commented 4 months ago

I'm looking specifically for an arrow-based approach that has polars-native integration such that we can do e2e testing on top of simple, on-disk data, without having to resort to integrating w/ a local db.

I'm also looking for an in-process approach such that we don't have to pass down large memory and compute requirements to our users, where they would require a cluster off-the-bat just to support a basic filesystem + queries.

ClickHouse + other DBs require multiple servers to keep the FS + querying available. Arrow-based systems run on top of S3; they only need whatever is querying to run the query. It's serverless.

This passes a much smaller set of requirements and scope down to our users.

LanceDB is built on top of the arrow columnar format; at the base level it's a columnar database.

The discussion is whether we want to continue focusing on a serverless, in-process, arrow-based system that supports all the needs we have and can solve our problems at scale, rather than a cluster-based approach.

idiom-bytes commented 4 months ago

[Further on Lance dovetailing with AI] I'm also now realizing the benefits that a vector db w/ good random sampling can bring to the ML and MPP work, and how one dovetails with the other more and more...

[Reading from filesystem for training] Example... downstream, spark/cluster/data jobs will have to read/sample data to train and test the model, i.e. sample(x_train, y_train), which is exactly what a vector db is designed to do efficiently. For other db models (sql, couch), we'll either need an inefficient connector or, because there are many things hitting the cluster, more and more provisioning.

With lancedb, provisioning is done at the process level (i.e. every predictoor reading from disk pays their part) rather than at the cluster level (shared service).

[Distributed computing and MPP] Due to arrow/provisioning for ML workloads, it's incredibly easy to sample/shuffle data for distributed workloads. Other dbs like redshift/etc... do not have this. All of this data also needs to exist on disk (i.e. S3) so it can be sampled/shuffled by the system.... this is one of those ops + infra + serialization steps that takes a ton of time to get implemented and costs a ton of time + $$$ every time you run.

https://lancedb.github.io/lance/integrations/tensorflow.html#distributed-training-and-shuffling
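A rough sketch of that sampling idea, with an illustrative dataset path and columns: Lance's fast random access by row index means a training job can draw a shuffled mini-batch straight from disk instead of exporting/loading the whole table.

```python
# Sketch only: draw a random training batch directly from an on-disk lance
# dataset via row-index take(). Path and columns are illustrative.
import lance
import numpy as np

dataset = lance.dataset("pdr_lake/lance/predictions.lance")
n_rows = dataset.count_rows()

# shuffle/sample row ids, then fetch just those rows from disk
idx = np.random.default_rng(0).choice(n_rows, size=min(256, n_rows), replace=False)
batch = dataset.take(idx.tolist(), columns=["timestamp", "close"])  # pyarrow Table

x_train = batch.column("timestamp").to_numpy()
y_train = batch.column("close").to_numpy()
```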

Rather, our paradigm, because of polars + arrow + lance, would be to have all of this already computed and on-disk. This means that there is a lot less "exporting the db, loading to clusters, bootstrapping the filesystem". Everything is already partitioned and structured on-disk, ready to go.

So again, things that look like they will be solved now "by just using OLTP, or a cluster" will, I believe, lead to a ton of hurdles and costs further down the timeline. 95% of our pipeline is OLAP, and beyond that, it's AI/ML. The more I look at it, the more this makes sense to me.

I believe the recommendations above can help us separate our core ETL pipelines such that they can be run end-to-end in a serverless environment, at scale.

Outside this core layer of the onion, we can then implement other compute solutions and platforms, such as Spark, etc...

idiom-bytes commented 4 months ago

Research and tech spikes delivered. Decision is to build out the infra w/ DuckDB.