lalabuy948 / PhoenixAnalytics

📊 Plug and play analytics for Phoenix applications.
https://theindiestack.com/analytics
Apache License 2.0

Possibility to use Parquet for easy S3 backup? #27

Open ruslandoga opened 1 month ago

ruslandoga commented 1 month ago

👋

This is just a question, not a feature request or issue or anything of the sort :)

I'm just starting to learn DuckDB and I wonder if it's a lot of work to make PhoenixAnalytics work with Parquet files? The upside is that they can be easily backed up to S3 (immutable), are quite storage-efficient (40% smaller according to https://benchmark.clickhouse.com/) and almost as fast (60% slower according to https://benchmark.clickhouse.com/) as DuckDB's custom storage format, and have zero load time. The downside is that they might require compaction, but at the same time that enables TTL.

[Screenshot 2024-10-22 at 19:20:22]

Or maybe there are other ways to stream / backup DuckDB to S3, like in https://motherduck.com/blog/differential-storage-building-block-for-data-warehouse/ or https://litestream.io?

lalabuy948 commented 1 month ago

Hi @ruslandoga, yes, I have this in my plans and I already made a proof of concept which worked quite well, but I'm not sure yet how to make it efficient for a lot of writes.

As for backups, the simplest approach would be to set up a cron job that dumps the DuckDB file and uploads the resulting Parquet file to S3.
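
For example, the dump side of that cron job could be a single DuckDB COPY. This is only a rough sketch, assuming a requests table and made-up file names; the upload itself would still be a separate step (aws s3 cp or similar):

```sql
-- Rough sketch: attach the live database read-only and dump one table as Parquet.
-- 'phoenix_analytics.duckdb' and 'requests' are placeholder names.
ATTACH 'phoenix_analytics.duckdb' AS analytics (READ_ONLY);

COPY (SELECT * FROM analytics.requests)
TO '/tmp/phoenix_analytics_2024_10_22.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD);
```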

ruslandoga commented 1 month ago

👋 @lalabuy948

> but I'm not sure yet how to make it efficient for a lot of writes.

I'm curious about what specific efficiency concerns you're facing. Do you still have the PoC code available? I'd love to take a look and experiment with it!

Regarding DuckDB, I'm not entirely sure if it supports streaming writes like this:

```sql
COPY (stdin) TO 'phoenix_analytics_part_{YYYYMMDD}_{UUID}.parquet.lz4' (FORMAT PARQUET, COMPRESSION LZ4_RAW)
```

However, since Parquet supports writing in row groups, it should technically be feasible, as it would resemble how ClickHouse handles INSERT operations. If DuckDB doesn't support it directly, there might be other options, such as using libraries like https://docs.rs/parquet/latest/parquet/ or https://github.com/jorgecarleitao/parquet2 or even a native Elixir implementation :)

Another possibility could be writing to a CSV buffer or a temporary table, and then using DuckDB's COPY to convert it to Parquet.
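
Something like this, just to make the buffer-table idea concrete (the table, columns and file names are made up):

```sql
-- Invented staging table: small INSERTs accumulate here between flushes.
CREATE TABLE IF NOT EXISTS requests_buffer (
    path        VARCHAR,
    status_code INTEGER,
    duration_ms DOUBLE,
    inserted_at TIMESTAMP
);

-- Periodic flush: write the buffered rows out as one Parquet part...
COPY requests_buffer
TO 'parts/phoenix_analytics_part_20241022.parquet'
(FORMAT PARQUET, COMPRESSION LZ4_RAW);

-- ...and clear the buffer once the part is safely written (and uploaded).
DELETE FROM requests_buffer;
```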

lalabuy948 commented 1 month ago

๐Ÿ™Œ๐Ÿป @ruslandoga, there is duckdb support of httpfs, you can take a look, some sketches saved there: priv/repo/duck_s3.exs

You can research more here.

Let me know which ideas you're going to try and we can work together on a solution! Feel free to reach out to me over email or on X / Telegram.

ruslandoga commented 1 month ago

> Let me know which ideas you're going to try and we can work together on a solution!

I think I'd like more control over catalogs and uploads. I think DuckDB's httpfs does a ListObjectsV2 on each read_parquet from S3 that contains wildcards, and that can get expensive, and Hive-style partitioning might be suboptimal (from my limited experience).

So I would probably try writing Parquet to disk (possibly not through DuckDB) and uploading it to S3 from Elixir, e.g. every ~20 minutes, which would keep the number of PUTs roughly within the free tier. A single process could do that. Since it would have the full view of which files are available and where (S3 or disk), it could also act as a table catalog with some support for predicate pushdown: if the user wants to see a site's analytics for the last month, the catalog hands DuckDB only the last month's worth of that site's Parquet files. That same process could also perform compaction (e.g. merging the previous day's Parquet files into a single one), since it can hide incomplete parts from DuckDB. A sketch of the queries is below.
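
To make the catalog part concrete, the query it hands to DuckDB could look roughly like this (paths and column names are invented), and compaction would just be another COPY over a finished day's parts:

```sql
-- The catalog picks only the parts that match the requested site and date range,
-- mixing already-uploaded S3 parts with parts still sitting on local disk.
SELECT date_trunc('day', inserted_at) AS day, count(*) AS hits
FROM read_parquet([
    's3://my-bucket/site_a/part_20241001_a1.parquet',
    's3://my-bucket/site_a/part_20241015_b2.parquet',
    'parts/site_a/part_20241022_c3.parquet'  -- not uploaded yet, read from disk
])
GROUP BY 1
ORDER BY 1;

-- Compaction of the previous day's small parts into a single file:
COPY (SELECT * FROM read_parquet('parts/site_a/part_20241021_*.parquet'))
TO 'parts/site_a/day_20241021.parquet' (FORMAT PARQUET);
```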

That's what I'd like to try :)