ruslandoga opened this issue 1 month ago
Hi @ruslandoga, yes, I have this in my plans and I already made a proof of concept which worked quite well, but for a lot of writes I'm not sure yet how to make it efficient.
As for backups, the simplest approach would be to set up a cron job which dumps the DuckDB file and uploads the Parquet output to S3.
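For illustration, such a dump could be as small as one or two DuckDB statements; the table, directory and file names below are made up, and the actual S3 upload would be a separate step in the same cron job:

```sql
-- Export every table of the open database as Parquet files into a local directory;
-- the cron job could then sync that directory to S3 (e.g. with `aws s3 sync`).
EXPORT DATABASE 'backups/2024-01-01' (FORMAT PARQUET);

-- Or dump a single table (the `requests` table name is an assumption, not from the project):
COPY requests TO 'backups/requests_2024-01-01.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
```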
👋 @lalabuy948
> but for a lot of writes I'm not sure yet how to make it efficient.
I'm curious about what specific efficiency concerns you're facing. Do you still have the PoC code available? I'd love to take a look and experiment with it!
Regarding DuckDB, I'm not entirely sure if it supports streaming writes like this:
```sql
COPY (stdin) TO 'phoenix_analytics_part_{YYYYMMDD}_{UUID}.parquet.lz4' (FORMAT PARQUET, COMPRESSION LZ4_RAW)
```
However, since Parquet supports writing in row groups, it should technically be feasible, as it would resemble how ClickHouse handles INSERT operations. If DuckDB doesn't support it directly, there might be other options, such as using libraries like https://docs.rs/parquet/latest/parquet/ or https://github.com/jorgecarleitao/parquet2 or even a native Elixir implementation :)
Another possibility could be writing to a CSV buffer or a temporary table, and then using DuckDB's COPY to convert it to Parquet.
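A minimal sketch of the temporary-table variant, with all table and file names made up: batched inserts land in a staging table, which a periodic job flushes into a Parquet part file and then clears.

```sql
-- Staging table with the same schema as the (hypothetical) main `requests` table.
CREATE TABLE requests_buffer AS SELECT * FROM requests LIMIT 0;

-- Writes accumulate here:
-- INSERT INTO requests_buffer VALUES (...);

-- Periodic flush: convert the buffered rows into a Parquet part file, then clear the buffer.
COPY requests_buffer TO 'parts/requests_20240101_0001.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
DELETE FROM requests_buffer;
```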
👋🏻 @ruslandoga, there is DuckDB support for httpfs; you can take a look at some sketches I saved in priv/repo/duck_s3.exs. You can research more here.
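For reference, a minimal httpfs round trip looks roughly like this; the bucket, region and credentials below are placeholders, not from the project:

```sql
INSTALL httpfs;
LOAD httpfs;

-- Region and credentials are placeholders.
SET s3_region = 'us-east-1';
SET s3_access_key_id = '...';
SET s3_secret_access_key = '...';

-- Write a Parquet file straight to S3 (the `requests` table name is an assumption)...
COPY requests TO 's3://my-bucket/phoenix_analytics/requests.parquet' (FORMAT PARQUET);

-- ...and query it back. The wildcard here is what triggers the listing calls discussed below.
SELECT count(*) FROM read_parquet('s3://my-bucket/phoenix_analytics/*.parquet');
```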
Let me know which ideas you're going to try and we can work together on a solution! Feel free to reach me over email or on X / Telegram.
> Let me know which ideas you're going to try and we can work together on a solution!
I think I'd like more control over catalogs and uploads. I think DuckDB's httpfs tries to do a ListObjectsV2 on each read_parquet from S3 if the path has any wildcards, and that can get expensive. And Hive-style partitioning might be suboptimal (from my limited experience).

So I would probably try writing Parquet to disk (possibly not through DuckDB) and uploading it to S3 from Elixir, e.g. every ~20 minutes, which would keep the number of PUTs around the free tier. A single process could do that. Since it would have the full view of which files are available and where (S3 or disk), it could also act as a table catalog with some support for predicate pushdown: e.g. when the user wants to see analytics for a site for the last month, the catalog hands DuckDB only the last month's worth of that site's Parquet files. That process could also perform compaction (e.g. compact the previous day's Parquet files into a single one), since it could hide incomplete parts from DuckDB.
That's what I'd like to try :)
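As a rough illustration of the catalog idea (all paths and column names below are invented): instead of handing DuckDB a wildcard, the catalog process resolves the user's filter to an explicit list of part files, wherever they currently live, and passes only those to read_parquet.

```sql
-- The catalog has already narrowed "site 42, last month" down to these parts,
-- some still on local disk and some already uploaded to S3 (paths are made up).
SELECT date_trunc('day', inserted_at) AS day, count(*) AS hits
FROM read_parquet([
    'parts/site_42/2024-05-30_0001.parquet',
    'parts/site_42/2024-05-31_0001.parquet',
    's3://my-bucket/parts/site_42/2024-05_compacted.parquet'
])
GROUP BY 1
ORDER BY 1;
```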
👋
This is just a question, not a feature request or issue or anything of the sort :)
I'm just starting to learn DuckDB and I wonder if it's a lot of work to make PhoenixAnalytics work with Parquet files? The upside is that they can be easily backed up to S3 (immutable), are quite storage-efficient (40% smaller according to https://benchmark.clickhouse.com/) and almost as fast (60% slower according to https://benchmark.clickhouse.com/) as DuckDB's custom storage format, and have zero load time. The downside is that they might require compaction, but at the same time that enables TTL.
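To make the compaction/TTL point concrete, here is a sketch with an invented file layout: small part files get merged into one file per day, and TTL then reduces to deleting files that fall outside the retention window.

```sql
-- Merge one day's small part files into a single compacted file...
COPY (SELECT * FROM read_parquet('parts/2024-01-01/*.parquet'))
    TO 'compacted/2024-01-01.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
-- ...after which the small parts can be deleted, and TTL amounts to
-- removing compacted files older than the retention period.
```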
Or maybe there are other ways to stream / backup DuckDB to S3, like in https://motherduck.com/blog/differential-storage-building-block-for-data-warehouse/ or https://litestream.io?