lalabuy948 / PhoenixAnalytics

📊 Plug and play analytics for Phoenix applications.
https://theindiestack.com/analytics
Apache License 2.0

Hosting DuckDB as separate resource? #8

Closed · jaimeiniesta closed this issue 3 weeks ago

jaimeiniesta commented 1 month ago

Hi, I'm excited about this project and want to try it, but I'm a newbie to DuckDB and I'm not sure how to set it up.

My application currently runs on Heroku, on a single dyno, but I don't see a Heroku add-on that provides DuckDB. There's just a buildpack to run it using Ruby.

What's the best approach to run my own DuckDB?

lalabuy948 commented 1 month ago

Hi @jaimeiniesta, treat DuckDB like SQLite: it's a disk-persistent database, which unfortunately isn't the best friend of Heroku users. As I recall, Heroku's filesystem is ephemeral and doesn't let you persist anything on disk. From the Heroku docs:

Disk backed storage

SQLite runs in memory, and backs up its data store in files on disk. While this strategy works well for development, Heroku's Cedar stack has an ephemeral filesystem. You can write to it, and you can read from it, but the contents will be cleared periodically. If you were to use SQLite on Heroku, you would lose your entire database at least once every 24 hours.

Even if Heroku's disks were persistent, running SQLite would still not be a good fit. Since SQLite does not run as a service, each dyno would run a separate copy, and each of these copies would need its own disk-backed store. This would mean that each dyno powering your app would have a different set of data, since the disks are not synchronized.

Instead of using SQLite on Heroku, you can configure your app to run on Postgres.
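
For context, DuckDB's persistence model is the same idea: the whole database lives in a single file on disk. A minimal sketch with the duckdbex library (the file name is just an example):

# Open (or create) a DuckDB database file on disk, much like a SQLite file.
{:ok, db} = Duckdbex.open("analytics.duckdb")
{:ok, conn} = Duckdbex.connection(db)

# Run a query against that connection and fetch the result rows.
{:ok, ref} = Duckdbex.query(conn, "SELECT 1;")
Duckdbex.fetch_all(ref)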

I will work on a solution for that scenario.

jaimeiniesta commented 1 month ago

Thanks for looking into alternatives!

Storing in Postgres isn't a big problem for me; I'm already using the app's main Postgres database for Oban and ErrorTracker, and everything works smoothly. I don't have many exceptions, but I certainly have hundreds of thousands of Oban jobs, so I think it can also handle the analytics data.

Or maybe an alternative would be a second, separate Postgres DB just for analytics.

Anyway, you've already investigated this, and I trust your decision to use DuckDB or ClickHouse. I'll stay tuned. Thanks!

lalabuy948 commented 1 month ago

My goal was to try something new and not affect the main transactional database. And since I'm a VPS guy rather than a Heroku/Fly user, I honestly forgot about that limitation. Now I see that a Postgres backend would solve most of the issues for such cases, and I will work on that for sure.

For now I've found a nicely working solution via the DuckDB httpfs extension, which allows persisting data on external S3, R2, GCS, and MinIO. I think there is an option on Heroku to add the Bucketeer add-on, which provides AWS S3.

# Count GET requests directly from a Parquet file stored on S3.
query = "SELECT count(*) FROM 's3://test/analytics.parquet' WHERE method = 'GET';"
{:ok, ref} = PhoenixAnalytics.Repo.execute_unsafe(query)
Duckdbex.fetch_all(ref) |> IO.inspect()
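
For this to work, the httpfs extension has to be installed and loaded, and S3 credentials configured, before the first query. A rough sketch using the same repo helper (region and keys are placeholders; DuckDB also supports CREATE SECRET for this):

# One-time session setup: load httpfs and point DuckDB at the bucket's credentials.
for stmt <- [
      "INSTALL httpfs;",
      "LOAD httpfs;",
      "SET s3_region = 'us-east-1';",
      "SET s3_access_key_id = 'YOUR_ACCESS_KEY';",
      "SET s3_secret_access_key = 'YOUR_SECRET_KEY';"
    ] do
  {:ok, _ref} = PhoenixAnalytics.Repo.execute_unsafe(stmt)
end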

1 million requests is about 56 MB of Parquet data, which is not that bad to store on S3, and you could split it by month/year depending on your site's traffic.
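
If the data is split that way, DuckDB can glob over the partitioned files in a single query. A hypothetical layout (the path scheme is made up for illustration):

# Scan every monthly Parquet file under a year=/month= directory layout.
query = """
SELECT count(*)
FROM read_parquet('s3://test/analytics/year=*/month=*/*.parquet', hive_partitioning = true)
WHERE method = 'GET';
"""
{:ok, ref} = PhoenixAnalytics.Repo.execute_unsafe(query)
Duckdbex.fetch_all(ref)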

lalabuy948 commented 1 month ago

I think I found another good solution for Postgres users. Since I'm using CQRS and writes/reads are completely separated, we can write requests to the user's Postgres (using the existing project repo) and then read the data via the DuckDB postgres extension, which will bring the best of both worlds.

ATTACH 'dbname=postgres user=postgres host=127.0.0.1' AS db (TYPE POSTGRES, READ_ONLY);
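
A sketch of what the read path could look like from Elixir (the connection string, the db alias, and the requests table name are all illustrative):

# Load the postgres extension and attach the transactional database read-only.
{:ok, _} = PhoenixAnalytics.Repo.execute_unsafe("INSTALL postgres;")
{:ok, _} = PhoenixAnalytics.Repo.execute_unsafe("LOAD postgres;")

{:ok, _} =
  PhoenixAnalytics.Repo.execute_unsafe(
    "ATTACH 'dbname=postgres user=postgres host=127.0.0.1' AS db (TYPE POSTGRES, READ_ONLY);"
  )

# Analytical reads now run in DuckDB against the attached Postgres tables.
{:ok, ref} =
  PhoenixAnalytics.Repo.execute_unsafe(
    "SELECT count(*) FROM db.public.requests WHERE method = 'GET';"
  )

Duckdbex.fetch_all(ref)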

For now, this is the one I like best.

jaimeiniesta commented 1 month ago

Sounds cool. Do you think it could be flexible enough to either use the same repo or have a separate repo dedicated to analytics?

lalabuy948 commented 1 month ago

I think it will use the same repo, as I plan to keep things small and simple.

lalabuy948 commented 3 weeks ago

Hi @jaimeiniesta, I'm going to close this issue for now; let me know if you face any difficulties with the 0.2.0 release. Let's keep in touch over email, I'm happy to assist with anything!