The current database (RethinkDB) and the way family-check reports are stored is probably sub-optimal:
Case 1: The database, as of now, shows in the RethinkDB Administration Console 10.4 GB of disk usage, and we do not run the checks very regularly:
We could maybe optimize this, by storing the text-content in an compressed format and by trying to optimize for duplicate text, e.g. by line in an relational/SQL (instead of document database)
Case 2: The query to filter the dashboard results by check-ids is slow. A good schema in an relational/SQL database should make this much faster. It's slow currently, because we use a regex-based string comparison within the family_test documents for each test-entry (which can be thousands per family-report) to match the selected tests. However, a similar optimization may be possible with RethinkDB, using an extra table and optimized indexes.
Case 3: We need a lot of pods (4) to scale the DB-write capacity when doing the full collection run with many parallel workers. An older/more seasoned DB project may be better optimized (see Case 4).
Everything is an order of magnitude more efficient using PostgreSQL than it was with RethinkDB.
It is much easier to do exploratory queries of our data using PostgreSQL than it was with RethinkDB. PostgreSQL is much more expressive than ReQL, has a massive number of built-in functions, so we are making much better use of our data. With RethinkDB, often we just ended up greping through the latest database dump.
PostgreSQL is “statically typed”, whereas RethinkDB had no type or schema enforcement at all; explicit clear typing improved the quality and robustness of our application.
We are saving $800 month (!), due to reduced CPU and disk space requirements.
I had no clue that RethinkDB would be Apache licensed in February 2017.
The current database (RethinkDB) and the way family-check reports are stored is probably sub-optimal:
Case 1: The database, as of now, shows in the RethinkDB Administration Console 10.4 GB of disk usage, and we do not run the checks very regularly:
We could maybe optimize this, by storing the text-content in an compressed format and by trying to optimize for duplicate text, e.g. by line in an relational/SQL (instead of document database)
Case 2: The query to filter the dashboard results by check-ids is slow. A good schema in an relational/SQL database should make this much faster. It's slow currently, because we use a regex-based string comparison within the family_test documents for each test-entry (which can be thousands per family-report) to match the selected tests. However, a similar optimization may be possible with RethinkDB, using an extra table and optimized indexes.
Case 3: We need a lot of pods (4) to scale the DB-write capacity when doing the full collection run with many parallel workers. An older/more seasoned DB project may be better optimized (see Case 4).
Case 4: There's a compelling story of porting a RethinkDB application to PostgreSQL: http://blog.sagemath.com/2017/02/09/rethinkdb-vs-postgres.html
money quote: