googlefonts / fontbakery-dashboard

A library-scale web dashboard for Font Bakery, no longer developed
Apache License 2.0
28 stars 10 forks source link

Database review/optimization #93

Open graphicore opened 5 years ago

graphicore commented 5 years ago

The current database (RethinkDB) and the way family-check reports are stored is probably sub-optimal:

Case 1: The database, as of now, shows in the RethinkDB Administration Console 10.4 GB of disk usage, and we do not run the checks very regularly:

selection_215

We could maybe optimize this, by storing the text-content in an compressed format and by trying to optimize for duplicate text, e.g. by line in an relational/SQL (instead of document database)

Case 2: The query to filter the dashboard results by check-ids is slow. A good schema in an relational/SQL database should make this much faster. It's slow currently, because we use a regex-based string comparison within the family_test documents for each test-entry (which can be thousands per family-report) to match the selected tests. However, a similar optimization may be possible with RethinkDB, using an extra table and optimized indexes.

Case 3: We need a lot of pods (4) to scale the DB-write capacity when doing the full collection run with many parallel workers. An older/more seasoned DB project may be better optimized (see Case 4).

Case 4: There's a compelling story of porting a RethinkDB application to PostgreSQL: http://blog.sagemath.com/2017/02/09/rethinkdb-vs-postgres.html

money quote:

  • I’m very happy with the rewrite.
  • Everything is an order of magnitude more efficient using PostgreSQL than it was with RethinkDB.
  • It is much easier to do exploratory queries of our data using PostgreSQL than it was with RethinkDB. PostgreSQL is much more expressive than ReQL, has a massive number of built-in functions, so we are making much better use of our data. With RethinkDB, often we just ended up greping through the latest database dump.
  • PostgreSQL is “statically typed”, whereas RethinkDB had no type or schema enforcement at all; explicit clear typing improved the quality and robustness of our application.
  • We are saving $800 month (!), due to reduced CPU and disk space requirements.
  • I had no clue that RethinkDB would be Apache licensed in February 2017.