spacemeshos / go-spacemesh

Go Implementation of the Spacemesh protocol full node. 💾⏰💪
https://spacemesh.io
MIT License

Better database schema handling #6026

Open ivan4th opened 5 months ago

ivan4th commented 5 months ago

Description

We need a better database schema handling mechanism as there are a few problems with the current one.

No place to look for the schema in the source

Right now, even if the database is a new one, its schema is pieced together by running a number of migrations from sql/migrations/{local,state}. The resulting full schema can be inspected by running the sqlite3 ... .schema command on a database, but that's not very convenient, and schema evolution across different go-spacemesh versions can't be easily seen by inspecting the history of a schema file.

Schema drift

If the database schema is changed in an unexpected way for whatever reason, this may go undetected and cause hard-to-debug issues. Suppose a user has run a go-spacemesh version built from develop that added a migration, and then switched to a later stable version which also has this migration, but in a newer variant that adds an index. PRAGMA user_version tells go-spacemesh the database schema is up to date, yet the index is actually missing. This leads to a substantial performance degradation, but since we don't know that the user has used an unreleased version, we might spend a lot of time trying to understand why this particular go-sm instance is THAT slow. Granted, we strongly recommend against using unreleased versions on normal nodes, yet the user might have done that by mistake and/or forgotten about it.
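
To illustrate the failure mode, here is a minimal, self-contained sketch using the standard database/sql package with the mattn/go-sqlite3 driver (chosen for the example only; it is not the node's own SQLite layer), and with made-up file and index names:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // driver chosen for the example only
)

func main() {
	db, err := sql.Open("sqlite3", "state.sql") // hypothetical database path
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// PRAGMA user_version reports the schema as up to date, because both the
	// develop and the released variants of the migration bump it to the same number.
	var version int
	if err := db.QueryRow("PRAGMA user_version").Scan(&version); err != nil {
		log.Fatal(err)
	}
	fmt.Println("user_version:", version)

	// ...yet the index added only by the released variant may be missing.
	var n int
	err = db.QueryRow(
		"SELECT count(*) FROM sqlite_master WHERE type = 'index' AND name = ?",
		"atxs_some_index", // hypothetical index name
	).Scan(&n)
	if err != nil {
		log.Fatal(err)
	}
	if n == 0 {
		fmt.Println("schema drift: index missing despite matching user_version")
	}
}
```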

Package inconsistency

The sql package is used for the state database and the sql/localsql package for the local database. The local database type localsql.Database builds on sql.Database, while the state database uses sql.Database directly.

Inconvenient coded migrations

When a migration has to be implemented as Go code, it cannot be placed in the sql or sql/localsql package and thus cannot be picked up automatically by the {sql,localsql}.{Open,InMemory} functions that open a database: coded migrations use the sql package for accessing the database, along with other packages that might import sql, so having Open / InMemory use them would cause a cyclic dependency. The places in the code where the databases are opened, such as node.go, the tests, and the merge tool, have to import the migrations and add them via options to the Open / InMemory calls. It would be much better if the code that uses the database could avoid concerns about internal DB schema workings such as particular migrations.
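
The current wiring looks roughly like the sketch below; the migrations package path, the option name and the accessor (localmigrations, WithMigrations, Coded) are placeholders for illustration, not the actual go-spacemesh identifiers:

```go
package node // placeholder package for the sketch

import (
	"github.com/spacemeshos/go-spacemesh/sql"
	"github.com/spacemeshos/go-spacemesh/sql/localsql"
	localmigrations "github.com/spacemeshos/go-spacemesh/sql/localsql/migrations" // placeholder path
)

func openLocalDB(path string) (*localsql.Database, error) {
	// The caller (node.go, a test, the merge tool) has to know about the coded
	// migrations and pass them in, because localsql itself cannot import them
	// without creating an import cycle.
	return localsql.Open(path, sql.WithMigrations(localmigrations.Coded()))
}
```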

Suggested implementation and its problems

The suggested implementation of the new database schema handling is in #6003. Additional description (README): https://github.com/spacemeshos/go-spacemesh/blob/aada7c61b5e61505404bbd4cfdd7b4deeb7ca45a/README.md#handling-database-schema-changes

The crux of the new approach is that each database gets a schema.sql file in the source tree, which is used both to initialize fresh databases and to detect schema drift, with migrations applied only to existing databases (but please still read the descriptions linked above before proceeding to the following sections).

There are, however, a number of possible problems with PR #6003, which are discussed below. I try to provide rationales for the existing decisions and also describe what may still be missing.

Golden test approach

One possible approach to maintaining the schema.sql files would be a dedicated tool run as part of the CI pipeline, alongside the linter and the checks for source formatting and generated Go files. That sounds cleaner, but on the other hand it would require building an extra tool that has to be invoked separately when working on code changes that involve database migrations. In my opinion, that would add some hassle to the development workflow.

Because of that, I decided to cut some corners and have a test that verifies the consistency of the schema of each database. Moreover, if there are any differences between the schema created by running all the migrations and the schema stored in the schema.sql file, the test produces schema.sql.updated, which can be used to "accept" the changes easily (by replacing schema.sql with it). This is essentially the golden test approach; the difference is that the schema.sql file is used not just for testing, but also for the initialization of fresh databases.
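
A simplified sketch of that test; the helper functions and the schema file path are assumptions for illustration, not the PR's actual code:

```go
package statesql_test

import (
	"os"
	"testing"

	"github.com/stretchr/testify/require"
)

// openWithMigrations and dumpSchema stand in for the real logic: open a fresh
// in-memory database, run every migration, and dump a normalized schema text.
func openWithMigrations(t *testing.T) any    { panic("sketch only") }
func dumpSchema(t *testing.T, db any) string { panic("sketch only") }

func TestSchemaMatchesGoldenFile(t *testing.T) {
	db := openWithMigrations(t)
	actual := dumpSchema(t, db)

	golden, err := os.ReadFile("schema.sql")
	require.NoError(t, err)

	if actual != string(golden) {
		// Write the updated schema next to the golden file so a developer can
		// review it and "accept" the change by replacing schema.sql with it.
		require.NoError(t, os.WriteFile("schema.sql.updated", []byte(actual), 0o644))
		t.Fatal("schema.sql is out of date; see schema.sql.updated")
	}
}
```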

I'm not 100% sure that the golden test approach involving a Go test is the best. Maybe we should just have the schema tool, add make check-db-schema and make update-db-schema commands that run the tool under the hood, and have the CI run make check-db-schema.

Avoiding code duplication in the tests

Each of the sql/statesql and sql/localsql packages needs a pair of tests that verify the same things, but against different schemas:

  1. Idempotent migrations: reopening a database after migrations have been run on it should not cause any schema changes.
  2. Schema consistency: after a fresh database is opened, its schema exactly matches the schema in schema.sql; the same holds when migrations are forced to run on the fresh database instead of using the schema.sql script.

The corresponding tests in sql/localsql and sql/statesql are merely stubs that call the actual test code in the sql/test package. This arguably hinders readability, and in other circumstances it would probably make sense to inline the tests in sql/localsql and sql/statesql and remove the sql/test package. In this case, though, the tests double as a tool for schema consistency verification due to the aforementioned "corner cutting", with the schema tests serving as the "schema tool".

So: not just tests, but the schema tool. It would be really undesirable for the schema tool implementation to drift apart between the sql/statesql and sql/localsql packages. Arguably, one could say that the mostly duplicated InMemory / Open / Schema functions in sql/{statesql,localsql} don't make sense either, but that's not really the case: those are supposed to be allowed to drift apart, e.g. with respect to the set of migrations and the connection pool size, whereas the schema tests verify the fundamental schema properties which should always be the same for both sql/statesql and sql/localsql. Should sql/{statesql,localsql} need tests that verify things that differ between the databases, these can be added to the packages in addition to the stubs that use sql/test.
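
For illustration, the stub in sql/statesql could look roughly like this; the shared entry point name and its parameters are assumed, not taken from the PR:

```go
package statesql_test

import (
	"testing"

	"github.com/spacemeshos/go-spacemesh/sql/statesql"
	sqltest "github.com/spacemeshos/go-spacemesh/sql/test"
)

// The stub only wires the state database into the shared checks; the actual
// verification logic (idempotent migrations, schema consistency) lives in
// sql/test and is identical for both databases.
func TestDatabaseSchema(t *testing.T) {
	sqltest.VerifySchema(t, statesql.Schema) // VerifySchema is an assumed name
}
```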

Circular dependencies in coded migrations not fully fixed

One of the issues with the existing database schema handling mechanism is that coded migrations placed in the sql and sql/localsql packages may cause circular dependencies. With the new mechanism, the situation is somewhat improved, but it is still problematic. All the code that previously used *sql.Database is switched to using *statesql.Database, and the local database code still uses *localsql.Database. If a migration imports some package that happens to import sql/{statesql,localsql}, we still get a circular dependency.

One way around it is making sql.Database an interface (instead of a struct) that includes sql.Executor but also provides other methods: Tx, WithTx, WithTxImmediate, QueryCount, QueryCache and Close. Using this interface in all non-test files instead of *statesql.Database and *localsql.Database resolves the circular dependency, as the coded migrations in sql/{statesql,localsql} will be able to import packages that only use the sql package. Tests in packages that use the sql.Database interface may still import sql/{statesql,localsql} to use their InMemory / Open / Schema functions, as this does not constitute a circular dependency (importing a package does not pull in its tests).
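
A rough sketch of such an interface; the method signatures and the Transaction / QueryCache types are guesses standing in for the corresponding existing types on *sql.Database, not the actual API:

```go
package sql

import "context"

// Database is the proposed interface; Executor, Transaction and QueryCache
// stand for the corresponding existing types in the sql package.
type Database interface {
	Executor

	Tx(ctx context.Context) (Transaction, error)
	WithTx(ctx context.Context, exec func(Transaction) error) error
	WithTxImmediate(ctx context.Context, exec func(Transaction) error) error
	QueryCount() int
	QueryCache() QueryCache
	Close() error
}
```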

The problem with such an improvement is that it loses the type-based distinction between *statesql.Database and *localsql.Database. This can be mitigated by having sql.StateDatabase and sql.LocalDatabase interfaces that differ in some dummy method, e.g. IsStateDB() bool vs IsLocalDB() bool. This sounds somewhat hacky and I'm not sure it's the way to go; on the other hand, the errors caused by mistakenly using *localsql.Database instead of *statesql.Database and vice versa are usually very easily caught by tests, so the value of these additional interfaces can be disputed.
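
A minimal sketch of the two marker interfaces described above (purely illustrative):

```go
// StateDatabase and LocalDatabase only differ in a dummy marker method, which
// preserves the type-level distinction between the two database kinds.
type StateDatabase interface {
	Database
	IsStateDB() bool
}

type LocalDatabase interface {
	Database
	IsLocalDB() bool
}
```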

Allowing "some" schema drift

Some databases may contain artifacts of additional tooling, such as the _litestream* tables from Litestream SQLite replication. Having a way to ignore these is useful for infra maintenance purposes, so I think a main.db-schema-ignore-rx setting for ignoring certain tables and indices when checking for schema drift does make sense. I think a single regexp is enough, as it can combine multiple name prefixes with |.
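
For example (a self-contained snippet; the second prefix is made up for illustration):

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// A single regexp can cover several tooling prefixes by joining them with "|".
	ignoreRx := regexp.MustCompile(`^_litestream_|^my_backup_tool_`)
	for _, name := range []string{"_litestream_seq", "atxs", "my_backup_tool_meta"} {
		fmt.Printf("%-20s ignored: %v\n", name, ignoreRx.MatchString(name))
	}
}
```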

Failing upon schema drift by default

An open question is whether go-spacemesh should fail when schema drift is detected, whether this behavior should be configurable, and, if it is, whether failing should be the default.

What I'm afraid of is that if the schema drift happens due to some table added by accident to a quicksync db, or some other unanticipated effect like this, we could get a lot of complaints about Smapp failures because of schema drift, with reasons that may be very confusing to the end user. On the other hand, if we just log a warning upon schema drift (as is currently done in #6003), we can still spot the schema drift warning easily when we look at the go-spacemesh logs.

Programmers' docs for database schema handling

README.md already contains some information on the development environment, such as the various make commands, running the tests, and setting up the build environment. I've added a section on handling DB schema changes to README.md, but it seems somewhat out of place there. CONTRIBUTING.md also doesn't seem very well suited for such detailed descriptions.

Should we have a separate HACKING.md or something like that?

Printing schema diffs

It is argued that in the schema consistency test we could rely on plain require.Equal for printing the schema diff instead of using the go-cmp-based schema differ that is also used to generate schema drift warnings. While such an approach would probably simplify the test a bit, it would also print the full schema.sql script (original and updated) as a big unreadable wall of escaped text (\n instead of newlines) before the actual diff. In my opinion, that would reduce the usability of the schema consistency test by cluttering the terminal with too much unusable output, and thus the trade-off is not worth it.

fasmat commented 5 months ago

Regarding Avoiding code duplication in the tests:

I don't see any reason to have a generic test for two different packages/implementations. Either we should make the package itself generic and have a single test that covers the general case (and is not itself generic), or we should have two packages with two separate test implementations.

I don't see an issue in having duplicated code in tests - quite the contrary. I think we should avoid DRY-ing up tests, because each test should stand on its own as much as possible (apart from a general setup/teardown approach). Making tests DRY quickly makes them hard to change and debug.

the mostly duplicated InMemory / Open / Schema functions in sql/{statesql,localsql} don't make sense either, but that's not really the case: those are supposed to be allowed to drift apart, e.g. with respect to the set of migrations and the connection pool size, whereas the schema tests verify the fundamental schema properties which should always be the same for both sql/statesql and sql/localsql.

This makes no sense to me - the implementation is the same but is allowed to differ in the future, yet the tests, which are also the same, must not diverge?

Regarding Allowing "some" schema drift:

Some databases may contain artifacts of additional tooling used, such as _litestream* table from Litestream SQLite replication.

I fear this being used by members of the community to bypass the exact thing we are trying to avoid: schema drift. If for some reason (manually editing the DB, using unreleased builds, etc.) the schema drifts, users will "fix" this by just using this configuration to ignore the tables that differ.

And if the default action upon a schema drift is to only log a warning - which in my opinion it should not be, since a drift can lead to data corruption - then there is no reason to allow users to get rid of that warning by ignoring tables that differ from the expected schema.

Regarding generating a new schema:

The currently suggested workflow is to execute a test that fails if the schema was updated and produces a file that then has to be renamed and committed. This is a bit clunky to me. We do have make generate, which internally calls go generate. A //go:generate sh -c "<command that builds the schema from the migration files and puts it where it belongs>" would better fit the existing workflow for dealing with generated files.

It would still serve as a check against unintended schema changes, since our CI checks whether make generate modifies any files.
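
For illustration only, the directive could look something like this; the tool path and flag are placeholders, not an existing command in the repo:

```go
//go:generate go run ./schemagen -output schema.sql
package statesql
```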

ivan4th commented 5 months ago

This makes no sense to me - the implementation is the same but is allowed to differ in the future, yet the tests, which are also the same, must not diverge?

The thing is, these are not just tests but essentially the schema tool that helps keep schema.sql in sync with the migrations. That must happen in exactly the same way for both localsql and statesql. I'd prefer the separate schema tool route over duplicating that code, so indeed I should probably update the PR to use the go:generate approach.

Re failing by default upon schema drift: I'd welcome @pigmej's opinion on that. As I described in the issue, the scenario I'm afraid of is some stray table in a quicksync db that breaks every Smapp installation. With schema drift producing just a warning, it's still useful for diagnostics, given that we usually request the application log from the users.

Re not allowing schema drift exceptions: there are a lot of advanced options in go-spacemesh that one can use as a footgun ;) I'd rather keep this particular option, as it can be useful under various circumstances. In different ops-related scenarios the need for such exceptions may arise, and the alternative (just ignoring the warning because some kind of schema drift is expected) may lead to missing unexpected schema drifts. We can add a warning to the docs that mention this option, stating explicitly that one needs to know what they're doing when changing its value.

pigmej commented 5 months ago

If Litestream is the issue, then we can get rid of it. We can disable it and/or clear the schema.

ivan4th commented 5 months ago

@pigmej litestream is easy to ignore. The problem is a bit more general: is it ok for go-spacemesh to stop immediately when schema drift is detected, unless this behavior is explicitly disabled in the config, or are our chances of messing this up a bit too high?

pigmej commented 5 months ago

Hard to say; if it can be disabled in the config then IMO it's ok to crash.

fasmat commented 5 months ago

I would also say we should return an error when opening a DB whose schema doesn't match (that error should gracefully shut down the node before it has even fully started) and allow disabling that behavior, in which case the node logs a warning that the schema is different and how it differs. In my opinion, there is no need for yet another config option to disable this warning in cases where a drift is expected/desired.

ivan4th commented 5 months ago

Ok, you convinced me :) So I'll make the following changes to the current implementation:

ivan4th commented 5 months ago

One thing that was not fully decided above is where to put the developer docs for schema handling, which I'd prefer to have somewhere (README.md doesn't seem to be a very appropriate place for that). Perhaps we should indeed add a HACKING.md for that purpose?

fasmat commented 5 months ago

One thing that was not fully decided above is where to put the developer docs for schema handling, which I'd prefer to have somewhere (README.md doesn't seem to be a very appropriate place for that). Perhaps we should indeed add a HACKING.md for that purpose?

Normally I would expect something like this in CONTRIBUTING.md, maybe under Code Guidelines or a similar section. That file, however, is a bit outdated 🤔