superfly / corrosion

Gossip-based service discovery (and more) for large distributed systems.
https://superfly.github.io/corrosion/
Apache License 2.0
684 stars 21 forks source link

SQL logic error when schemas out of sync #261

Open psviderski opened 5 days ago

psviderski commented 5 days ago

When DB schema is changed (e.g. a new column is added) on node A and not on node B, inserts/updates from node A using new schema fail to apply on node B:

2024-09-26T10:59:26.534155Z ERROR corro_agent::agent::handlers: could not process multiple changes: rusqlite: SQL logic error (actor_id: Some(ActorId(f1d3880f-c3b7-4cbb-b0df-6e72a4459517)), version: None)
2024-09-26T10:59:27.033472Z ERROR process_multiple_changes:process_single_version:process_complete_version{actor_id=ActorId(f1d3880f-c3b7-4cbb-b0df-6e72a4459517) versions=Version(4)..=Version(4)}: corro_agent::agent::util: error=SQL logic error
2024-09-26T10:59:27.033544Z ERROR process_multiple_changes:process_single_version: corro_agent::agent::util: error=SQL logic error
2024-09-26T10:59:27.033636Z ERROR process_multiple_changes: corro_agent::agent::util: error=rusqlite: SQL logic error (actor_id: Some(ActorId(f1d3880f-c3b7-4cbb-b0df-6e72a4459517)), version: None)

A similar error occurs when trying to process a change for a new table that doesn't exist on node B:

2024-09-26T11:07:32.499239Z ERROR corro_agent::agent::handlers: could not process multiple changes: rusqlite: crsql - could not find the schema information for table todos (actor_id: Some(ActorId(a424b2f7-a75c-49dd-9f8e-cf1d5c4f0fc5)), version: None)
2024-09-26T11:07:47.472566Z  INFO corro_agent::agent::handlers: synced 26 changes w/ f1d3880fc3b74cbbb0df6e72a4459517 in 0.007235708s @ 3593.290387063712 changes/s
2024-09-26T11:07:47.499287Z ERROR process_multiple_changes:process_single_version:process_complete_version{actor_id=ActorId(a424b2f7-a75c-49dd-9f8e-cf1d5c4f0fc5) versions=Version(1)..=Version(1)}: corro_agent::agent::util: error=crsql - could not find the schema information for table todos
2024-09-26T11:07:47.499360Z ERROR process_multiple_changes:process_single_version: corro_agent::agent::util: error=crsql - could not find the schema information for table todos
2024-09-26T11:07:47.499462Z ERROR process_multiple_changes: corro_agent::agent::util: error=rusqlite: crsql - could not find the schema information for table todos (actor_id: Some(ActorId(a424b2f7-a75c-49dd-9f8e-cf1d5c4f0fc5)), version: None)

This behavior seems reasonable to prevent data loss. However, it requires every node in the cluster to run the same schema version. To comply with that requirement, what should be the process for deploying schema changes to the cluster? One approach could be having another distributed table that would store the latest schema along with a separate process that would watch the latest schema and synchronize it to db.schema_paths followed by executing corrosion reload. Is there a better way to handle this? How do you manage schema changes at Fly?

jeromegn commented 5 days ago

That's a great question. Schema synchronization has been largely left to users.

At Fly: We have a repository containing Corrosion-related configuration that we deploy independently.

Usually this is a multi-step process, similar to a migration for a normal database:

If we do this properly, every node has the same schema and there are no errors. If there are no errors, that's ok because nodes will keep trying to synchronize versions they haven't been able to apply.

It's not perfect and a bit tedious... if you want to distribute the schema automatically via Corrosion, that sounds doable in the way you've described it, but a little scary sometimes. Automatic schema propagation was something we decided not to do to keep more explicit control of what happens to the schema.