learningequality / morango

Pure Python sqlite-based Django DB replication engine
MIT License
14 stars 22 forks source link

Instance ID changes too frequently #79

Closed jamalex closed 4 years ago

jamalex commented 4 years ago

We currently calculate the instance_id (InstanceIDModel) with the utmost priority placed on it being as unique as possible. This is because having two devices with the same instance ID generating data in parallel will cause problems if their respective data is ever synced to the same place.

We currently do this by incorporating as many sources of potential "device uniqueness" as possible into a big hash, in hopes that no two devices will have the same combination of all these values: https://github.com/learningequality/morango/blob/5da47d241aeb7c48deed681a54589f62bdd28ede/morango/models/core.py#L118-L145

This has served Morango's purposes well. The instance_id can change on moderately frequent intervals (e.g. if Python or kernel is upgraded), but that has no ill effects other than very nominal increases in database size. However, we've also started leaning on the instance_id for other purposes that would do better with more instance_id stability over time, including:

jamalex commented 4 years ago

For this reason, we're proposing updating the calculation of instance_id to use the following flow:

First, calculate the instance_id based on the old method. If it matches the current instance_id, keep using that. This will avoid unnecessary churn if the id wasn't going to change anyway.

Otherwise, to calculate a "new" instance_id:

A. Calculate a system identifier, in the following order of preference:

B. Perform a uuid5 of the database_id and the system ID calculated above, to get the instance_id. This ensures that multiple parallel installations (with distinct databases) on the same physical machine will not be given the same instance_id, and avoid issues if the above system IDs somehow collided (e.g. across machines that were badly made and given identical serial numbers/MACs, etc) -- as long as the hard drives weren't also cloned without resetting the database_id.

jamalex commented 4 years ago

It may still make sense to include both the system IDs and the MACs in the hash, as those should in normal cases never (or very rarely) change. The main sources of unwanted frequent changes were coming from Python/kernel upgrades.

May also want to have two different environment variables -- one (probably keeping MORANGO_SYSTEM_ID) that just adds some extra pieces to the hash, to avoid collisions, and another that overrides everything except database_id, to cause collisions (e.g. for a multi-server/single-DB cloud deployment like Cloud Kolibri or KDP).

jamalex commented 4 years ago

image

As part of this, we should make InstanceIDModel stuff more robust to duplicate instance ID results.

jamalex commented 4 years ago

It may still make sense to include both the system IDs and the MACs in the hash, as those should in normal cases never (or very rarely) change.

This is the way I ended up going in https://github.com/learningequality/morango/pull/83