Sharding is a fairly complex and involved process that should be done as the last resort after all other optimisation methods are exhausted. Easy optimisations we haven't yet done are for instance adding MongoDB indices. Sharding affects the infrastructure, database collections and might even require changes to some of the MongoDB queries we do. As part of implementing sharding we need to analyse with what kind of filtering parameters each collection is queried so that we ensure all documents that are queried together always stay in just one bucket. Additionally, we also need to measure and monitor the data distribution between shards. Once you shard OpenCRVS you cannot un-shard OpenCRVS so this increases complexity considerably. From a core perspective even unsharded OpenCRVS releases need to be maintained as if they could be configured to be sharded. It is a major commitment.
Infrastructure
Required nodes:
config node * number of replicas for it
"Config servers store the metadata for a sharded cluster. The metadata reflects state and organization for all data and components within the sharded cluster. The metadata includes the list of chunks on every shard and the ranges that define the chunks." (source)
Generally speaking the chosen shard key per collection should reflect the most common queries we make to it. If for instance documents are often fetched based on the creator, the shard key should be the author id.
Indices for all shard keys must be created before collections are sharded.
user id or a more specific event type would be ideal, but none of those are available as a static key in the document
Composition: id with hashed sharding
better alternative would be to see if we can add additional fields to the document and have a creator id there for instance
DocumentReference: id with hashed sharding
Encounter: id with hashed sharding
location reference would work great, but it's an array so it can't be used as an shard key (source)
Location: is location worth sharding? the number of documents is fairly static, right?
Observation: context.reference
Point to an Encounter resource. All observations of the same encounters would thus be stored on the same shard.
Patients: id with hashed sharding
Again, there is no obvious choice for a shard key. It's for some parts because of the relational structure of the database model where not all resources have references to a parent resource
Patient_history: id with hashed sharding
id refers to Patient, so this is ok
Practitioner: id with hashed sharding
PractitionerRole: practitioner.reference
Task: focus.reference
Task_history: focus.reference
Tasks
Create a sharded MongoDB setup locally (example) and configure Hearth to use it.
Configure shards manually or with a docker initialisation script for local environment
Use application, insert new data and analyze the data distributing between shards, make modifications to shard configuration if needed
Create the required infrastructure
Amend our Ansible playbooks & docker compose file to spin up right amount of MongoDB containers on correct nodes
Modify the current MongoDB initialisation script to include configuring sharding for all collections
Verity that shard monitoring / node disk space monitoring is present in our current monitoring setup
Summary
Sharding is a fairly complex and involved process that should be done as the last resort after all other optimisation methods are exhausted. Easy optimisations we haven't yet done are for instance adding MongoDB indices. Sharding affects the infrastructure, database collections and might even require changes to some of the MongoDB queries we do. As part of implementing sharding we need to analyse with what kind of filtering parameters each collection is queried so that we ensure all documents that are queried together always stay in just one bucket. Additionally, we also need to measure and monitor the data distribution between shards. Once you shard OpenCRVS you cannot un-shard OpenCRVS so this increases complexity considerably. From a core perspective even unsharded OpenCRVS releases need to be maintained as if they could be configured to be sharded. It is a major commitment.
Infrastructure
Required nodes:
Minimum infra setup: 5 nodes
Shard keys per collection
Find an appropriate sharding key
Generally speaking the chosen shard key per collection should reflect the most common queries we make to it. If for instance documents are often fetched based on the creator, the shard key should be the author id.
Indices for all shard keys must be created before collections are sharded.
Tasks
Sources