MongoDB sharding for refactoring Hearth for a huge country scale. e.g. Nigeria

rikukissa commented 2 years ago

Summary

Sharding is a fairly complex and involved process that should be done as the last resort after all other optimisation methods are exhausted. Easy optimisations we haven't yet done are for instance adding MongoDB indices. Sharding affects the infrastructure, database collections and might even require changes to some of the MongoDB queries we do. As part of implementing sharding we need to analyse with what kind of filtering parameters each collection is queried so that we ensure all documents that are queried together always stay in just one bucket. Additionally, we also need to measure and monitor the data distribution between shards. Once you shard OpenCRVS you cannot un-shard OpenCRVS so this increases complexity considerably. From a core perspective even unsharded OpenCRVS releases need to be maintained as if they could be configured to be sharded. It is a major commitment.

Infrastructure

Required nodes:

config node * number of replicas for it
- "Config servers store the metadata for a sharded cluster. The metadata reflects state and organization for all data and components within the sharded cluster. The metadata includes the list of chunks on every shard and the ranges that define the chunks." (source)
number of shards * number of replicas for each
router node (documentation)

Minimum infra setup: 5 nodes

2 config nodes
2 shards and 2 replicas
router node

Shard keys per collection

Find an appropriate sharding key

Generally speaking the chosen shard key per collection should reflect the most common queries we make to it. If for instance documents are often fetched based on the creator, the shard key should be the author id.

Indices for all shard keys must be created before collections are sharded.

AuditEvent: id with hashed sharding
- user id or a more specific event type would be ideal, but none of those are available as a static key in the document
Composition: id with hashed sharding
- better alternative would be to see if we can add additional fields to the document and have a creator id there for instance
DocumentReference: id with hashed sharding
Encounter: id with hashed sharding
- location reference would work great, but it's an array so it can't be used as an shard key (source)
Location: is location worth sharding? the number of documents is fairly static, right?
Observation: context.reference
- Point to an Encounter resource. All observations of the same encounters would thus be stored on the same shard.
Patients: id with hashed sharding
- Again, there is no obvious choice for a shard key. It's for some parts because of the relational structure of the database model where not all resources have references to a parent resource
Patient_history: id with hashed sharding
- id refers to Patient, so this is ok
Practitioner: id with hashed sharding
PractitionerRole: practitioner.reference
Task: focus.reference
Task_history: focus.reference

Tasks

Create a sharded MongoDB setup locally (example) and configure Hearth to use it.
Configure shards manually or with a docker initialisation script for local environment
Use application, insert new data and analyze the data distributing between shards, make modifications to shard configuration if needed
Create the required infrastructure
Amend our Ansible playbooks & docker compose file to spin up right amount of MongoDB containers on correct nodes
Modify the current MongoDB initialisation script to include configuring sharding for all collections
Verity that shard monitoring / node disk space monitoring is present in our current monitoring setup
- There should be support in Netdata for sharding metrics
  Sources
Convert a Replica Set to a Sharded Cluster

euanmillar commented 2 years ago

This is blocked until the Hearth replacement that includes indices is set up

rikukissa commented 10 months ago

@euanmillar ok if I close this ticket? I think it's not the right choice for us now or in the future

opencrvs / opencrvs-core