```bash
# on the centos hosts, so that matches my containers
# --- #
# To disable SELinux
sudo sed -i "s#^SELINUX=.*#SELINUX=disabled#g" /etc/sysconfig/selinux
# -- reboot host
#
# --- #
# 2 CentOS 7 servers as Config Server Replica Set
# --- #
[mongo-config-services]
10.0.15.31 configsvr1
10.0.15.32 configsvr2
# --- #
# 4 CentOS 7 servers as Shard Replica Sets
# --- #
[mongo-shards]
10.0.15.21 shardsvr1
10.0.15.22 shardsvr2
10.0.15.23 shardsvr3
10.0.15.24 shardsvr4
# --- #
# 1 CentOS 7 server as mongos/Query Router
# --- #
[mongo-query-router]
10.0.15.11 mongos
```
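After the reboot, a quick sanity check that SELinux really is disabled on every host could look like this (host aliases taken from the inventory above, SSH access assumed):

```bash
# expect "Disabled" on every host
for host in configsvr1 configsvr2 shardsvr1 shardsvr2 shardsvr3 shardsvr4 mongos; do
  echo -n "${host}: "
  ssh "${host}" getenforce
done
```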
* Linux user etc. in the container: we'll see that later. We test as root first to rule out any permission issue, then we'll lower permissions as far as possible and use non-root users everywhere, even if that means installing `sudo` and editing a custom `/etc/sudoers.d/mongo.sudoers` configuration.
* Each server is connected to the other servers.
So alright, we have a basic K8S resource definition layout here, plus I believe we need a StatefulSet for each service (query router, config server, and shards):
```bash
# mongo config server service
config-server/mongo-config-server-service.yaml
config-server/mongo-config-server-statefulset.yaml
# mongo shards service
shards/mongo-shards-service.yaml
shards/mongo-shards-statefulset.yaml
# mongo query router service
router/mongo-query-router-service.yaml
router/mongo-query-router-statefulset.yaml
```

`mongo-query-router-service.yaml`:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: mongo-qrouter
  labels:
    app: mongo-qrouter
spec:
  publishNotReadyAddresses: true
  clusterIP: None
  ports:
    - port: 27017
      name: mongo-qrouter
  selector:
    app: mongo-qrouter
```
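Because this is a headless Service (`clusterIP: None`) governing the StatefulSet, every query router pod gets a stable DNS name of the form `<pod>.<service>.<namespace>.svc.cluster.local`. A small sketch of reaching one specific mongos from inside the cluster (the throwaway client image is an assumption):

```bash
# ping the first query router replica by its stable DNS name
kubectl run mongo-client --rm -it --image=mongo:4.2 --restart=Never -- \
  mongo --host mongo-qrouter-0.mongo-qrouter.default.svc.cluster.local --port 27017 \
  --eval 'db.adminCommand({ ping: 1 })'
```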
`mongo-query-router-statefulset.yaml`:
```yaml
kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: mongo-qrouter
  namespace: default
spec:
  replicas: 4
  selector:
    matchLabels:
      app: mongo-qrouter
  template:
    metadata:
      labels:
        app: mongo-qrouter
    spec:
      containers:
        - name: mongo-qrouter
          image: 'pegasusio/mongo-qrouter:0.0.1'
          args:
            - mongos
            - '--config-opt-1'
            - 'valueConfOpt1'
            # [...] and as many args as needed to start the mongos query router process properly
          ports:
            - containerPort: 27017
              protocol: TCP
          env:
            - name: MONGO_ENV_VAR_1
              value: mongorocks
            # use hashicorp vault secret management here, from
            # kubernetes secret manager integration
            - name: MONGO_SECRET
              value: SD-6sfdsfggsTHGdfggsdfgjsd
          resources: {}
          volumeMounts:
            - name: data
              mountPath: /mongo-qrouter-data
          # mongos does not expose an HTTP endpoint on 27017, so probe the TCP socket instead
          livenessProbe:
            tcpSocket:
              port: 27017
            initialDelaySeconds: 120
            timeoutSeconds: 1
            periodSeconds: 20
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            tcpSocket:
              port: 27017
            initialDelaySeconds: 120
            timeoutSeconds: 1
            periodSeconds: 20
            successThreshold: 1
            failureThreshold: 3
          # terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
      schedulerName: default-scheduler
  # Volume Claims will be created when scaling out with this stateful set, based on
  # this template (10 GiB each replica); the claim name must match the volumeMount above:
  volumeClaimTemplates:
    - kind: PersistentVolumeClaim
      apiVersion: v1
      metadata:
        name: data
        creationTimestamp: null
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
        volumeMode: Filesystem
      status:
        phase: Pending
  serviceName: mongo-qrouter
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
status:
  replicas: 4
```
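Once both router manifests are written, a minimal apply-and-check sequence could look like this (file paths follow the layout above):

```bash
kubectl apply -f router/mongo-query-router-service.yaml
kubectl apply -f router/mongo-query-router-statefulset.yaml
# the 4 replicas should come up as mongo-qrouter-0 .. mongo-qrouter-3
kubectl get pods -l app=mongo-qrouter -w
# one claim per replica, named data-mongo-qrouter-<ordinal>
kubectl get pvc
```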
`mongo-shards-service.yaml`:
```yaml
kind: Service
apiVersion: v1
metadata:
  name: mongo-shards
  namespace: default
spec:
  ports:
    - protocol: TCP
      port: 27017
      targetPort: 27017
      # nodePort must fall within the default 30000-32767 range
      nodePort: 31017
  selector:
    app: mongo-shards
  type: LoadBalancer
  sessionAffinity: None
  externalTrafficPolicy: Cluster
status:
  loadBalancer: {}
```
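Since this Service is `type: LoadBalancer` with an explicit `nodePort`, the shard members are also reachable from outside the cluster; a quick connectivity check (assuming the 31017 nodePort from the manifest and a local `mongo` shell). In a sharded cluster clients normally go through mongos, so this external exposure is mostly useful for debugging:

```bash
# reach a shard member through any worker node's nodePort
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
mongo --host "${NODE_IP}" --port 31017 --eval 'db.adminCommand({ ping: 1 })'
```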
`mongo-shards-statefulset.yaml`:
```yaml
kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: mongo-shards
  namespace: default
spec:
  replicas: 4
  selector:
    matchLabels:
      app: mongo-shards
  template:
    metadata:
      labels:
        app: mongo-shards
    spec:
      containers:
        - name: mongo-shards
          image: 'pegasusio/mongo-shards:0.0.1'
          args:
            - mongod
            - '--config-opt-1'
            - 'valueConfOpt1'
            # [...] and as many args as needed to start the mongod shard process properly
          ports:
            - containerPort: 27017
              protocol: TCP
          env:
            - name: MONGO_ENV_VAR_1
              value: mongorocks
            # use hashicorp vault secret management here, from
            # kubernetes secret manager integration
            - name: MONGO_SECRET
              value: SD-6sfdsfggsTHGdfggsdfgjsd
          resources: {}
          volumeMounts:
            - name: data
              mountPath: /mongo-shards-data
          # mongod does not expose an HTTP endpoint on 27017, so probe the TCP socket instead
          livenessProbe:
            tcpSocket:
              port: 27017
            initialDelaySeconds: 120
            timeoutSeconds: 1
            periodSeconds: 20
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            tcpSocket:
              port: 27017
            initialDelaySeconds: 120
            timeoutSeconds: 1
            periodSeconds: 20
            successThreshold: 1
            failureThreshold: 3
          # terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
      schedulerName: default-scheduler
  # Volume Claims will be created when scaling out shards with this stateful set, based on
  # this template (10 GiB each shard); the claim name must match the volumeMount above:
  volumeClaimTemplates:
    - kind: PersistentVolumeClaim
      apiVersion: v1
      metadata:
        name: data
        creationTimestamp: null
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
        volumeMode: Filesystem
      status:
        phase: Pending
  serviceName: mongo-shards
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
status:
  replicas: 4
```
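What the manifests above don't do yet is wire everything into one sharded cluster: each shard replica set has to be initiated and then registered on a mongos. A rough sketch, assuming the images ship the `mongo` shell; note that the per-pod DNS names used here require a headless governing service for the shards (with the `LoadBalancer` service above you would use pod IPs instead):

```bash
# initiate a (single-member, for brevity) replica set on the first shard pod
kubectl exec mongo-shards-0 -- mongo --eval \
  'rs.initiate({ _id: "shard0", members: [{ _id: 0, host: "mongo-shards-0.mongo-shards.default.svc.cluster.local:27017" }] })'

# register the shard on a query router and check the cluster layout
kubectl exec mongo-qrouter-0 -- mongo --eval \
  'sh.addShard("shard0/mongo-shards-0.mongo-shards.default.svc.cluster.local:27017")'
kubectl exec mongo-qrouter-0 -- mongo --eval 'sh.status()'
```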
The same service + StatefulSet pattern applies to `mongo-config-server-service.yaml` and `mongo-config-server-statefulset.yaml`, for the config server replica set.
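For the config server side, once those two manifests exist, the replica set still has to be initiated with `configsvr: true`; a rough sketch, where the pod/service names `mongo-configsvr-*` / `mongo-configsvr` and port 27019 are assumptions mirroring the naming above:

```bash
# initiate the config server replica set (assumes 2 replicas behind a headless
# service named mongo-configsvr, and a mongo shell available in the image)
kubectl exec mongo-configsvr-0 -- mongo --port 27019 --eval 'rs.initiate({
  _id: "cfgrs",
  configsvr: true,
  members: [
    { _id: 0, host: "mongo-configsvr-0.mongo-configsvr.default.svc.cluster.local:27019" },
    { _id: 1, host: "mongo-configsvr-1.mongo-configsvr.default.svc.cluster.local:27019" }
  ]
})'
```

The query router `args` would then point at this replica set via `--configdb cfgrs/mongo-configsvr-0.mongo-configsvr.default.svc.cluster.local:27019,mongo-configsvr-1.mongo-configsvr.default.svc.cluster.local:27019`.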
A `scrape_config` so that Prometheus scrapes the data from the MinIO exporter, see https://docs.min.io/docs/how-to-monitor-minio-using-prometheus.html

```bash
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install mongodb -f values.yaml bitnami/mongodb-sharded --namespace mongodb
```
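Before installing, it may be worth dumping the chart's defaults to see which knobs (number of shards, replicas per shard, metrics exporter for Prometheus) are already exposed, so `values.yaml` only carries the overrides; a small sketch:

```bash
# inspect the bitnami/mongodb-sharded defaults before writing values.yaml overrides
helm show values bitnami/mongodb-sharded > values.yaml
# after the install, watch config servers, shards and mongos come up
kubectl get pods -n mongodb -w
```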
Summary
Set up a cluster with:
* P_1 replicas for shard no. 1
* P_N replicas for shard no. N
Article found on Stack Exchange:
A Replica-Set means that you have multiple instances of MongoDB which each mirror all the data of each other. A replica-set consists of one Master (also called "Primary") and one or more Slaves (aka Secondary). Read-operations can be served by any slave, so you can increase read-performance by adding more slaves to the replica-set (provided that your client application is capable to actually use different set-members). But write-operations always take place on the master of the replica-set and are then propagated to the slaves, so writes won't get faster when you add more slaves.
Replica-sets also offer fault-tolerance. When one of the members of the replica-set goes down, the others take over. When the master goes down, the slaves will elect a new master. For that reason it is suggested for productive deployment to always use MongoDB as a replica-set of at least three servers, two of them holding data (the third one is a data-less "arbiter" which is required for determining a new master when one of the slaves goes down).
A Sharded Cluster means that each shard of the cluster (which can also be a replica-set) takes care of a part of the data. Each request, both reads and writes, is served by the cluster where the data resides. This means that both read- and write performance can be increased by adding more shards to a cluster. Which document resides on which shard is determined by the shard key of each collection. It should be chosen in a way that the data can be evenly distributed on all clusters and so that it is clear for the most common queries where the shard-key resides (example: when you frequently query by `user_name`, your shard-key should include the field `user_name` so each query can be delegated to only the one shard which has that document).
The drawback is that the fault-tolerance suffers. When one shard of the cluster goes down, any data on it is inaccessible. For that reason each member of the cluster should also be a replica-set. This is not required. When you don't care about high-availability, a shard can also be a single mongod instance without replication. But for production-use you should always use replication.
So what does that mean for your example?
When you want to split your data of 75GB into 3 shards of 25GB each, you need at least 6 database servers organized in three replica-sets. Each replica-set consists of two servers who have the same 25GB of data.
You also need servers for the arbiters of the three replica-sets as well as the mongos router and the config server for the cluster. The arbiters are very lightweight and are only needed when a replica-set member goes down, so they can usually share the same hardware with something else. But Mongos router and config-server should be redundant and on their own servers.
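Tying the shard-key point back to the cluster above, choosing `user_name` as the shard key would roughly look like this from a mongos (database and collection names are made up for the example):

```bash
# enable sharding for a database and shard a collection on user_name (hypothetical names)
kubectl exec mongo-qrouter-0 -- mongo --eval 'sh.enableSharding("appdb")'
kubectl exec mongo-qrouter-0 -- mongo --eval 'sh.shardCollection("appdb.users", { user_name: 1 })'
```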