operate-first / support

This repo should serve as a central source for users to raise issues/questions/requests for Operate First.

Relocate volumes to NFS storage #795

Closed · larsks closed this issue 3 years ago

larsks commented 3 years ago

Now that we have additional storage space available, should we relocate some of our larger volumes from the Ceph pool?

The distribution of volume sizes looks like:

[image: distribution of volume sizes]

If we just look at volumes >= 100GB, we see:

NAME                                                          SIZE
data-opf-observatorium-thanos-compact-0                       100000000
prometheus-odh-monitoring-db-prometheus-odh-monitoring-0      100000000
prometheus-odh-monitoring-db-prometheus-odh-monitoring-1      100000000
data-odh-message-bus-kafka-0                                  200000000
data-odh-message-bus-kafka-1                                  200000000
data-odh-message-bus-kafka-2                                  200000000
image-registry-storage                                        200000000
elasticsearch-elasticsearch-cd-ag7sr92w-1                     280000000
elasticsearch-elasticsearch-cd-ag7sr92w-2                     280000000
elasticsearch-elasticsearch-cdm-ipgbb3c0-1                    280000000
elasticsearch-elasticsearch-cdm-ipgbb3c0-2                    280000000
elasticsearch-elasticsearch-cdm-ipgbb3c0-3                    280000000

Do we want to move some of the larger volumes over to the NFS storage to relieve pressure on the ceph pool?
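
For reference, a similar listing can be produced straight from the PVC objects; this is a hypothetical reconstruction, not necessarily how the table above was generated:

# List PVC names and their requested sizes across all namespaces:
oc get pvc --all-namespaces \
  -o custom-columns=NAME:.metadata.name,SIZE:.spec.resources.requests.storage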

larsks commented 3 years ago

Also...do we want to make the NFS storage class the default?
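
If we do, it's a one-annotation change; a minimal sketch, assuming the NFS storage class is literally named "nfs":

# Mark the NFS storage class as the default ("nfs" is a placeholder name):
oc patch storageclass nfs -p \
  '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
# Remove the annotation (or set it to "false") on whatever class is currently the default.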

tumido commented 3 years ago

Well... things look a bit different when we take allocations and the "really" used space into account as well.

This dashboard is a WIP and shows correct data for PVCs only; the object storage part is not done yet: https://grafana-route-opf-monitoring.apps.zero.massopen.cloud/d/jQ-DdzCMk/tcoufal-pvc-ceph?orgId=1

Is there an easy way to migrate PVCs properly without service disruption? Do we have a process for that?

larsks commented 3 years ago

I don't think there's going to be a way to transparently migrate data between storage classes. The process is going to look something like:
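
As a rough sketch of what such a migration typically involves, assuming the workload can tolerate a brief outage; names in angle brackets and the manifest filename are placeholders:

# 1. Quiesce the workload that owns the volume:
oc scale statefulset <app> --replicas=0
# 2. Create a new PVC on the NFS storage class:
oc apply -f new-pvc-nfs.yaml
# 3. Run a temporary pod that mounts both the old and the new PVC, then copy the data:
oc rsh <copy-pod> cp -a /old-volume/. /new-volume/
# 4. Update the workload to reference the new PVC and scale it back up:
oc scale statefulset <app> --replicas=1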

HumairAK commented 3 years ago

Aside from the image registry, I think we can just re-create the rest of the PVCs (this data should exist in object storage anyway, AFAIK). We can probably also shrink the ES PVCs, since I don't think they are being fully utilized.

larsks commented 3 years ago

> Well... things look a bit different when we take allocations and the "really" used space into account as well.

I'm not sure that things are any different. It looks like OpenShift calculates storage availability based on the allocation size of the volume, not the actual amount of data.

tumido commented 3 years ago

I think we can revisit most of the PVC allocations and just make them fit their usage better... But that's not the point of this issue; it's an aspect to consider for the future as well, though.

Let's plan on relocating the PVCs for some of the applications.

Let's switch gears a bit: a few questions about the NFS. If we want to switch over to it as the default storage class, we need to understand what that means compared to the in-cluster Ceph storage.

larsks commented 3 years ago

> What does it mean in the sense of performance and access times? I think I've heard something about NFS being generally slower, but how much? Is it sufficient for a DB, for example?

We're going to find out! I mean, the major difference is that all I/O goes through a single interface on the server, rather than being distributed across a number of servers. In theory, as the number of consumers goes up, we may see more of an issue. We may be able to set up multiple connections between the server and the switch to alleviate this a bit.

Maybe we want to start by gathering some sort of performance statistics so that we have a baseline and can see how it changes over time.

> What does it mean in the sense of reliability and data recovery? Ceph can recover data on a disk failure to some extent. What capabilities of this kind do we provide for the NFS?

In this case NFS itself isn't the issue, because NFS is just the network component (unlike Ceph, NFS doesn't manage the underlying storage at all).

https://docs.massopen.cloud/en/latest/operations/truenas-storage-for-zero-cluster.html has some details on the underlying filesystem organization. We're using ZFS with multiple RAIDZ2 vdevs. Each RAIDZ2 vdev can handle up to two disk failures, so across the entire server we could suffer anywhere from 2 to 8 disk failures without an interruption in service (depending on which vdevs the failures occur in).

ZFS is also somewhat resistant to bit rot, and can reconstruct files with errors using parity data.
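
For context (not specific to our pool layout): ZFS detects corruption through per-block checksums and repairs it from RAIDZ parity either on read or during a periodic scrub. A minimal sketch, with a hypothetical pool name:

# Walk every allocated block, verify checksums, and repair damage from parity
# ("tank" is a placeholder pool name):
zpool scrub tank
# Report scrub progress and any repaired or unrecoverable errors:
zpool status tank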

larsks commented 3 years ago

Details on data integrity and RAIDZ from Wikipedia.

larsks commented 3 years ago

@tumido w/r/t your question about performance, the problem is apparently the opposite of what we might expect. I ran a simple sysbench fileio benchmark on cephfs, ceph rbd, and nfs volumes, with the surprising result that our Ceph storage is apparently awful at the moment.
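
The parameters in the reports below all match the sysbench fileio defaults, so the runs were presumably something close to the following; an approximate reconstruction, not the exact commands used:

# Run from a directory on the volume under test:
sysbench fileio --file-total-size=2G prepare
sysbench fileio --file-total-size=2G --file-test-mode=rndrw run
sysbench fileio --file-total-size=2G cleanup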

NFS results

sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Extra file open flags: (none)
128 files, 16MiB each
2GiB total file size
Block size 16KiB
Number of IO requests: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Initializing worker threads...

Threads started!

File operations:
    reads/s:                      2895.07
    writes/s:                     1930.04
    fsyncs/s:                     6183.14

Throughput:
    read, MiB/s:                  45.24
    written, MiB/s:               30.16

General statistics:
    total time:                          10.0077s
    total number of events:              110066

Latency (ms):
         min:                                    0.00
         avg:                                    0.09
         max:                                   21.82
         95th percentile:                        0.57
         sum:                                 9952.98

Threads fairness:
    events (avg/stddev):           110066.0000/0.00
    execution time (avg/stddev):   9.9530/0.00

cephfs results

sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Extra file open flags: (none)
128 files, 16MiB each
2GiB total file size
Block size 16KiB
Number of IO requests: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Initializing worker threads...

Threads started!

File operations:
    reads/s:                      41.58
    writes/s:                     27.79
    fsyncs/s:                     89.56

Throughput:
    read, MiB/s:                  0.65
    written, MiB/s:               0.43

General statistics:
    total time:                          10.0024s
    total number of events:              1462

Latency (ms):
         min:                                    0.00
         avg:                                    6.84
         max:                                  644.53
         95th percentile:                       30.81
         sum:                                 9998.99

Threads fairness:
    events (avg/stddev):           1462.0000/0.00
    execution time (avg/stddev):   9.9990/0.00

rbd results

sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Extra file open flags: (none)
128 files, 16MiB each
2GiB total file size
Block size 16KiB
Number of IO requests: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Initializing worker threads...

Threads started!

File operations:
    reads/s:                      53.37
    writes/s:                     35.58
    fsyncs/s:                     122.95

Throughput:
    read, MiB/s:                  0.83
    written, MiB/s:               0.56

General statistics:
    total time:                          10.1153s
    total number of events:              2016

Latency (ms):
         min:                                    0.00
         avg:                                    5.00
         max:                                  382.40
         95th percentile:                        7.30
         sum:                                10080.91

Threads fairness:
    events (avg/stddev):           2016.0000/0.00
    execution time (avg/stddev):   10.0809/0.00

It took so long to complete benchmarking on the ceph volumes that I thought the benchmarking tool had frozen.

I guess we have something to explore next week.

tumido commented 3 years ago

Interesting... Yup, it seems like we have a bottleneck somewhere around Ceph.

We currently don't have metrics on storage usage from Prometheus, due to parallel work on metrics, but in general we can get some idea of what can be downsized and relocated based on the usage shown by the OCP console.

I suggest relocating the Cluster Logging storage to NFS. We can shrink it down by a great amount and relocate it in one go. This PR should help: https://github.com/operate-first/apps/pull/623

Another good candidate would be Kafka: https://github.com/operate-first/apps/pull/624

tumido commented 3 years ago

Actually, I'm gonna convert this issue into an epic and we should scope out other relocation tasks from it instead of YOLO PRs as I just did...

larsks commented 3 years ago

I've opened https://access.redhat.com/support/cases/#/case/02937410 regarding the performance questions.

tumido commented 3 years ago

Resolved.