gnocchixyz / gnocchi

Timeseries database
Apache License 2.0

Address high IOPs usage of the Gnocchi Ceph pool #1381

Closed: rafaelweingartner closed this issue 1 month ago

rafaelweingartner commented 4 months ago

Before I describe the situation, let me put us all on the same page regarding the concepts we are dealing with. This patch is about the Ceph backend that Gnocchi can use to store processed and raw measurements. Ceph is a software-defined storage system which, when deployed, implements a concept called Reliable Autonomic Distributed Object Store (RADOS). Do not confuse RADOS with RadosGW, which is the sub-system that implements an S3 API on top of a Ceph backend. In Ceph we have RADOS objects, which are different from RadosGW (S3) objects. How RADOS objects are used depends on the system that consumes them; they are the building blocks of any Ceph cluster. For instance, when using RADOS Block Device (RBD), librbd and krbd use a 4 MiB RADOS object size by default. Each IOP reported by Ceph is one read or write operation on a RADOS object. The RADOS objects can be sized and used differently depending on the system that consumes Ceph.

Unlike systems that consume Ceph via a standard protocol such as RBD or CephFS (which mounts a Ceph pool as a POSIX file system), Gnocchi consumes Ceph natively; that is, Gnocchi interacts directly with the low-level RADOS objects. Every metric (processed or raw) is stored in a single RADOS object; processed metrics are stored in different objects according to their time frames (time splits). Unlike other systems where there is a standard size for RADOS objects, Gnocchi handles each object in an isolated fashion. Therefore, for some metrics the RADOS objects are bigger or smaller depending on the volume of data we have for the given metric and time frame.

Gnocchi uses librados [1] to interact with a Ceph backend. When writing a raw metric, Gnocchi uses the write_full method [2], which writes the entire dataset into a RADOS object in one call. That write is counted by Ceph as one (1) IOP; it does not matter whether the dataset is 1 KiB, 1 MiB, or 10 MiB, it will be a single write operation. On the other hand, when reading, Gnocchi uses the read method [3]; as one can see, the read operation does not fetch the complete object in a single call. It reads the data in pieces, and the default chunk size is 8 KiB. This can cause high read IOPS in certain cases, such as when we have raw metrics for a one-year back window.
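To make the cost concrete, here is a minimal model of that chunked read loop. This is plain Python for illustration, not the actual Gnocchi/librados code; it only counts how many read operations a fixed 8 KiB buffer turns one logical read into:

```python
# Simplified model (illustrative only, not the real librados code) of how
# reading a RADOS object in fixed-size chunks multiplies read operations.

DEFAULT_CHUNK = 8192  # 8 KiB, the default read buffer mentioned above


def count_read_ops(object_size: int, chunk_size: int = DEFAULT_CHUNK) -> int:
    """Number of read calls needed to fetch the whole object."""
    ops = 0
    offset = 0
    while offset < object_size:
        ops += 1
        offset += chunk_size
    return ops


# A 2 MiB raw-measure object costs 256 reads in 8 KiB chunks,
# but only 1 read when the buffer matches the object size.
print(count_read_ops(2 * 1024 * 1024))                    # 256
print(count_read_ops(2 * 1024 * 1024, 2 * 1024 * 1024))   # 1
```

So a single large raw-measure object can account for hundreds of read IOPS on every processing pass, which is exactly the behavior the graphs below show.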

The proposal to address this situation is to add an adaptive read process to Gnocchi when it uses Ceph as a backend. That is, we store the size of the RADOS object for each metric, and then use that size to configure the read buffer. This allows Gnocchi to reduce the number of read operations issued against the Ceph cluster.

The following picture demonstrates the difference between the standard Gnocchi Ceph code and the proposed solution. Furthermore, in beige, there is an example of a further improvement achieved by combining this code with some tuning, such as disabling the "greedy" option in Gnocchi and increasing the interval between MetricD processing runs from 60s to 300s.
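For reference, the tuning mentioned above would presumably be expressed in gnocchi.conf along these lines. The option names are taken from Gnocchi's `[metricd]` section but should be checked against the documentation for your version; this is an assumed fragment, not part of the patch:

```ini
# gnocchi.conf -- illustrative fragment; verify option names and defaults
# against your Gnocchi version's documentation.
[metricd]
# Do not eagerly schedule newly created metrics for processing.
greedy = false
# Wait 300s (instead of the default 60s) between processing passes.
metric_processing_delay = 300
```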

Screenshot from 2024-03-27 14-13-41

The spikes shown in the picture, highlighted with a star, are a consequence of the code. That is, in the worst-case scenario, on the first run the system will not yet have "learned" the RADOS object size, so the read is not optimal. After the first round of processing, the system learns the pattern, and the reads are improved.

[1] https://docs.ceph.com/en/latest/rados/api/python/
[2] https://docs.ceph.com/en/latest/rados/api/python#rados.Ioctx.write_full
[3] https://docs.ceph.com/en/latest/rados/api/python/#rados.Ioctx.read

rafaelweingartner commented 2 months ago

Hello @jd and @chungg, we have interesting new patches that might be worth taking a look at. This one, for instance, provides great benefits for people using Gnocchi with a Ceph backend.

rafaelweingartner commented 2 months ago

> thanks! this makes sense to me. will let more active members merge (or will merge if no one else does).

Awesome! Thanks for your review!

rafaelweingartner commented 1 month ago

@tobias-urdin, thanks for the support here!