Nameless80 opened 1 year ago
Replication works by mirroring any writes received to a local disk-based queue. A separate goroutine walks through that queue and writes the data to the cloud.

There is a periodic job that looks at the queue and deletes all data older than `max-age`. This purge interval runs every 1 minute, and the max-age is 1 week by default, so data can sit around for a while before being purged. If you would like it to be trimmed sooner, you can lower that `max-age`, at the cost of lower reliability.
For monitoring, the last status code is a good way to verify that things are being replicated. Alternatively, if you query the replication via the API or use the `--http-debug` flag, you will see the `remainingQueueSizeBytes` field. This is the number of bytes that have not yet been replicated (`currentQueueSizeBytes` is the size on disk). If `remainingQueueSizeBytes` is decreasing, or 0, then your replications are succeeding.

https://docs.influxdata.com/influxdb/v2.6/reference/cli/influx/replication/create/
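A minimal sketch of checking those fields programmatically, assuming the JSON shape matches the responses quoted in this thread (note that later CLI output in this thread surfaces the pending bytes as `remainingBytesToBeSynced`, so check which field name your build emits):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// replication mirrors a subset of the fields returned by the
// replications API; names follow the responses quoted in this thread.
type replication struct {
	Name                    string `json:"name"`
	CurrentQueueSizeBytes   int64  `json:"currentQueueSizeBytes"`
	RemainingQueueSizeBytes int64  `json:"remainingQueueSizeBytes"`
	LatestResponseCode      int    `json:"latestResponseCode"`
}

// healthy reports whether a replication looks OK per the advice
// above: last write succeeded (204) and nothing is waiting to sync.
func healthy(r replication) bool {
	return r.LatestResponseCode == 204 && r.RemainingQueueSizeBytes == 0
}

func main() {
	// Example payload shaped like the API response shown below.
	raw := `{"replications":[{"name":"replication_130","currentQueueSizeBytes":585025150,"remainingQueueSizeBytes":0,"latestResponseCode":204}]}`
	var resp struct {
		Replications []replication `json:"replications"`
	}
	if err := json.Unmarshal([]byte(raw), &resp); err != nil {
		panic(err)
	}
	for _, r := range resp.Replications {
		fmt.Println(r.Name, healthy(r))
	}
}
```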
I do not see `remainingQueueSizeBytes`. When I do `influx replication list --http-debug --org xxx -t`, I see this information:

```json
{
  "replications": [
    {
      "id": "0adef4660df7e000",
      "orgID": "ad8c24b24a7233fe",
      "name": "replication_130",
      "remoteID": "0adef405e0192000",
      "localBucketID": "92fac786165b762d",
      "remoteBucketID": null,
      "RemoteBucketName": "xxxxxxxx",
      "maxQueueSizeBytes": 1000000000,
      "currentQueueSizeBytes": 585025150,
      "latestResponseCode": 204,
      "latestErrorMessage": "",
      "dropNonRetryableData": false,
      "maxAgeSeconds": 8640000000000000
    }
  ]
}
```

As you can see, there is no `remainingQueueSizeBytes`. How can I get this information?
Ah, my bad. I thought that code went into 2.6.1, but the PR was merged after 2.6.1 was released. The next version of influxdb will have some improvements around this, including showing the information in the CLI.

In the short term, you would have to run a master build to see it. If you are seeing all of your data on the other side of the replication and seeing `204` in `latestResponseCode`, then you should have a healthy replication.
I am not an expert on InfluxDB, but I can explain some things based on the codebase and the way I understand it. I will answer the questions in the order they were asked.
1. Can you explain how replication works? Why does the queue never drop off once the data has been replicated to the remote bucket?
How replication works is explained well by @jeffreyssmith2nd in a comment above.
I will add some missing bits of explanation. The InfluxDB replication queue is saved to disk, not kept in memory.
Case 01: Replication is healthy (status code 204 and you see data in your replicated bucket)
The queue is saved in segmented chunks of 10 MB each. InfluxDB's queue management checks the queue for purging every minute, as mentioned above, and it only deletes whole segments (10 MB chunks). If a segment is not full, it will not be deleted, so after every purge there will still be data on disk amounting to less than 10 MB, which is what `currentQueueSizeBytes` shows. As I can see in your screenshots, your current queue size in bytes on disk is always less than 10 MB, or slightly higher when the 1-minute purge timer has not yet fired.
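To illustrate the segment arithmetic with the numbers from this thread (the 10 MB segment size comes from the explanation above; the helper itself is mine, not influxdb code): since only full segments are deleted, whatever does not fill the last segment stays on disk.

```go
package main

import "fmt"

const segmentSize = 10 * 1000 * 1000 // 10 MB segments, per the discussion above

// purgeableBytes returns how many bytes a purge could reclaim when
// only full segments are deleted: the partial tail segment survives.
func purgeableBytes(queueBytes int64) int64 {
	return (queueBytes / segmentSize) * segmentSize
}

func main() {
	queue := int64(585025150) // currentQueueSizeBytes from the example above
	reclaim := purgeableBytes(queue)
	// 58 full segments are purgeable; the ~5 MB remainder stays on disk,
	// which is why a healthy queue tends to settle below 10 MB.
	fmt.Println(reclaim, queue-reclaim)
}
```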
Case 02: Replication is not healthy
In this situation, `currentQueueSizeBytes` will grow until it reaches the size limit, at which point data is discarded in first-in-first-out order. When the 1-minute purge timer checks the queue, it checks whether the data has been replicated successfully; if not, it checks whether the data is within the configured `max-age`. If the data is older than the set max-age, it is purged; otherwise it is kept.
2. How to control the accumulation of "Current Queue Bytes" and make it stable?
Hope my answer above clears up this question. From the codebase, I don't think you can change the segment size; the default of 10 MB is always used.
3. How to monitor the health of the replication?
Answered by a comment above.
4. Why is there such a big delay? Can you explain when the drop happens based on our replication configuration?
I'm not quite sure about this one; it depends on the situation. Sometimes lag happens when the queue gets full.
We are having some replication issues: some points are not being replicated (InfluxDB version 2.7), but we do not see any error in the InfluxDB log. This is an example:

Node 1

```
,,1,2023-05-30T07:00:00Z,100,value,XXX,YYY,AAA,NNN,BBB,CCC
,,1,2023-05-30T08:00:00Z,100,value,XXX,YYY,AAA,NNN,BBB,CCC
,,1,2023-05-30T10:00:00Z,100,value,XXX,YYY,AAA,NNN,BBB,CCC
,,1,2023-05-30T11:00:00Z,100,value,XXX,YYY,AAA,NNN,BBB,CCC
```

Node 2

```
,,1,2023-05-30T07:00:00Z,100,value,XXX,YYY,AAA,NNN,BBB,CCC
,,1,2023-05-30T08:00:00Z,100,value,XXX,YYY,AAA,NNN,BBB,CCC
,,1,2023-05-30T11:00:00Z,100,value,XXX,YYY,AAA,NNN,BBB,CCC
```

(Note that the `2023-05-30T10:00:00Z` point is missing on node 2.)
We have the replication configured with `--no-drop-non-retryable-data`. Is there a way to see the information in the queue to try to investigate what is happening? We are using the `influx replication list` command to monitor the replication, but no error has been detected, e.g.:

```json
{
  "replications": [
    {
      "id": "0b45fcb1083f4000",
      "orgID": "f1c1a59ddf65a579",
      "name": "my_replication_stream",
      "remoteID": "0b45fcaff4d37000",
      "localBucketID": "872d031ba055152b",
      "remoteBucketID": null,
      "RemoteBucketName": "our_metrics",
      "maxQueueSizeBytes": 10737418240,
      "currentQueueSizeBytes": 10205395,
      "remainingBytesToBeSynced": 0,
      "latestResponseCode": 204,
      "latestErrorMessage": "",
      "dropNonRetryableData": false,
      "maxAgeSeconds": 604800
    }
  ]
}
```

Is there another way to monitor the queue? Also, in ticket https://github.com/influxdata/influxdb/issues/22880 we have read: "It is possible that remote write errors will be encountered for which retrying does not provide any hope of success, such as a 400 error which means that the LP data could not be parsed." How is it possible that a line-protocol point is parsed by node 1 but not by node 2?
We have found the error and opened ticket https://github.com/influxdata/influxdb/issues/24263.
Expected behavior: the Current Queue Bytes is reduced after the data has been replicated to the remote bucket.
Actual behavior: the Current Queue Bytes keeps accumulating.
I noticed that "Current Queue Bytes" was reduced from 10451563 to 58001; see the attached image below. Here is the additional information:
Environment info:
Questions: