influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Replication issue #24117

Open Nameless80 opened 1 year ago

Nameless80 commented 1 year ago

Steps to reproduce: List the minimal actions needed to reproduce the behavior.

  1. Create a simulation of 1800 variables ingested from a Node-RED OPC UA client at intervals of 2, 5, and 10 seconds (4 x WinCC OPC UA servers).
  2. Create a replication from a local bucket (Windows desktop container) to a remote bucket (NAS container).
  3. Observe that "Current Queue Bytes" keeps accumulating. I noticed the variables are being replicated to the remote bucket, so why does "Current Queue Bytes" not decrease?

Expected behavior: "Current Queue Bytes" decreases after the variables have been replicated to the remote bucket.

Actual behavior: "Current Queue Bytes" keeps accumulating (see attached image).

Update: I noticed that "Current Queue Bytes" has since dropped from 10451563 to 58001; see the attached image. Here is the additional information (see attached image).

Environment info:

Questions:

  1. Can you explain how replication works? Why does the queue never drop, given that the data has already been replicated to the remote bucket?
  2. How can we control the accumulation of "Current Queue Bytes" and keep it stable?
  3. How can we monitor the health of the replication?
  4. Why is there such a big delay? Can you explain when the drop happens, based on our replication configuration?
jeffreyssmith2nd commented 1 year ago

Replication works by mirroring any writes received into a local disk-based queue. A separate goroutine walks through that queue and writes its contents to the cloud.

There is a periodic job that looks at the queue and deletes all data older than max-age. This purge interval runs every 1 minute, and the max-age is 1 week by default. So data can sit around for a while before being purged. If you would like it to be trimmed sooner, you can lower the max-age, at the cost of reliability (unreplicated data is dropped sooner).

For monitoring, the latest status code is a good way to verify that things are being replicated. Alternatively, if you query the replication via the API or use the --http-debug flag, you will see the remainingQueueSizeBytes field. This is the number of bytes that have not yet been replicated (currentQueueSizeBytes is the size on disk). If remainingQueueSizeBytes is decreasing, or is 0, then your replications are succeeding.

https://docs.influxdata.com/influxdb/v2.6/reference/cli/influx/replication/create/
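A health check based on the fields above could look like the sketch below. The struct tags are taken from the JSON shown in this thread (note that later CLI output in this thread names the pending-bytes field remainingBytesToBeSynced, so adjust the tag to match your version); the sample payload and the `healthy` helper are illustrative, not an official schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// replication mirrors a subset of the fields returned for a replication
// stream; field names follow the JSON shown in this thread.
type replication struct {
	Name                    string `json:"name"`
	MaxQueueSizeBytes       int64  `json:"maxQueueSizeBytes"`
	CurrentQueueSizeBytes   int64  `json:"currentQueueSizeBytes"`
	RemainingQueueSizeBytes int64  `json:"remainingQueueSizeBytes"`
	LatestResponseCode      int    `json:"latestResponseCode"`
	LatestErrorMessage      string `json:"latestErrorMessage"`
}

// healthy reports whether a replication looks good per the comment above:
// the last remote write succeeded and nothing is waiting to be synced.
func healthy(r replication) bool {
	return r.LatestResponseCode == 204 && r.RemainingQueueSizeBytes == 0
}

func main() {
	// Sample payload shaped like the output discussed in this thread.
	raw := `{"replications":[{"name":"replication_130","maxQueueSizeBytes":1000000000,"currentQueueSizeBytes":585025150,"remainingQueueSizeBytes":0,"latestResponseCode":204,"latestErrorMessage":""}]}`
	var resp struct {
		Replications []replication `json:"replications"`
	}
	if err := json.Unmarshal([]byte(raw), &resp); err != nil {
		panic(err)
	}
	fmt.Println(healthy(resp.Replications[0])) // true
}
```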

EBS2324 commented 1 year ago

I do not see remainingQueueSizeBytes. When I run `influx replication list --http-debug --org xxx -t`, I see this information:

```json
{
  "replications": [
    {
      "id": "0adef4660df7e000",
      "orgID": "ad8c24b24a7233fe",
      "name": "replication_130",
      "remoteID": "0adef405e0192000",
      "localBucketID": "92fac786165b762d",
      "remoteBucketID": null,
      "RemoteBucketName": "xxxxxxxx",
      "maxQueueSizeBytes": 1000000000,
      "currentQueueSizeBytes": 585025150,
      "latestResponseCode": 204,
      "latestErrorMessage": "",
      "dropNonRetryableData": false,
      "maxAgeSeconds": 8640000000000000
    }
  ]
}
```

As you can see, there is no remainingQueueSizeBytes. How can I get this information?

jeffreyssmith2nd commented 1 year ago

Ah, my bad: I thought that code went into 2.6.1, but the PR was merged after 2.6.1 was released. The next version of InfluxDB will have some improvements around this, including showing the information in the CLI.

In the short term, you would have to run a master build to see it. If you are seeing all of your data on the other side of the replication, and a 204 in latestResponseCode, then you should have a healthy replication.

thisaraCX commented 1 year ago

I am not an expert on InfluxDB, but I can explain some things based on the codebase and the way I understand it. I will answer the questions in the order they were asked.

1. Can you explain how replication works? Why does the queue never drop, given that the data has already been replicated to the remote bucket?

How replication works is explained well in the comment above by @jeffreyssmith2nd:

> Replication works by mirroring any writes received to a local disk-based queue. A separate goroutine walks through that queue and writes it to cloud.
>
> There is a periodic job that looks at the queue and deletes all data older than max-age. This purge interval runs every 1 minute and the max-age is 1 week by default. So, you can have data sit around for awhile before being purged. If you would like it to be trimmed sooner, you can lower that max-age with lower reliability.
>
> For monitoring, the last status code is a good way to verify that things are being replicated. Alternatively if you query the replication via the API or using the --http-debug flag, you will see the remainingQueueSizeBytes field. This is the amount of bytes that have not been replicated (currentQueueSizeBytes is the size on disk). If remainingQueueSizeBytes is decreasing, or 0, then your replications are succeeding.
>
> https://docs.influxdata.com/influxdb/v2.6/reference/cli/influx/replication/create/

I will add some missing details. The InfluxDB replication queue is stored on disk, not in memory.

Case 01: Replication is healthy (status code 204, and you see data in the replicated bucket)

The queue is stored in segmented chunks of 10 MB each. As mentioned above, queue management checks the queue for purging every minute, and it only deletes whole segments (10 MB chunks). If a segment is not full, it will not be deleted. So after every purge there will still be less than 10 MB of data left on disk, and that remainder is what currentQueueSizeBytes shows. As I can see in your screenshots, your current queue size on disk is always less than 10 MB, or slightly higher when the 1-minute purge timer has not yet fired.
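The arithmetic behind this can be sketched as follows. The 10 MB segment size is taken from the explanation above; `bytesLeftOnDisk` is a hypothetical helper illustrating why currentQueueSizeBytes hovers below one segment when replication is healthy, not actual InfluxDB code.

```go
package main

import "fmt"

// segmentSize matches the 10 MB segment size described above.
const segmentSize = 10 * 1024 * 1024

// bytesLeftOnDisk estimates currentQueueSizeBytes after a purge, assuming
// only whole segments whose contents are fully replicated get deleted.
func bytesLeftOnDisk(totalQueuedBytes, replicatedBytes int64) int64 {
	// Only complete, fully replicated segments can be dropped.
	fullReplicatedSegments := replicatedBytes / segmentSize
	return totalQueuedBytes - fullReplicatedSegments*segmentSize
}

func main() {
	// 100 MB written and all of it replicated: nine full segments are
	// deleted, but the partial tail segment stays on disk.
	fmt.Println(bytesLeftOnDisk(100_000_000, 100_000_000)) // 5628160 (under 10 MB)
}
```

So even with replication fully caught up, the on-disk queue size only reaches zero when the last segment happens to be exactly full.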

Case 02: Replication is not healthy

In this situation, currentQueueSizeBytes will grow until it reaches the size limit, after which data is discarded in first-in, first-out order. When the 1-minute purge timer checks the queue, it checks whether the data has been replicated successfully; if not, it checks whether the data is within the configured max-age. Data older than max-age is purged; otherwise it is kept.
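The first-in, first-out discard policy at the size limit can be sketched like this. It is a simplified model of the behavior described above (`enqueue` is a hypothetical name, and segments are represented only by their byte counts), not InfluxDB's actual queue code.

```go
package main

import "fmt"

// enqueue appends a new segment's byte count to the queue and, when the
// total exceeds maxQueueSizeBytes, discards the oldest segments first,
// mirroring the FIFO discard behavior described above.
func enqueue(segments []int64, newSegment, maxQueueSizeBytes int64) []int64 {
	segments = append(segments, newSegment)
	var total int64
	for _, s := range segments {
		total += s
	}
	for total > maxQueueSizeBytes && len(segments) > 1 {
		total -= segments[0]
		segments = segments[1:] // drop the oldest segment
	}
	return segments
}

func main() {
	// Two 10-byte segments queued, limit 25: adding a third pushes the
	// total to 30, so the oldest segment is discarded.
	q := enqueue([]int64{10, 10}, 10, 25)
	fmt.Println(q) // [10 10]
}
```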

2. How to control the accumulation of "Current Queue Bytes" and make it stable?

I hope my answer above clears this up. From the codebase, I don't think you can change the segment size; the default of 10 MB is always used.

3. How to monitor the health of the replication?

Answered in the comment above.

4. Why is there such a big delay? Can you explain when the drop happens based on our replication configuration?

I'm not quite sure about this one; it depends on the situation. Sometimes lag happens when the queue gets full.

EBS2324 commented 1 year ago

We are having some replication issues: some line protocol (LP) points are not being replicated (InfluxDB version 2.7), but we do not see any error in the InfluxDB log. This is an example:

Node 1

```
,,1,2023-05-30T07:00:00Z,100,value,XXX,YYY,AAA,NNN,BBB,CCC
,,1,2023-05-30T08:00:00Z,100,value,XXX,YYY,AAA,NNN,BBB,CCC
,,1,2023-05-30T10:00:00Z,100,value,XXX,YYY,AAA,NNN,BBB,CCC
,,1,2023-05-30T11:00:00Z,100,value,XXX,YYY,AAA,NNN,BBB,CCC
```

Node 2

```
,,1,2023-05-30T07:00:00Z,100,value,XXX,YYY,AAA,NNN,BBB,CCC
,,1,2023-05-30T08:00:00Z,100,value,XXX,YYY,AAA,NNN,BBB,CCC
,,1,2023-05-30T11:00:00Z,100,value,XXX,YYY,AAA,NNN,BBB,CCC
```

(Node 2 is missing the 10:00:00Z point.)

We have the replication configured with --no-drop-non-retryable-data. Is there a way to inspect the information in the queue, to investigate what is happening? We are using the `influx replication list` command to monitor the replication, but no error has been detected, e.g.:

```json
{
  "replications": [
    {
      "id": "0b45fcb1083f4000",
      "orgID": "f1c1a59ddf65a579",
      "name": "my_replication_stream",
      "remoteID": "0b45fcaff4d37000",
      "localBucketID": "872d031ba055152b",
      "remoteBucketID": null,
      "RemoteBucketName": "our_metrics",
      "maxQueueSizeBytes": 10737418240,
      "currentQueueSizeBytes": 10205395,
      "remainingBytesToBeSynced": 0,
      "latestResponseCode": 204,
      "latestErrorMessage": "",
      "dropNonRetryableData": false,
      "maxAgeSeconds": 604800
    }
  ]
}
```

Is there another way to monitor the queue? Also, in ticket https://github.com/influxdata/influxdb/issues/22880 we read: "It is possible that remote write errors will be encountered for which retrying does not provide any hope of success, such as a 400 error which means that the LP data could not be parsed." How is it possible that a line of LP is parsed by node 1 but not by node 2?

EBS2324 commented 1 year ago

We have found the error and opened ticket https://github.com/influxdata/influxdb/issues/24263.