influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Backing up fails due to a shard error #21743

Open bangeneticalgorithms opened 3 years ago

bangeneticalgorithms commented 3 years ago

Steps to reproduce: List the minimal actions needed to reproduce the behavior.

  1. Populate an influxdb instance
  2. Run influx backup
  3. ...

Expected behavior: Backup completes by downloading data from all shards

Actual behavior: Downloading the backup tar.gz files periodically errors out when trying to download a specific shard. The behavior seems to be non-deterministic.

Environment info:

Config: Copy any non-default config values here or attach the full config as a gist or file.

Logs:

2021-06-24T22:15:48.990826Z info Backing up shard {"log_id": "0Ux1LLx0000", "id": 699, "path": "backup.dir/2021-06-24_test/20210624T221548Z.s699.tar.gz"}
2021-06-24T22:15:48.994955Z info Backing up shard {"log_id": "0Ux1LLx0000", "id": 740, "path": "backup.dir/2021-06-24_test/20210624T221548Z.s740.tar.gz"}
2021-06-24T22:15:49.000888Z info Backing up shard {"log_id": "0Ux1LLx0000", "id": 781, "path": "backup.dir/2021-06-24_test/20210624T221548Z.s781.tar.gz"}
2021-06-24T22:15:49.004757Z info Backing up shard {"log_id": "0Ux1LLx0000", "id": 822, "path": "backup.dir/2021-06-24_test/20210624T221548Z.s822.tar.gz"}
2021-06-24T22:15:49.008300Z info Backing up shard {"log_id": "0Ux1LLx0000", "id": 863, "path": "backup.dir/2021-06-24_test/20210624T221548Z.s863.tar.gz"}
2021-06-24T22:15:49.012714Z info Backing up shard {"log_id": "0Ux1LLx0000", "id": 906, "path": "backup.dir/2021-06-24_test/20210624T221548Z.s906.tar.gz"}
2021-06-24T22:15:49.016365Z info Backing up shard {"log_id": "0Ux1LLx0000", "id": 947, "path": "backup.dir/2021-06-24_test/20210624T221548Z.s947.tar.gz"}
Error: Failed to download shard backup: An internal error has occurred.

Performance: Generate profiles with the following commands for bugs related to performance, locking, out of memory (OOM), etc.

# Commands should be run when the bug is actively happening.
# Note: This command will run for ~30 seconds.
curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=30s"
iostat -xd 1 30 > iostat.txt
# Attach the `profiles.tar.gz` and `iostat.txt` output files.

This is possibly a duplicate of #16739, but this bug is for influxd 2.0.

danxmoran commented 3 years ago

@bangeneticalgorithms do you have access to the server logs for when these errors are occurring? If so, could you paste them here? I expect they'll show more useful info than the plain 'An internal error has occurred' that the CLI is printing.

noose commented 3 years ago

I have the same issue.

No - there is nothing more useful.

ts=2021-07-13T07:00:36.263952Z lvl=info msg="Cache snapshot (start)" log_id=0VJbwT40000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=start
ts=2021-07-13T07:00:36.265756Z lvl=info msg="Cache snapshot (end)" log_id=0VJbwT40000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=end op_elapsed=1.768ms
ts=2021-07-13T07:00:36.266112Z lvl=debug msg=Request log_id=0VJbwT40000 service=http method=GET host=localhost:8086 path=/api/v2/backup/shards/19854 query= proto=HTTP/1.1 status_code=500 response_size=68 content_length=0 referrer= remote=127.0.0.1:34122 user_agent=Go-http-client took=2.706ms error="internal error" error_code="internal error" body=
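
To pick the failing requests out of a larger server log, one option is to filter for non-2xx responses on the shard backup endpoint. This is only a rough sketch: it assumes the log is in the logfmt style shown above and has been saved to a file, and the function name `failed_shard_backups` is made up for illustration. Note that the request line in the excerpt above is logged at debug level, so it may not appear at the default log level.

```shell
# Print the shard ID and HTTP status of every failed (4xx/5xx) shard-backup request.
# Assumes logfmt-style lines like the excerpt above; the function name is hypothetical.
failed_shard_backups() {
  # $1: path to a file containing influxd log output
  grep 'path=/api/v2/backup/shards/' "$1" |
    sed -n 's|.*path=/api/v2/backup/shards/\([0-9][0-9]*\).* status_code=\([45][0-9][0-9]\).*|shard=\1 status=\2|p'
}
```

Calling `failed_shard_backups influxd.log` (the file name is a placeholder) prints one line per failed request, which makes it easier to see whether the same shard fails on every run.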

lesam commented 3 years ago

@bangeneticalgorithms @noose The linked 1.x issue seems to be caused by running the server on a filesystem that does not support hardlinks. Do you know if the filesystem you are running the server on supports hardlinks?
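
A quick way to answer the hardlink question is to try creating one in the directory where influxd stores its data. The sketch below is an illustration only: the function name `supports_hardlinks` is hypothetical, and the user running it needs write access to that directory.

```shell
# Return 0 if a hardlink can be created inside directory $1, 1 if it cannot,
# and 2 if the probe file itself could not be created.
# The function name is hypothetical; the probe files are removed afterwards.
supports_hardlinks() {
  probe="$(mktemp "$1/hardlink-probe.XXXXXX")" || return 2
  if ln "$probe" "$probe.link" 2>/dev/null; then
    rm -f "$probe" "$probe.link"
    return 0
  fi
  rm -f "$probe"
  return 1
}
```

For example, `supports_hardlinks /path/to/influxd/data && echo "hardlinks supported"`, where the path is a placeholder for the actual data directory. Because the link is created next to the probe file, the check exercises the same filesystem the server writes to.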

lesam commented 3 years ago

Probably fixed by https://github.com/influxdata/influxdb/issues/22446, in the sense that with that change you should get a useful error message in the logs instead of the unhelpful 'internal error'.