grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0
4.17k stars 537 forks source link

Flaky `TestDistributorHighConcurrency` #7048

Open dimitarvdimitrov opened 10 months ago

dimitarvdimitrov commented 10 months ago

Example failure

Details

``` === RUN TestDistributorHighConcurrency/caching_unmarshal_data=true,_pooling_write_requests=false 01:16:40 Starting minio-9000 01:16:40 Starting consul 01:16:40 consul: ==> Starting Consul agent... 01:16:40 consul: Version: '1.8.15' 01:16:40 consul: Node ID: 'bdd6e52f-c053-d1fb-faf1-3936b37334ff' 01:16:40 consul: Node name: 'consul' 01:16:40 consul: Datacenter: 'dc1' (Segment: '') 01:16:40 consul: Server: true (Bootstrap: false) 01:16:40 consul: Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: 8502, DNS: 8600) 01:16:40 consul: Cluster Addr: 127.0.0.1 (LAN: 8301, WAN: 8302) 01:16:40 consul: Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false 01:16:40 consul: ==> Log data will now stream in as it occurs: 01:16:40 Ports for container: e2e-mimir-test-consul Mapping: map[8500:32785] 01:16:40 Started consul 01:16:40 Ports for container: e2e-mimir-test-minio-9000 Mapping: map[9000:32786] 01:16:40 Started minio-9000 01:16:42 Starting querier 01:16:42 Starting distributor 01:16:42 Starting ingester 01:16:43 Ports for container: e2e-mimir-test-distributor Mapping: map[8080:32787] 01:16:43 Started distributor 01:16:43 Ports for container: e2e-mimir-test-querier Mapping: map[8080:32788] 01:16:43 Started querier 01:16:43 Ports for container: e2e-mimir-test-ingester Mapping: map[8080:32789] 01:16:43 Started ingester distributor_high_concurrency_test.go:123: Error Trace: /home/runner/work/mimir/mimir/integration/distributor_high_concurrency_test.go:123 /opt/hostedtoolcache/go/1.21.5/x64/src/runtime/asm_amd64.s:1650 Error: Not equal: expected: 200 actual : 400 Test: TestDistributorHighConcurrency/caching_unmarshal_data=true,_pooling_write_requests=false Messages: 400 Bad Request 01:16:44 ingester: ts=2024-01-04T01:16:44.53352665Z caller=grpc_logging.go:66 level=warn method=/cortex.Ingester/Push duration=9.211452ms msg=gRPC err="user=e2e-user: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-01-04T00:40:00Z and is from series series_9" 01:16:44 distributor: ts=2024-01-04T01:16:44.537758655Z caller=push.go:171 level=error user=e2e-user msg="push error" err="rpc error: code = Code(400) desc = failed pushing to ingester ingester: user=e2e-user: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-01-04T00:40:00Z and is from series series_9" 01:16:58 querier: ts=2024-01-04T01:16:58.34791[197](https://github.com/grafana/mimir/actions/runs/7404278739/job/20145610140?pr=7046#step:15:198)9Z caller=loader.go:116 level=warn msg="bucket index not found" user=e2e-user 01:16:58 Killing ingester 01:16:58 Killing querier 01:16:58 Killing consul 01:16:58 Killing minio-9000 01:16:58 Killing distributor 01:16:59 Killed consul 01:16:59 Killed ingester 01:16:59 Killed minio-9000 01:16:59 Killed querier 01:16:59 Killed distributor ```

dimitarvdimitrov commented 10 months ago

another two instances that happened in the same CI run

=== RUN   TestDistributorHighConcurrency/caching_unmarshal_data=false,_pooling_write_requests=false
11:58:01 Started distributor
    distributor_high_concurrency_test.go:123: 
            Error Trace:    /home/runner/work/mimir/mimir/integration/distributor_high_concurrency_test.go:123
                                        /opt/hostedtoolcache/go/1.21.6/x64/src/runtime/asm_amd64.s:[165](https://github.com/grafana/mimir/actions/runs/7654088514/job/20857423630?pr=7205#step:15:166)0
            Error:          Not equal: 
                            expected: 200
                            actual  : 400
            Test:           TestDistributorHighConcurrency/caching_unmarshal_data=false,_pooling_write_requests=false
            Messages:       400 Bad Request
11:58:03 ingester: ts=2024-01-25T11:58:03.226590875Z caller=grpc_logging.go:66 level=warn method=/cortex.Ingester/Push duration=29.693378ms msg=gRPC err="user=e2e-user: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-01-25T11:20:00Z and is from series series_18"
11:58:03 distributor: ts=2024-01-25T11:58:03.230084908Z caller=push.go:[171](https://github.com/grafana/mimir/actions/runs/7654088514/job/20857423630?pr=7205#step:15:172) level=error user=e2e-user msg="push error" err="failed pushing to ingester ingester: user=e2e-user: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-01-25T11:20:00Z and is from series series_18"
11:58:17 querier: ts=2024-01-25T11:58:17.467712012Z caller=loader.go:116 level=warn msg="bucket index not found" user=e2e-user
11:58:17 Killing distributor
=== RUN   TestDistributorHighConcurrency/caching_unmarshal_data=false,_pooling_write_requests=true
11:58:22 Started distributor
    distributor_high_concurrency_test.go:123: 
            Error Trace:    /home/runner/work/mimir/mimir/integration/distributor_high_concurrency_test.go:123
                                        /opt/hostedtoolcache/go/1.21.6/x64/src/runtime/asm_amd64.s:1650
            Error:          Not equal: 
                            expected: [200](https://github.com/grafana/mimir/actions/runs/7654088514/job/20857423630?pr=7205#step:15:201)
                            actual  : 400
            Test:           TestDistributorHighConcurrency/caching_unmarshal_data=false,_pooling_write_requests=true
            Messages:       400 Bad Request
11:58:22 distributor: ts=[202](https://github.com/grafana/mimir/actions/runs/7654088514/job/20857423630?pr=7205#step:15:203)4-01-25T11:58:22.848469548Z caller=push.go:171 level=error user=e2e-user msg="push error" err="failed pushing to ingester ingester: user=e2e-user: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-01-25T11:20:00Z and is from series series_17"
11:58:22 ingester: ts=2024-01-25T11:58:22.845433314Z caller=grpc_logging.go:66 level=warn method=/cortex.Ingester/Push duration=29.989559ms msg=gRPC err="user=e2e-user: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-01-25T11:20:00Z and is from series series_17"
11:58:37 querier: ts=2024-01-25T11:58:37.722351814Z caller=loader.go:116 level=warn msg="bucket index not found" user=e2e-user
11:58:38 Killing distributor
charleskorn commented 7 months ago

Another example

pstibrany commented 7 months ago

The test works like this: it computes 20 minute window in which it will write 25 independant series, each with 1000 samples, spaced regularly (1.2s between sample).

20 minute window is computed as:

From the logs we can see that all failures fail to write sample with "start" timestamp:

Theory: I suspect that writing previous series somehow bumps min valid time after "start", but I don't quite understand why that would happen.

charleskorn commented 5 months ago

Another example

krajorama commented 5 months ago

One more

gotjosh commented 5 months ago

One more example failure

dimitarvdimitrov commented 3 months ago

still happening

dimitarvdimitrov commented 2 months ago

happened again

Details

``` === RUN TestDistributorHighConcurrency/caching_unmarshal_data=true,_pooling_write_requests=false 17:33:41 Starting minio-9000 17:33:41 Starting consul 17:33:41 consul: ==> Starting Consul agent... 17:33:41 consul: Version: '1.8.15' 17:33:41 consul: Node ID: 'cf6bf76d-4c43-66ca-60f0-1e338ac1c01c' 17:33:41 consul: Node name: 'consul' 17:33:41 consul: Datacenter: 'dc1' (Segment: '') 17:33:41 consul: Server: true (Bootstrap: false) 17:33:41 consul: Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: 8502, DNS: 8600) 17:33:41 consul: Cluster Addr: 127.0.0.1 (LAN: 8301, WAN: 8302) 17:33:41 consul: Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false 17:33:41 consul: ==> Log data will now stream in as it occurs: 17:33:41 Ports for container: e2e-mimir-test-minio-9000 Mapping: map[9000:32792] 17:33:41 Started minio-9000 17:33:41 Ports for container: e2e-mimir-test-consul Mapping: map[8500:32791] 17:33:41 Started consul 17:33:43 Starting querier 17:33:43 Starting distributor 17:33:43 Starting ingester 17:33:43 Ports for container: e2e-mimir-test-distributor Mapping: map[8080:32793] 17:33:43 Started distributor 17:33:43 Ports for container: e2e-mimir-test-querier Mapping: map[8080:32794] 17:33:43 Started querier 17:33:44 Ports for container: e2e-mimir-test-ingester Mapping: map[8080:32795] 17:33:44 Started ingester distributor_high_concurrency_test.go:122: Error Trace: /home/runner/work/mimir/mimir/integration/distributor_high_concurrency_test.go:122 /opt/hostedtoolcache/go/1.23.1/x64/src/runtime/asm_amd64.s:1700 Error: Not equal: expected: 200 actual : 400 Test: TestDistributorHighConcurrency/caching_unmarshal_data=true,_pooling_write_requests=false Messages: 400 Bad Request 17:33:45 distributor: ts=2024-09-27T17:33:45.295229642Z caller=push.go:211 level=error user=e2e-user msg="detected an error while ingesting Prometheus remote-write request (the request may have been partially ingested)" httpCode=400 err="send data to ingesters: failed pushing to ingester ingester: user=e2e-user: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-27T17:00:00Z and is from series series_2 (sampled 1/10)" insight=true 17:33:45 ingester: ts=2024-09-27T17:33:45.291948928Z caller=grpc_logging.go:76 level=warn method=/cortex.Ingester/Push duration=28.458284ms msg=gRPC err="user=e2e-user: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-09-27T17:00:00Z and is from series series_2 (sampled 1/10)" 17:33:59 querier: ts=2024-09-27T17:33:59.87882515Z caller=loader.go:120 level=warn msg="bucket index not found" user=e2e-user 17:34:00 Killing ingester 17:34:00 Killing consul 17:34:00 Killing distributor 17:34:00 Killing minio-9000 17:34:00 Killing querier 17:34:00 Killed consul 17:34:00 Killed minio-9000 17:34:00 Killed ingester 17:34:00 Killed querier 17:34:00 Killed distributor ```

charleskorn commented 1 month ago

And again

dimitarvdimitrov commented 1 month ago
Details

``` === RUN TestDistributorHighConcurrency/caching_unmarshal_data=true,_pooling_write_requests=false 12:20:16 Starting minio-9000 12:20:16 Starting consul 12:20:16 consul: ==> Starting Consul agent... 12:20:16 consul: Version: '1.8.15' 12:20:16 consul: Node ID: '78801d03-61fe-b399-2727-368667a453b7' 12:20:16 consul: Node name: 'consul' 12:20:16 consul: Datacenter: 'dc1' (Segment: '') 12:20:16 consul: Server: true (Bootstrap: false) 12:20:16 consul: Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: 8502, DNS: 8600) 12:20:16 consul: Cluster Addr: 127.0.0.1 (LAN: 8301, WAN: 8302) 12:20:16 consul: Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false 12:20:16 consul: ==> Log data will now stream in as it occurs: 12:20:17 Ports for container: e2e-mimir-test-consul Mapping: map[8500:32792] 12:20:17 Started consul 12:20:17 Ports for container: e2e-mimir-test-minio-9000 Mapping: map[9000:32791] 12:20:17 Started minio-9000 12:20:18 Starting querier 12:20:18 Starting distributor 12:20:18 Starting ingester 12:20:19 Ports for container: e2e-mimir-test-querier Mapping: map[8080:32793] 12:20:19 Started querier 12:20:19 Ports for container: e2e-mimir-test-ingester Mapping: map[8080:32795] 12:20:19 Started ingester 12:20:19 Ports for container: e2e-mimir-test-distributor Mapping: map[8080:32794] 12:20:19 Started distributor distributor_high_concurrency_test.go:122: Error Trace: /home/runner/work/mimir/mimir/integration/distributor_high_concurrency_test.go:122 /opt/hostedtoolcache/go/1.23.2/x64/src/runtime/asm_amd64.s:1700 Error: Not equal: expected: 200 actual : 400 Test: TestDistributorHighConcurrency/caching_unmarshal_data=true,_pooling_write_requests=false Messages: 400 Bad Request distributor_high_concurrency_test.go:122: Error Trace: /home/runner/work/mimir/mimir/integration/distributor_high_concurrency_test.go:122 /opt/hostedtoolcache/go/1.23.2/x64/src/runtime/asm_amd64.s:1700 Error: Not equal: expected: 200 actual : 400 Test: TestDistributorHighConcurrency/caching_unmarshal_data=true,_pooling_write_requests=false Messages: 400 Bad Request 12:20:21 distributor: ts=2024-10-15T12:20:21.058813748Z caller=push.go:221 level=error user=e2e-user msg="detected an error while ingesting Prometheus remote-write request (the request may have been partially ingested)" httpCode=400 err="send data to ingesters: failed pushing to ingester ingester: user=e2e-user: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-10-15T12:00:00Z and is from series series_3 (sampled 1/10)" insight=true 12:20:21 distributor: ts=2024-10-15T12:20:21.060842093Z caller=push.go:221 level=error user=e2e-user traceID=321aca90889fb3d0 msg="detected an error while ingesting Prometheus remote-write request (the request may have been partially ingested)" httpCode=400 err="send data to ingesters: failed pushing to ingester ingester: user=e2e-user: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-10-15T12:00:00Z and is from series series_21 (sampled 1/10)" insight=true 12:20:21 ingester: ts=2024-10-15T12:20:21.056706556Z caller=grpc_logging.go:76 level=warn method=/cortex.Ingester/Push duration=26.222634ms msg=gRPC err="user=e2e-user: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-10-15T12:00:00Z and is from series series_3 (sampled 1/10)" 12:20:34 querier: ts=2024-10-15T12:20:34.78086082Z caller=loader.go:120 level=warn msg="bucket index not found" user=e2e-user 12:20:35 Killing distributor 12:20:35 Killing consul 12:20:35 Killing querier 12:20:35 Killing ingester 12:20:35 Killing minio-9000 12:20:35 Killed consul 12:20:35 Killed querier 12:20:35 Killed minio-9000 12:20:35 Killed ingester 12:20:35 Killed distributor ```