Open dimitarvdimitrov opened 10 months ago
another two instances that happened in the same CI run
=== RUN TestDistributorHighConcurrency/caching_unmarshal_data=false,_pooling_write_requests=false
11:58:01 Started distributor
distributor_high_concurrency_test.go:123:
Error Trace: /home/runner/work/mimir/mimir/integration/distributor_high_concurrency_test.go:123
/opt/hostedtoolcache/go/1.21.6/x64/src/runtime/asm_amd64.s:[165](https://github.com/grafana/mimir/actions/runs/7654088514/job/20857423630?pr=7205#step:15:166)0
Error: Not equal:
expected: 200
actual : 400
Test: TestDistributorHighConcurrency/caching_unmarshal_data=false,_pooling_write_requests=false
Messages: 400 Bad Request
11:58:03 ingester: ts=2024-01-25T11:58:03.226590875Z caller=grpc_logging.go:66 level=warn method=/cortex.Ingester/Push duration=29.693378ms msg=gRPC err="user=e2e-user: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-01-25T11:20:00Z and is from series series_18"
11:58:03 distributor: ts=2024-01-25T11:58:03.230084908Z caller=push.go:[171](https://github.com/grafana/mimir/actions/runs/7654088514/job/20857423630?pr=7205#step:15:172) level=error user=e2e-user msg="push error" err="failed pushing to ingester ingester: user=e2e-user: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-01-25T11:20:00Z and is from series series_18"
11:58:17 querier: ts=2024-01-25T11:58:17.467712012Z caller=loader.go:116 level=warn msg="bucket index not found" user=e2e-user
11:58:17 Killing distributor
=== RUN TestDistributorHighConcurrency/caching_unmarshal_data=false,_pooling_write_requests=true
11:58:22 Started distributor
distributor_high_concurrency_test.go:123:
Error Trace: /home/runner/work/mimir/mimir/integration/distributor_high_concurrency_test.go:123
/opt/hostedtoolcache/go/1.21.6/x64/src/runtime/asm_amd64.s:1650
Error: Not equal:
expected: [200](https://github.com/grafana/mimir/actions/runs/7654088514/job/20857423630?pr=7205#step:15:201)
actual : 400
Test: TestDistributorHighConcurrency/caching_unmarshal_data=false,_pooling_write_requests=true
Messages: 400 Bad Request
11:58:22 distributor: ts=[202](https://github.com/grafana/mimir/actions/runs/7654088514/job/20857423630?pr=7205#step:15:203)4-01-25T11:58:22.848469548Z caller=push.go:171 level=error user=e2e-user msg="push error" err="failed pushing to ingester ingester: user=e2e-user: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-01-25T11:20:00Z and is from series series_17"
11:58:22 ingester: ts=2024-01-25T11:58:22.845433314Z caller=grpc_logging.go:66 level=warn method=/cortex.Ingester/Push duration=29.989559ms msg=gRPC err="user=e2e-user: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-01-25T11:20:00Z and is from series series_17"
11:58:37 querier: ts=2024-01-25T11:58:37.722351814Z caller=loader.go:116 level=warn msg="bucket index not found" user=e2e-user
11:58:38 Killing distributor
The test works like this: it computes 20 minute window in which it will write 25 independant series, each with 1000 samples, spaced regularly (1.2s between sample).
20 minute window is computed as:
From the logs we can see that all failures fail to write sample with "start" timestamp:
ts=2024-01-04T01:16:44.537758655Z ... the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-01-04T00:40:00Z and is from series series_9"
ts=2024-04-09T01:51:29.149945902Z ... the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-04-09T01:20:00Z and is from series series_11 (sampled 1/10)"
Theory: I suspect that writing previous series somehow bumps min valid time after "start", but I don't quite understand why that would happen.
still happening
happened again
```
=== RUN TestDistributorHighConcurrency/caching_unmarshal_data=true,_pooling_write_requests=false
17:33:41 Starting minio-9000
17:33:41 Starting consul
17:33:41 consul: ==> Starting Consul agent...
17:33:41 consul: Version: '1.8.15'
17:33:41 consul: Node ID: 'cf6bf76d-4c43-66ca-60f0-1e338ac1c01c'
17:33:41 consul: Node name: 'consul'
17:33:41 consul: Datacenter: 'dc1' (Segment: '
```
=== RUN TestDistributorHighConcurrency/caching_unmarshal_data=true,_pooling_write_requests=false
12:20:16 Starting minio-9000
12:20:16 Starting consul
12:20:16 consul: ==> Starting Consul agent...
12:20:16 consul: Version: '1.8.15'
12:20:16 consul: Node ID: '78801d03-61fe-b399-2727-368667a453b7'
12:20:16 consul: Node name: 'consul'
12:20:16 consul: Datacenter: 'dc1' (Segment: '
Example failure
Details
``` === RUN TestDistributorHighConcurrency/caching_unmarshal_data=true,_pooling_write_requests=false 01:16:40 Starting minio-9000 01:16:40 Starting consul 01:16:40 consul: ==> Starting Consul agent... 01:16:40 consul: Version: '1.8.15' 01:16:40 consul: Node ID: 'bdd6e52f-c053-d1fb-faf1-3936b37334ff' 01:16:40 consul: Node name: 'consul' 01:16:40 consul: Datacenter: 'dc1' (Segment: '')
01:16:40 consul: Server: true (Bootstrap: false)
01:16:40 consul: Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: 8502, DNS: 8600)
01:16:40 consul: Cluster Addr: 127.0.0.1 (LAN: 8301, WAN: 8302)
01:16:40 consul: Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false
01:16:40 consul: ==> Log data will now stream in as it occurs:
01:16:40 Ports for container: e2e-mimir-test-consul Mapping: map[8500:32785]
01:16:40 Started consul
01:16:40 Ports for container: e2e-mimir-test-minio-9000 Mapping: map[9000:32786]
01:16:40 Started minio-9000
01:16:42 Starting querier
01:16:42 Starting distributor
01:16:42 Starting ingester
01:16:43 Ports for container: e2e-mimir-test-distributor Mapping: map[8080:32787]
01:16:43 Started distributor
01:16:43 Ports for container: e2e-mimir-test-querier Mapping: map[8080:32788]
01:16:43 Started querier
01:16:43 Ports for container: e2e-mimir-test-ingester Mapping: map[8080:32789]
01:16:43 Started ingester
distributor_high_concurrency_test.go:123:
Error Trace: /home/runner/work/mimir/mimir/integration/distributor_high_concurrency_test.go:123
/opt/hostedtoolcache/go/1.21.5/x64/src/runtime/asm_amd64.s:1650
Error: Not equal:
expected: 200
actual : 400
Test: TestDistributorHighConcurrency/caching_unmarshal_data=true,_pooling_write_requests=false
Messages: 400 Bad Request
01:16:44 ingester: ts=2024-01-04T01:16:44.53352665Z caller=grpc_logging.go:66 level=warn method=/cortex.Ingester/Push duration=9.211452ms msg=gRPC err="user=e2e-user: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-01-04T00:40:00Z and is from series series_9"
01:16:44 distributor: ts=2024-01-04T01:16:44.537758655Z caller=push.go:171 level=error user=e2e-user msg="push error" err="rpc error: code = Code(400) desc = failed pushing to ingester ingester: user=e2e-user: the sample has been rejected because its timestamp is too old (err-mimir-sample-timestamp-too-old). The affected sample has timestamp 2024-01-04T00:40:00Z and is from series series_9"
01:16:58 querier: ts=2024-01-04T01:16:58.34791[197](https://github.com/grafana/mimir/actions/runs/7404278739/job/20145610140?pr=7046#step:15:198)9Z caller=loader.go:116 level=warn msg="bucket index not found" user=e2e-user
01:16:58 Killing ingester
01:16:58 Killing querier
01:16:58 Killing consul
01:16:58 Killing minio-9000
01:16:58 Killing distributor
01:16:59 Killed consul
01:16:59 Killed ingester
01:16:59 Killed minio-9000
01:16:59 Killed querier
01:16:59 Killed distributor
```