m-lab / etl-gardener

Gardener provides services for maintaining and reprocessing mlab data.
Apache License 2.0

tracker TestExpiration may be unreliable / flaky #337

Closed stephen-soltesz closed 2 years ago

stephen-soltesz commented 3 years ago

During Cloud Build, the etl-gardener go test -race step for the tracker package failed on two builds and succeeded on a third.

This is a possible race condition.

Step #3 - "Run all gardener unit tests": === RUN   TestExpiration
Step #3 - "Run all gardener unit tests": 2021/08/05 17:18:51 tracker.go:102: DEBUG: Skipping save 2021-08-05 17:18:50.61171094 +0000 UTC m=+1.054659773 2021-08-05 17:18:50.628750616 +0000 UTC m=+1.071699515
Step #3 - "Run all gardener unit tests": 2021/08/05 17:18:51 tracker.go:58: datastore: no such entity /TestExpiration,jobs
Step #3 - "Run all gardener unit tests": 2021/08/05 17:18:52 tracker_test.go:294: job already exists
Step #3 - "Run all gardener unit tests":     tracker_test.go:28: job already exists
Step #3 - "Run all gardener unit tests": --- FAIL: TestExpiration (0.13s)
Step #3 - "Run all gardener unit tests": 2021/08/05 17:18:52 tracker.go:102: DEBUG: Skipping save 2021-08-05 17:18:50.61171094 +0000 UTC m=+1.054659773 2021-08-05 17:18:50.628750616 +0000 UTC m=+1.071699515
Step #3 - "Run all gardener unit tests": 2021/08/05 17:18:52 tracker.go:290: Deleting stale job 20110101:exp/type 127.48681ms 1ms
Step #3 - "Run all gardener unit tests": FAIL
Step #3 - "Run all gardener unit tests": FAIL   github.com/m-lab/etl-gardener/tracker   2.475s

stephen-soltesz commented 3 years ago

However, I cannot reproduce this locally using:

while go test -count=1 -v ./tracker/... ./ops/... -race ; do sleep .1 ; done
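
The loop above runs until the first failure but keeps no per-attempt output and has no upper bound. A minimal sketch of a bounded variant that saves each attempt's log follows. The function name run_until_fail, the LOG_DIR variable, and the attempt-N.log layout are hypothetical and not part of the gardener repo.

```shell
#!/bin/sh
# run_until_fail MAX CMD [ARGS...]
# Runs CMD up to MAX times and stops at the first non-zero exit.
# Combined stdout/stderr of each attempt is written to
# $LOG_DIR/attempt-N.log (hypothetical layout; defaults to ./flaky-logs).
run_until_fail() {
  ruf_max="$1"
  shift
  ruf_dir="${LOG_DIR:-./flaky-logs}"
  mkdir -p "$ruf_dir"

  ruf_i=1
  while [ "$ruf_i" -le "$ruf_max" ]; do
    # Keep the log of every attempt so the failing run can be inspected.
    if ! "$@" >"$ruf_dir/attempt-$ruf_i.log" 2>&1; then
      echo "failed on attempt $ruf_i; see $ruf_dir/attempt-$ruf_i.log"
      return 1
    fi
    ruf_i=$((ruf_i + 1))
  done

  echo "no failure in $ruf_max attempts"
  return 0
}
```

For example: run_until_fail 200 go test -count=1 -v ./tracker/... ./ops/... -race. If the failure only shows up on slower or more heavily loaded build workers, a local loop like this may still never trigger it.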

stephen-soltesz commented 2 years ago

Another potentially flaky failure, this time in cloud/bq/sanity*:

Step #3 - "Run all gardener unit tests": ?      github.com/m-lab/etl-gardener/cloud [no test files]
Step #3 - "Run all gardener unit tests": === RUN   Test_getTableParts
Step #3 - "Run all gardener unit tests": --- PASS: Test_getTableParts (0.00s)
Step #3 - "Run all gardener unit tests": === RUN   TestSanityCheckAndCopy
Step #3 - "Run all gardener unit tests": 2021/11/18 02:00:48 sanity.go:207: googleapi: Error 400: Cannot parse  as CloudRegion., badRequest
Step #3 - "Run all gardener unit tests": 2021/11/18 02:00:48 sanity.go:208: Query: 
Step #3 - "Run all gardener unit tests":        #standardSQL
Step #3 - "Run all gardener unit tests":        SELECT COUNT(DISTINCT test_id) AS TestCount, COUNT(DISTINCT task_filename) AS TaskFileCount
Step #3 - "Run all gardener unit tests":     FROM `dataset.foo_19990101`
Step #3 - "Run all gardener unit tests":          -- where clause
Step #3 - "Run all gardener unit tests": 2021/11/18 02:00:48 sanity.go:113: project:dataset.foo_19990101 foo_19990101
Step #3 - "Run all gardener unit tests":     sanity_test.go:80: googleapi: Error 400: Cannot parse  as CloudRegion., badRequest
Step #3 - "Run all gardener unit tests": --- FAIL: TestSanityCheckAndCopy (0.64s)

Succeeds on retry.

stephen-soltesz commented 2 years ago

TestSanityCheckAndCopy is still flaky...