treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0

[Bug]: Flakey ESTI tests #8139

Closed nadavsteindler closed 4 weeks ago

nadavsteindler commented 2 months ago

What happened?

ESTI tests are flaky and often fail developers' pipelines.

Log attached: logs_27863969862.zip

logs_27863969862 (1).zip

On 4/9/2024 I had a number of failures in a row. In the last one, the last test passed and then the esti process itself failed! Log: logs_27919722201.zip

esti-1  | === RUN   TestLakectlFsUpload/dir_without_recursive_to_file
esti-1  |     lakectl_test.go:594: Run shell command 'LAKECTL_CREDENTIALS_ACCESS_KEY_ID=*** LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY=*** LAKECTL_SERVER_ENDPOINT_URL=https://ayti-ci-3727-0a9b9e--esti.us-east-1.lakefscloud.ninja LAKECTL_EXPERIMENTAL_LOCAL_POSIX_PERMISSIONS_ENABLED=false /go/lakeFS/lakectl fs upload -s files/ lakefs://repo-crc0uho5o9spm1tp0k4g/main/data/ro/1.txt'
esti-1  |     lakectl_test.go:594: 
esti-1  |           Error Trace:    /go/lakeFS/esti/lakectl_util.go:172
esti-1  |                                       /go/lakeFS/esti/lakectl_util.go:163
esti-1  |                                       /go/lakeFS/esti/lakectl_test.go:594
esti-1  |           Error:          "\n502 Bad Gateway\n" does not contain "read files/: is a directory"
esti-1  |           Test:           TestLakectlFsUpload/dir_without_recursive_to_file
esti-1  | --- FAIL: TestLakectlFsUpload (2.45s)
esti-1  |     --- PASS: TestLakectlFsUpload/single_file (0.12s)
esti-1  |     --- PASS: TestLakectlFsUpload/single_file_with_separator (0.06s)
esti-1  |     --- PASS: TestLakectlFsUpload/single_file_with_recursive (0.10s)
esti-1  |     --- PASS: TestLakectlFsUpload/dir (0.20s)
esti-1  |     --- PASS: TestLakectlFsUpload/exist_dir (0.07s)
esti-1  |     --- PASS: TestLakectlFsUpload/dir_without_recursive (0.07s)
esti-1  |     --- FAIL: TestLakectlFsUpload/dir_without_recursive_to_file (1.64s)
esti-1  | === RUN   TestLakectlFsPresign
esti-1  |     lakectl_test.go:604: GetStorageConfig failed with stats: 502 Bad Gateway
esti-1  | --- FAIL: TestLakectlFsPresign (0.01s)

Resource leak, 3/9/2024: Looked at this with Guy. It seems that when the job that loads the sample data (batch/job/load-sample-data) fails, the whole environment doesn't get cleaned up. These environments slowly leak and take up all the K8s resources.

We manually cleaned this up, but we can expect these to keep leaking.

We should add a timeout to the sample data shell script
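
A rough sketch of what that timeout could look like, assuming the script runs under bash with GNU coreutils `timeout` available in the image. `run_with_timeout`, `LOAD_TIMEOUT` and the 600-second default are hypothetical names and values for illustration, not taken from the actual sample data script:

```shell
#!/usr/bin/env bash
# Sketch only: bound the runtime of a command so a hung load step fails fast
# instead of holding cluster resources indefinitely.
# LOAD_TIMEOUT and run_with_timeout are hypothetical names.

LOAD_TIMEOUT="${LOAD_TIMEOUT:-600}"  # seconds; assumed default, tune as needed

run_with_timeout() {
  local seconds="$1"
  shift
  # GNU timeout sends SIGTERM after $seconds, then SIGKILL 10s later
  # if the command is still running (-k 10).
  timeout -k 10 "$seconds" "$@"
  local status=$?
  # GNU timeout exits with 124 when the time limit was reached.
  if [ "$status" -eq 124 ]; then
    echo "command timed out after ${seconds}s: $*" >&2
  fi
  return "$status"
}
```

The load step would then be invoked as something like `run_with_timeout "$LOAD_TIMEOUT" ./load-sample-data.sh` (hypothetical path), so a hang turns into a non-zero exit that the cleanup logic can react to. Since this runs as a Kubernetes Job, setting `activeDeadlineSeconds` on the Job spec might be an alternative way to get the same bound.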

Note: destroy-controlplane job fails sometimes:

wait.go:104: [debug] beginning wait for 33 resources to be deleted with timeout of 5m0s
uninstall.go:155: [debug] purge requested for ci-3711-2bd1b8-cloud-control-plane
Error: uninstallation completed with 1 error(s): context deadline exceeded
helm.go:84: [debug] uninstallation completed with 1 error(s): context deadline exceeded
helm.sh/helm/v3/pkg/action.(*Uninstall).Run
    helm.sh/helm/v3/pkg/action/uninstall.go:163
main.newUninstallCmd.func2
    helm.sh/helm/v3/cmd/helm/uninstall.go:60
github.com/spf13/cobra.(*Command).execute
    github.com/spf13/cobra@v1.8.0/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
    github.com/spf13/cobra@v1.8.0/command.go:1115
github.com/spf13/cobra.(*Command).Execute
    github.com/spf13/cobra@v1.8.0/command.go:1039
main.main
    helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
    runtime/proc.go:271
runtime.goexit
    runtime/asm_amd64.s:1695
Error: Process completed with exit code 1.

Contact details

nadav.steindler@treeverse.com

nadavsteindler commented 2 months ago

https://github.com/treeverse/cloud-controlplane/issues/3720