longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] System restore with backing image could fail due to backing image checksum mismatch #9041

Closed yangchiu closed 1 month ago

yangchiu commented 1 month ago

Describe the bug

Recently, test case test_system_backup_and_restore_volume_with_backingimage has been failing on both v1.7.x-head and master-head from time to time:

https://ci.longhorn.io/job/public/job/master/job/sles/job/amd64/job/longhorn-tests-sles-amd64/973/testReport/junit/tests/test_system_backup_restore/test_system_backup_and_restore_volume_with_backingimage_nfs_/

https://ci.longhorn.io/job/public/job/v1.7.x/job/v1.7.x-longhorn-tests-sles-amd64/4/testReport/junit/tests/test_system_backup_restore/test_system_backup_and_restore_volume_with_backingimage_s3_/

https://ci.longhorn.io/job/public/job/v1.7.x/job/v1.7.x-longhorn-upgrade-tests-sles-amd64/5/testReport/junit/tests/test_system_backup_restore/test_system_backup_and_restore_volume_with_backingimage_s3_/

The system restore can remain stuck in the Restoring state indefinitely:

# kubectl get systemrestores.longhorn.io -n longhorn-system
NAME                         STATE       AGE
test-system-restore-8kwef1   Restoring   8h

This is probably caused by a checksum mismatch on the restored backing image. The Current SHA512 Checksum is bd79ab9e6d45abf4f3f0adf552a868074dd235c4698ce7258d521160e0ad79ffe555b94e7d4007add6e1a25f4526885eb25c53ce38f7d344dd4925b9f2cb5d3b, but the Expected SHA512 Checksum is 304f3ed30ca6878e9056ee6f1b02b328239f0d0c2c1272840998212f9734b196371560b3b939037e4f4c2884ce457c2cbc9f0621f4f5d1ca983983c8cdf8cd9a:

[Screenshot: system_restore]

I've also checked the backing image backup in the backup store, and the checksum recorded there is incorrect as well:

# cd storage/backupbucket/backupstore/backupstore/backing-images/backing-images/bi-test                   
sh-4.4# cat backing-image.cfg 
{"Name":"bi-test","Size":"1161728","BlockCount":"1","Checksum":"bd79ab9e6d45abf4f3f0adf552a868074dd235c4698ce7258d521160e0ad79ffe555b94e7d4007add6e1a25f4526885eb25c53ce38f7d344dd4925b9f2cb5d3b","Labels":null,"CompressionMethod":"lz4","CreatedTime":"2024-07-18T02:26:17Z","CompleteTime":"2024-07-18T02:26:19Z","ProcessingBlocks":{"Blocks":{}},"Blocks":[{"Offset":0,"BlockChecksum":"bd79ab9e6d45abf4f3f0adf552a868074dd235c4698ce7258d521160e0ad79ff"}]}
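The comparison described above can be sketched as a small script. This is only an illustration: the helper name `check_backup_cfg` is hypothetical, the config is trimmed to the fields being checked, and the "expected" value is the Expected SHA512 Checksum quoted earlier in this report.

```python
import json

# Trimmed copy of the backing-image.cfg shown above (only the fields checked here).
CFG_TEXT = (
    '{"Name":"bi-test","Size":"1161728","BlockCount":"1",'
    '"Checksum":"bd79ab9e6d45abf4f3f0adf552a868074dd235c4698ce7258d521160e0ad79ff'
    'e555b94e7d4007add6e1a25f4526885eb25c53ce38f7d344dd4925b9f2cb5d3b",'
    '"Blocks":[{"Offset":0,"BlockChecksum":'
    '"bd79ab9e6d45abf4f3f0adf552a868074dd235c4698ce7258d521160e0ad79ff"}]}'
)

# Expected SHA512 checksum reported by the system restore.
EXPECTED_CHECKSUM = (
    "304f3ed30ca6878e9056ee6f1b02b328239f0d0c2c1272840998212f9734b196"
    "371560b3b939037e4f4c2884ce457c2cbc9f0621f4f5d1ca983983c8cdf8cd9a"
)


def check_backup_cfg(cfg_text, expected_checksum):
    """Hypothetical helper: return a list of problems found in a backup config."""
    cfg = json.loads(cfg_text)
    problems = []
    if cfg["Checksum"] != expected_checksum:
        problems.append("checksum mismatch")
    # BlockCount is stored as a string in the config, so convert before comparing.
    if int(cfg["BlockCount"]) != len(cfg["Blocks"]):
        problems.append("block count mismatch")
    return problems


if __name__ == "__main__":
    print(check_backup_cfg(CFG_TEXT, EXPECTED_CHECKSUM))
```

The config is internally consistent (one block, BlockCount of 1), so the only problem flagged is the checksum, which suggests the backup describes a different image rather than a corrupted one.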

To Reproduce

Run test case test_system_backup_and_restore_volume_with_backingimage repeatedly.

Expected behavior

Support bundle for troubleshooting

supportbundle_3802211e-5fd7-4e15-8a9b-e95b927fbf11_2024-07-19T01-42-43Z.zip

Environment

Additional context

derekbit commented 1 month ago

cc @ChanYiLin

ChanYiLin commented 1 month ago

This is quite odd. The test image should be parrot.raw, and when the backup is taken, the correct config should look like the following. In particular, "BlockCount" should be "6":

{"Name":"parrot","Size":"33554432","BlockCount":"6","Checksum":"304f3ed30ca6878e9056ee6f1b02b328239f0d0c2c1272840998212f9734b196371560b3b939037e4f4c2884ce457c2cbc9f0621f4f5d1ca983983c8cdf8cd9a","Labels":null,"CompressionMethod":"lz4","CreatedTime":"2024-07-19T04:12:51Z","CompleteTime":"2024-07-19T04:12:53Z","ProcessingBlocks":{"Blocks":{}},"Blocks":[{"Offset":0,"BlockChecksum":"03060d6b6c4c19737a263979140cb84f7f5f7e53e5333b93e4154f7b62364ed5"},{"Offset":8388608,"BlockChecksum":"c8684f4bb7725397b97a159aea819808ff305f8f3913b9ac59362d991f5705e4"},{"Offset":14680064,"BlockChecksum":"731859029215873fdac1c9f2f8bd25a334abf0f3a9e1b057cf2cacc2826d86b0"},{"Offset":16777216,"BlockChecksum":"46c343666e37a8c6f6a49840a4aecfe5fb29b72fc3dc3013ab351685084f01c3"},{"Offset":23068672,"BlockChecksum":"731859029215873fdac1c9f2f8bd25a334abf0f3a9e1b057cf2cacc2826d86b0"},{"Offset":25165824,"BlockChecksum":"e4f3a9580b7719cc0f4c77185183ffe4cdfd2917f09f1cc998d49d6b82b72d6c"}]}

It seems the test used the wrong backing image from the backup store. But why is there another backing image in the backup store? Did some tests not clean it up?
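One way to look for leftover backing image backups is to scan a locally mounted backup store for `backing-image.cfg` files. The sketch below assumes the store is reachable as a directory (as in the shell session earlier in this issue) and that each config carries "Name" and "Checksum" fields. The function name `list_backing_image_backups` is hypothetical and not part of Longhorn or longhorn-tests.

```python
import json
import tempfile
from pathlib import Path


def list_backing_image_backups(root):
    """Return {name: checksum} for every backing-image.cfg found under root."""
    found = {}
    # rglob walks the tree recursively, so the exact directory depth does not matter.
    for cfg_path in sorted(Path(root).rglob("backing-image.cfg")):
        cfg = json.loads(cfg_path.read_text())
        found[cfg["Name"]] = cfg["Checksum"]
    return found


if __name__ == "__main__":
    # Demo against a throwaway directory that mimics one backed-up image.
    with tempfile.TemporaryDirectory() as tmp:
        image_dir = Path(tmp) / "backing-images" / "bi-test"
        image_dir.mkdir(parents=True)
        (image_dir / "backing-image.cfg").write_text(
            json.dumps({"Name": "bi-test", "Checksum": "example-checksum"})
        )
        print(list_backing_image_backups(tmp))
```

Running this against the real backup store before the test starts would show whether an entry named bi-test already exists and which checksum it carries.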

derekbit commented 1 month ago

It seems the test used the wrong backing image from the backup store. But why is there another backing image in the backup store? Did some tests not clean it up?

Can we make sure all other backing images are thoroughly cleaned up before executing the test?

ChanYiLin commented 1 month ago

Based on my earlier discussion with @roger-ryao about this issue:

Jack: Did it ever fail when run as a single test?

Roger: No, I didn't observe it failing when run as a single test or when executing all test cases in test_system_backup_restore.py.

Jack: And in full regression it just fails sometimes?

Roger: Yep. When it passes, the system restore completes within 50 seconds.

ChanYiLin commented 1 month ago

It seems the test used the wrong backing image from the backup store. But why is there another backing image in the backup store? Did some tests not clean it up?

Can we make sure all other backing images are thoroughly cleaned up before executing the test?

Yes, we can clean up all backup backing images before running the tests.

ChanYiLin commented 1 month ago

Oh, I see: bd79ab9e6d45abf4f3f0adf552a868074dd235c4698ce7258d521160e0ad79ffe555b94e7d4007add6e1a25f4526885eb25c53ce38f7d344dd4925b9f2cb5d3b is the checksum of the parrot.qcow2 image. So maybe a previous test didn't clean up its backup backing image resource, and since the name was always bi-test, the backing image was considered already backed up.
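The suspected failure mode can be illustrated with a toy model. This is not Longhorn's implementation: `FakeBackupStore` and `run_case` are hypothetical names, and the "skip if the name already exists" behaviour is only the hypothesis stated in this comment, modelled with an in-memory dictionary.

```python
class FakeBackupStore:
    """Toy in-memory stand-in for a backup store (assumption, not Longhorn code)."""

    def __init__(self):
        self.backups = {}  # backing image name -> checksum

    def backup(self, name, checksum):
        # Hypothesised behaviour: an existing entry with the same name
        # is treated as already backed up, so nothing is overwritten.
        if name in self.backups:
            return False
        self.backups[name] = checksum
        return True

    def cleanup(self):
        # Remove every backup backing image, as proposed for test setup.
        self.backups.clear()


def run_case(store, name, checksum, clean_first):
    """Back up an image, then report whether a restore would see its checksum."""
    if clean_first:
        store.cleanup()
    store.backup(name, checksum)
    return store.backups[name] == checksum


if __name__ == "__main__":
    store = FakeBackupStore()
    # An earlier test leaves a qcow2 backup behind under the shared name.
    store.backup("bi-test", "checksum-of-parrot.qcow2")
    print(run_case(store, "bi-test", "checksum-of-parrot.raw", clean_first=False))
    print(run_case(store, "bi-test", "checksum-of-parrot.raw", clean_first=True))
```

In this model the stale entry makes the restore see the qcow2 checksum, while cleaning the store first lets the raw image be backed up under the same name. That would also be consistent with the test passing on its own and failing only in full regression.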

longhorn-io-github-bot commented 1 month ago

Pre Ready-For-Testing Checklist

PRs:

roger-ryao commented 1 month ago

Verified on master-head/v1.7.x 20230723

Test steps: https://github.com/longhorn/longhorn/issues/9041#issuecomment-2238664828

Result: passed

After https://github.com/longhorn/longhorn-tests/pull/1989 and https://github.com/longhorn/longhorn-tests/pull/1990 were merged, the test_system_backup_and_restore_volume_with_backingimage test passed in regression.