This happens rarely, but it does happen, so the case needs to be handled
explicitly to reach >99.999% reliability.
Repro:
1. Use /content-gs/pre-upload/... to check for object presence.
2. Upload each large missing object to Cloud Storage.
3. Verification is done asynchronously via task queues afterward.
4. Although the Cloud Storage upload reported success, Cloud Storage fails to
serve the file.
Expected:
The isolate.py / isolateserver.py client code ensures that the file is
accessible for download before concluding that the upload succeeded.
Actual:
Verification is done asynchronously after the upload, so uploaded files may
"disappear" after the upload or never become downloadable. Swarming tasks that
need such a file then fail to retrieve it, which causes cascading failures.
Action Item:
- Add a new isolate server endpoint to ensure the entities are in a verified
state, e.g. ContentEntry.is_verified == True for each item uploaded.
https://code.google.com/p/swarming/source/browse/services/isolate/model.py#67
https://code.google.com/p/swarming/source/browse/services/isolate/handlers_frontend.py#841
- Change isolateserver.py to use this new endpoint, and block until the
verification is complete before reporting the upload as successful.
https://code.google.com/p/swarming/source/browse/isolateserver.py?repo=client#463
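
The client-side half of the action items could look roughly like the sketch
below. It is only an illustration, not the isolateserver.py implementation:
wait_until_verified and fetch_verified are hypothetical names, and
fetch_verified stands in for the proposed endpoint, assumed here to take a set
of digests and return those whose ContentEntry.is_verified is True. The
timeout and backoff values are arbitrary, and Python 3 is assumed.

```python
import time


def wait_until_verified(digests, fetch_verified, timeout=60.0, interval=1.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Polls until every digest is reported verified or the timeout expires.

    fetch_verified is a hypothetical callable standing in for the proposed
    endpoint: it receives the set of pending digests and returns the ones
    that are now verified.

    Returns the set of digests still unverified (empty on success).
    """
    pending = set(digests)
    deadline = clock() + timeout
    delay = interval
    while pending:
        # Drop everything the server now reports as verified.
        pending -= set(fetch_verified(pending))
        if not pending or clock() >= deadline:
            break
        sleep(delay)
        delay = min(delay * 2, 30.0)  # exponential backoff, capped at 30 s
    return pending
```

A caller would treat a non-empty return value as an upload failure and either
re-upload those items or abort, rather than letting a Swarming task discover
the missing file later.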
Original issue reported on code.google.com by maruel@chromium.org on 6 Aug 2014 at 4:53