pombreda / swarming

Automatically exported from code.google.com/p/swarming
Apache License 2.0

Create new endpoint for Isolate Server to wait for all objects to become verified. #135

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Cloud Storage occasionally reports a successful upload but then fails to serve 
the file. This is rare, but it must be handled explicitly to reach >99.999% 
reliability.

Repro:
1. Use /content-gs/pre-upload/... to check for object presence.
2. Upload each large missing object to Cloud Storage.
3. Verification is done asynchronously via task queues afterward.
4. Although the Cloud Storage upload reported success, Cloud Storage fails to serve the file.

Expected:
The isolate.py / isolateserver.py client code ensures that the file is 
accessible for download before concluding that the upload succeeded.
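
The expected client behavior could be sketched as a polling loop. This is a 
minimal Python 3 sketch, not code from isolateserver.py: `wait_until_verified`, 
the `fetch_unverified` callable, and the timeout and interval values are all 
hypothetical names and defaults chosen for illustration.

```python
import time


def wait_until_verified(digests, fetch_unverified, timeout=60.0, interval=1.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Blocks until every digest is reported verified, or raises on timeout.

    fetch_unverified is a hypothetical callable: it takes a collection of
    digests and returns the ones the server has not verified yet.
    """
    pending = set(digests)
    deadline = clock() + timeout
    while pending:
        # Only ask about the digests that are still outstanding.
        pending = set(fetch_unverified(pending))
        if not pending:
            break
        if clock() >= deadline:
            raise TimeoutError('not verified: %s' % sorted(pending))
        sleep(interval)
    return True
```

The upload would be reported as successful only after this call returns; a 
timeout would be surfaced as an upload failure instead of a later download 
failure in a Swarming task.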

Actual:
Verification is done asynchronously after the upload, so uploaded files may 
"disappear" after the upload or never become downloadable. Swarming tasks that 
need such a file then fail to retrieve it, which causes cascading failures.

Action Item:
- Add new isolate server endpoint to ensure the entities are in a verified 
state, e.g. ContentEntry.is_verified == True for each item uploaded.
https://code.google.com/p/swarming/source/browse/services/isolate/model.py#67
https://code.google.com/p/swarming/source/browse/services/isolate/handlers_frontend.py#841

- Change isolateserver.py to use this new endpoint, and block until the 
verification is complete before reporting the upload as successful.
https://code.google.com/p/swarming/source/browse/isolateserver.py?repo=client#463
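
The server-side check behind such an endpoint might look roughly like the 
following. This is a sketch under stated assumptions: the `ContentEntry` class 
here is a simplified stand-in for the real datastore model (only the 
`is_verified` field comes from this issue), and `unverified_digests` and the 
dict-based `store` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ContentEntry:
    """Stand-in for the datastore model; only is_verified is from the issue."""
    digest: str
    is_verified: bool = False


def unverified_digests(store, digests):
    """Returns the digests that are missing from store or not yet verified.

    store is a plain dict mapping digest -> ContentEntry, standing in for
    the real datastore lookup.
    """
    result = []
    for digest in digests:
        entry = store.get(digest)
        if entry is None or not entry.is_verified:
            result.append(digest)
    return result
```

An endpoint built this way would return an empty list once every uploaded item 
has `is_verified == True`, which gives the client a clear signal that it can 
stop waiting.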

Original issue reported on code.google.com by maruel@chromium.org on 6 Aug 2014 at 4:53

GoogleCodeExporter commented 9 years ago

Original comment by maruel@chromium.org on 6 Aug 2014 at 4:54