guardian / grid

The Guardian’s image management system
https://www.theguardian.com/info/developer-blog/2015/aug/12/open-sourcing-grid-image-service
Apache License 2.0
1.44k stars 119 forks source link

[reaper] use bulk API for soft/hard deleting batches from ES, so we don't need follow-up query to confirm success in ES #4151

Closed twrichards closed 9 months ago

twrichards commented 9 months ago

Small but important follow-up to https://github.com/guardian/grid/pull/4145 - where it looks like the delete_by_query was being processed asynchronously and so the guard added in https://github.com/guardian/grid/pull/4145/commits/d3c8b0a9edf5a87a3cce807eaeded5487b633e61 was firing, even though the ES record was disappearing - this in turn meant the S3 files will have been orphaned off.

In this PR we...

  • ~explicitly 'wait for completion' of the update_by_query and delete_by_query~
  • ~explicitly verify the IDs actually updated / deleted with subsequent query~
  • ~don't throw, but log any discrepancy, since we now continue to process the ones which were successfully updated / deleted~

... use the bulk API instead because it waits on completion of the actions automatically and returns the success/error of each - meaning we don't need a follow-up query to verify.

andrew-nowak commented 9 months ago

well the ES docs are super unclear on the point but I'm 99.9% certain the default is already wait_for_completion=true - try on TEST cerebro on the empty index with a match all query and no update step, and try with and without ?wait_for_completion=false (and true) - you have a completely different response body if wait_for_completion is false (just a task id which you need to feed into the tasks api) so the response here would fail to parse https://github.com/guardian/grid/blob/c4b4ebe0aec38419035a6402b2221298cf162e73/thrall/app/lib/elasticsearch/ElasticSearch.scala#L331C13-L331C29

twrichards commented 9 months ago

well the ES docs are super unclear on the point but I'm 99.9% certain the default is already wait_for_completion=true - try on TEST cerebro on the empty index with a match all query and no update step, and try with and without ?wait_for_completion=false (and true) - you have a completely different response body if wait_for_completion is false (just a task id which you need to feed into the tasks api) so the response here would fail to parse c4b4ebe/thrall/app/lib/elasticsearch/ElasticSearch.scala#L331C13-L331C29

it might be the default for the update_by_query, but I added it to be explicit that it should be a synchronous call and the soft delete endpoint is working fine, consider also that the return type is just UpdateByQueryResponse in elastic4s. Curiously the return type of delete_by_query in elastic4s is an Either[DeleteByQueryResponse, CreateTaskResponse] and the previous implementation which maps left projection suggests that its CreateTaskResponse so I'm hoping this PR will ensure its a Left (i.e. DeleteByQueryResponse) and not a Right.

twrichards commented 9 months ago

well the ES docs are super unclear on the point but I'm 99.9% certain the default is already wait_for_completion=true - try on TEST cerebro on the empty index with a match all query and no update step, and try with and without ?wait_for_completion=false (and true) - you have a completely different response body if wait_for_completion is false (just a task id which you need to feed into the tasks api) so the response here would fail to parse c4b4ebe/thrall/app/lib/elasticsearch/ElasticSearch.scala#L331C13-L331C29

it might be the default for the update_by_query, but I added it to be explicit that it should be a synchronous call and the soft delete endpoint is working fine, consider also that the return type is just UpdateByQueryResponse in elastic4s. Curiously the return type of delete_by_query in elastic4s is an Either[DeleteByQueryResponse, CreateTaskResponse] and the previous implementation which maps left projection suggests that its CreateTaskResponse so I'm hoping this PR will ensure its a Left (i.e. DeleteByQueryResponse) and not a Right.

actually the type situation is baffling

image
twrichards commented 9 months ago

tested on TEST

prout-bot commented 9 months ago

Seen on cropper (merged by @twrichards 8 minutes and 53 seconds ago) Please check your changes!

prout-bot commented 9 months ago

Seen on collections (merged by @twrichards 9 minutes ago) Please check your changes!

prout-bot commented 9 months ago

Seen on auth, image-loader (merged by @twrichards 9 minutes and 5 seconds ago) Please check your changes!

prout-bot commented 9 months ago

Seen on leases, usage, kahuna (merged by @twrichards 9 minutes and 12 seconds ago) Please check your changes!

prout-bot commented 9 months ago

Seen on thrall, kahuna (merged by @twrichards 9 minutes and 13 seconds ago) Please check your changes!

prout-bot commented 9 months ago

Seen on metadata-editor, leases, usage (merged by @twrichards 9 minutes and 19 seconds ago) Please check your changes!

prout-bot commented 9 months ago

Seen on media-api (merged by @twrichards 9 minutes and 30 seconds ago) Please check your changes!