spring-projects / spring-data-elasticsearch

Provides support to increase developer productivity in Java when using Elasticsearch. Uses familiar Spring concepts such as template classes for core API usage and lightweight repository-style data access.
https://spring.io/projects/spring-data-elasticsearch/
Apache License 2.0

Delete by query causing fielddata cache spike leading to 429 #2550

Open blacar opened 1 year ago

blacar commented 1 year ago

This ticket is the result of two weeks of experiments.

I'll try to include all the information, because there might be something wrong with how RestHighLevelClient performs deleteByQuery. For two weeks I was betting that the problem was on my side or on the Elastic side (performance, configuration), but after several experiments I have no explanation left, so I need to present this to you.

First of all, I have prior experience with Elastic and I am aware that updates and deletes are expensive operations; this issue is not about that.

CONTEXT

We are using RestHighLevelClient configured like this:

  public RestHighLevelClient elasticsearchClient() {
    final HttpHeaders compatibilityHeaders = new HttpHeaders();
    compatibilityHeaders.add("Accept", "application/vnd.elasticsearch+json;compatible-with=7");
    compatibilityHeaders.add("Content-Type", "application/vnd.elasticsearch+json;"
      + "compatible-with=7");
    final ClientConfiguration clientConfiguration = ClientConfiguration.builder()
      .connectedTo(eshostname + ":" + esport)
      .usingSsl()
      .withBasicAuth(username, password)
      .withDefaultHeaders(compatibilityHeaders)
      .build();
    return RestClients.create(clientConfiguration).rest();
  }

As said, we do many ingest and query operations. For example:

    final BoolQueryBuilder boolQuery = QueryBuilders
      .boolQuery()
      .filter(QueryBuilders.matchQuery(SEARCH_FIELD_1, s1))
      .filter(QueryBuilders.matchQuery(SEARCH_FIELD_2, s2))
      .filter(QueryBuilders.rangeQuery(SEARCH_FIELD_3).lte(s3));
    final NativeSearchQuery nsq = new NativeSearchQuery(boolQuery);
    nsq.addSort(Sort.by(Direction.DESC, CREATED_SEARCH_FIELD));
    nsq.setMaxResults(size);
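For completeness, a query like the one above is then run through the template, roughly as follows. This is a sketch only: `template` stands for an injected `ElasticsearchRestTemplate`, and `MyDocument` stands in for our actual entity class.

```java
// Sketch: "template" is an injected ElasticsearchRestTemplate and
// "MyDocument" is a placeholder for our actual @Document entity class.
final SearchHits<MyDocument> hits = template.search(nsq, MyDocument.class);
hits.forEach(hit -> process(hit.getContent()));
```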

We also do updateByQuery operations, like this:

    final BoolQueryBuilder boolQuery = QueryBuilders.boolQuery()
      .filter(QueryBuilders.matchQuery(SEARCH_FIELD_1, s1))
      .filter(QueryBuilders.rangeQuery(SEARCH_FIELD_3).lt(s3))
      .filter(QueryBuilders.matchQuery(SEARCH_FIELD_2, s2));
    final NativeSearchQuery nsq = new NativeSearchQuery(boolQuery);
    return UpdateQuery.builder(nsq)
      .withScriptType(ScriptType.INLINE)
      .withScript(UPDATE_SCRIPT)
      .withParams(UPDATE_PARAMS)
      .build();

The update script looks like this:

"ctx._source.FIELD_4 = params.FIELD_4; ctx._source.FIELD_5 = params.FIELD_5; ctx._source.FIELD_6 = params.FIELD_6; ctx._source.FIELD_3 = params.FIELD_3"

Finally, we do deleteByQuery operations with the same query as the update operations. Of course there is no script in that case.
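The delete call itself is a plain template delete with the same `NativeSearchQuery` (a sketch; entity class and index name are illustrative):

```java
// Sketch: same bool query as for the updates, no script.
// "MyDocument" and the index name are placeholders.
template.delete(nsq, MyDocument.class, IndexCoordinates.of("my-index"));
```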

ISSUE

All operations run like a charm except deleteByQuery. The moment deleteByQuery is enabled (even though deletes are just a fraction of the traffic, and there are far more UPDATE operations), the cluster starts to get into trouble. ALL delete operations time out, although the records are removed from the cluster. The fielddata cache starts to grow significantly, eventually causing GC usage and duration to spike, then CPU to spike, and finally the [parent] circuit breaker to trip, at which point the cluster starts responding 429 Too Many Requests to our operations.

This is independent of the size of the delete query's result set: delete queries matching just 1 or 2 documents cause the same effect. Please remember that the number of delete queries is small.

This only happens on deletes. If I replace the deletes with updates (using the same query and a script that updates four fields) the cluster is stable. This alone is very weird to me, since updates are expected to be more expensive than deletes.

NOTE: If I bypass spring-data-elasticsearch and send the delete operations as direct POST HTTP requests through a Feign client, without the RestHighLevelClient, the cluster is stable. This leads me to think that there might be something wrong with the delete requests that RestHighLevelClient is sending. It feels like something is not being closed (connection timeout).
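For reference, the direct HTTP call that works fine is essentially a plain POST to the `_delete_by_query` endpoint with the same bool query and no extra URL parameters. Sketched with Elasticsearch's low-level `RestClient` (not our actual Feign code; index name and query body are illustrative):

```java
// Sketch using the low-level RestClient: a plain POST to _delete_by_query
// with no refresh parameter. Index name and query body are illustrative.
Request request = new Request("POST", "/my-index/_delete_by_query");
request.setJsonEntity(
    "{\"query\":{\"bool\":{\"filter\":[{\"match\":{\"field1\":\"v1\"}}]}}}");
Response response = lowLevelRestClient.performRequest(request);
```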

Here are some screenshots:

Timeout exception on ALL delete operations

org.springframework.dao.DataAccessResourceFailureException: 5,000 milliseconds timeout on connection http-outgoing-222 [ACTIVE]; nested exception is java.lang.RuntimeException: 5,000 milliseconds timeout on connection http-outgoing-222 [ACTIVE]
    at org.springframework.data.elasticsearch.core.ElasticsearchExceptionTranslator.translateExceptionIfPossible(ElasticsearchExceptionTranslator.java:75)
    at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.translateException(ElasticsearchRestTemplate.java:402)
    at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.execute(ElasticsearchRestTemplate.java:385)
    at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.delete(ElasticsearchRestTemplate.java:224)
    at com.xxx.xxx.service.xxx.deleteByQuery(xxx.java:380)

Metrics when deletes are enabled
(we disable updates at the same time, so 100% of the spikes are related to deletes)

sothawo commented 1 year ago

It might be worth adding an intercepting proxy to the setup to capture the exact request that is sent out by the delete-by-query operation.

Spring Data Elasticsearch 4.2 is outdated and has been out of maintenance for over a year now. The last version of the 4.x line (4.4.x) reached EOL last week.

Looking at the code in the 5.0 branch that still uses the (by then already deprecated) RestHighLevelClient, I can see that the refresh parameter for the delete request is set to true; that might be causing the problem.
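In the RestHighLevelClient API, the flag in question lives on `DeleteByQueryRequest` (via `AbstractBulkByScrollRequest`). What the framework builds is effectively equivalent to this (an illustrative sketch, not the actual framework code; index and field names are placeholders):

```java
// Illustrative: with refresh set to true, every delete-by-query forces the
// affected shards to refresh once the request completes.
DeleteByQueryRequest request = new DeleteByQueryRequest("my-index");
request.setQuery(QueryBuilders.boolQuery()
    .filter(QueryBuilders.matchQuery("field1", "v1")));
request.setRefresh(true); // <-- the parameter suspected of causing the load
```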

Can you reproduce this in a setup using the maintained versions (5.0 or 5.1)? They both still allow the old client to be used. Or better, can you switch to a supported version and use the current Elasticsearch client?
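For reference, in the 5.x releases the client is configured by extending `ElasticsearchConfiguration` rather than creating a `RestHighLevelClient` bean yourself. A minimal sketch (host and credentials are placeholders):

```java
// Minimal 5.x configuration sketch; host and credentials are placeholders.
@Configuration
public class EsConfig extends ElasticsearchConfiguration {

    @Override
    public ClientConfiguration clientConfiguration() {
        return ClientConfiguration.builder()
            .connectedTo("my-es-host:9200")
            .usingSsl()
            .withBasicAuth("user", "password")
            .build();
    }
}
```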

blacar commented 1 year ago

Yeah ... if the refresh parameter is set, then I can understand that every delete request might be triggering an index refresh, which is very likely the reason for the overload. I don't know the relation between the index refresh operation and the fielddata cache, but that's something on the Elastic side.

I would try the maintained versions, but if the refresh param is still there I would expect the same behavior. I will stay on Feign for the deletes until I am ready to switch to the current Elasticsearch client.

I will ping back if I find something more.

Thxs!