opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.78k stars 1.82k forks source link

[BUG] Derived Fields feature causes search query performance regression on indices with many field mappings #16603

Open jicohen opened 3 days ago

jicohen commented 3 days ago

Describe the bug

When upgrading to 2.15.0 we experienced a significant search query performance regression in our test suite. After comparing profiling output from the same test in 2.14.0, it turned out to be the result of the new 'derived fields' feature. Specifically, it appears that for every field referenced in a query it iterates over every field defined in the index mappings. Until a few days ago each iteration performed an expensive string comparison.

    @Override
    public Set<String> resolvePattern(String pattern) {
        Set<String> derivedFields = new HashSet<>();
        if (queryShardContext != null && queryShardContext.getMapperService() != null) {
            for (MappedFieldType fieldType : queryShardContext.getMapperService().fieldTypes()) {
                if (Regex.simpleMatch(pattern, fieldType.name()) && fieldType instanceof DerivedFieldType) {
                    derivedFields.add(fieldType.name());
                }
            }
        }
        for (String fieldName : derivedFieldTypeMap.keySet()) {
            if (Regex.simpleMatch(pattern, fieldName)) {
                derivedFields.add(fieldName);
            }
        }
        return derivedFields;
    }

https://github.com/opensearch-project/OpenSearch/blob/61dbcd0795c9bfe9b81e5762175414bc38bbcadf/server/src/main/java/org/opensearch/index/mapper/DefaultDerivedFieldResolver.java#L69

Our use case is a little unusual in that we have a large number (~10,000) of fields defined in our mappings and we issue large queries (~150 clauses). This issue added over a second of overhead to each query we performed (~7 ms per clause) in our testing. This slowdown was present despite not using the derived field type feature at all - each referenced field in the query would cause an iteration through the entire set of mapped fields. This issue was partially addressed on the main branch a few days ago as mentioned above - by short circuiting the expensive string comparison operation if the field mapping is not a derived field type. However, the iteration itself is still present. It should be possible to precompute the set of derived fields outside of the context of a particular query (e.g. add a method MapperService.derivedFieldTypes()) and iterate over only those and not every mapped field in the index.

As a workaround, for now we have disabled the 'derived fields' feature which replaces the DefaultDerivedFieldResolver with the NoOpDerivedFieldResolver.

Related component

Search:Performance

To Reproduce

  1. On a 2.14.0 and 2.15.0 cluster do the following
  2. Create an index with a large (e.g. ~10,000) number of field mappings
  3. Issue a large boolean query with match clauses on many fields (e.g. 200)
  4. Observe the slowness of the query on 2.15.0 vs 2.14.0
  5. Run the OpenSearch clusters in a profiler to observe where the extra overhead is coming from

Expected behavior

I would expect search query performance for the scenario described above (large number of defined field mappings and large number of query clauses in the search query) to perform comparably in the newer releases of OpenSearch as it did in 2.14.0.

Additional Details

Plugins

**Host/Environment (please complete the following information):** - OS: Ubuntu - Version 2.15.0
dbwiddis commented 2 days ago

Relevant PR: https://github.com/opensearch-project/OpenSearch/pull/13720 CC: @rishabhmaurya @msfroh

msfroh commented 2 days ago

@rishabhmaurya -- maybe we could define an implicit index setting that uses the NoOpDerivedFieldResolver unless at least one derived field is in the mapping? (That is, we set a Boolean flag at the index level that flips to true if/when a derived field is added to the mapping. Alternatively, if the iteration can only iterate over derived fields, it should be a no-op if no derived fields exist.)

What do you think?

sandeshkr419 commented 7 hours ago

@rishabh6788 Do we have something in OSB to capture/monitor this?

rishabhmaurya commented 7 hours ago

@jicohen thanks for reporting. @msfroh sounds like a plan

rishabh6788 commented 6 hours ago

@rishabh6788 Do we have something in OSB to capture/monitor this?

At this point I don't think there is any workload that supports this use-case.