opensearch-project / security

🔐 Secure your cluster with TLS, numerous authentication backends, data masking, audit logging as well as role-based access control on indices, documents, and fields
https://opensearch.org/docs/latest/security-plugin/index/
Apache License 2.0

[BUG] Field masking has inconsistent memory issues with certain queries #4031

Open peternied opened 8 months ago

peternied commented 8 months ago

What is the bug? We've seen reports of heap memory usage spiking with field masking enabled. Because masking is implemented at the leaf level, it's possible that certain types of queries cause the masked fields to be materialized even when they are not used.

How can one reproduce the bug? Steps to reproduce the behavior:

Baseline behavior

Baseline query with the following format:

final SearchSourceBuilder ssb = new SearchSourceBuilder();
ssb.size(0);
final SearchRequest request = new SearchRequest(INDEX_NAME_PREFIX + "*");
request.source(ssb);

Creating 3 Indices with 5000 Documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 106 | 888,814 | 3,621,248 | 191,136 | 480,198 |
| reader | | 100 | 105 | 808,317 | 1,470,368 | 517,640 | 195,083 |
| reader | ROLE_WITH_NO_MASKING | 100 | 105 | 813,600 | 1,602,272 | 536,416 | 206,571 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 105 | 936,948 | 1,631,088 | 618,704 | 195,653 |
| reader | MASKING_RANDOM_LONG | 100 | 105 | 919,187 | 1,593,456 | 548,264 | 236,892 |
| reader | MASKING_RANDOM_STRING | 100 | 105 | 951,826 | 1,656,432 | 564,784 | 191,008 |

Creating 3 Indices with 50000 Documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 110 | 873,379 | 4,872,552 | 575,208 | 508,483 |
| reader | | 100 | 109 | 753,489 | 4,020,832 | 567,304 | 341,413 |
| reader | ROLE_WITH_NO_MASKING | 100 | 108 | 715,624 | 988,688 | 553,504 | 78,209 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 108 | 970,238 | 1,222,008 | 650,520 | 89,596 |
| reader | MASKING_RANDOM_LONG | 100 | 108 | 942,349 | 1,215,448 | 679,768 | 115,690 |
| reader | MASKING_RANDOM_STRING | 100 | 108 | 971,219 | 1,646,856 | 660,856 | 119,863 |

Query with Aggregate Filter

Query with the following format:

SearchSourceBuilder ssb = new SearchSourceBuilder();
ssb.aggregation(AggregationBuilders.filters("my-filter", QueryBuilders.queryStringQuery("last")));
ssb.aggregation(AggregationBuilders.count("counting").field("genre.keyword"));
ssb.aggregation(AggregationBuilders.avg("averaging").field("longId"));
ssb.size(0);
final SearchRequest request = new SearchRequest(INDEX_NAME_PREFIX + "*");
request.source(ssb);

Creating 3 Indices with 5000 Documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 106 | 1,069,939 | 4,905,008 | 288,144 | 679,704 |
| reader | | 100 | 105 | 877,725 | 1,604,048 | 562,032 | 215,923 |
| reader | ROLE_WITH_NO_MASKING | 100 | 106 | 898,739 | 1,860,632 | 354,032 | 260,686 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 106 | 2,441,865 | 3,504,944 | 2,040,768 | 235,363 |
| reader | MASKING_RANDOM_LONG | 100 | 106 | 2,500,135 | 3,215,712 | 1,984,096 | 247,712 |
| reader | MASKING_RANDOM_STRING | 100 | 106 | 2,414,665 | 3,330,960 | 2,075,848 | 232,634 |

Creating 3 Indices with 50000 Documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 113 | 1,265,168 | 2,698,248 | 993,528 | 301,134 |
| reader | | 100 | 112 | 1,308,585 | 5,408,936 | 851,672 | 579,648 |
| reader | ROLE_WITH_NO_MASKING | 100 | 109 | 1,021,555 | 1,504,480 | 798,400 | 136,735 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 114 | 2,420,922 | 7,183,000 | 1,828,456 | 660,701 |
| reader | MASKING_RANDOM_LONG | 100 | 112 | 2,225,070 | 2,815,568 | 1,994,176 | 142,038 |
| reader | MASKING_RANDOM_STRING | 100 | 111 | 2,169,297 | 2,673,376 | 1,964,144 | 128,983 |

Term Match Query:

Query with the following format:

SearchSourceBuilder ssb = new SearchSourceBuilder();
ssb.aggregation(AggregationBuilders.filters("my-filter",  QueryBuilders.termQuery("title","last")));
ssb.aggregation(AggregationBuilders.count("counting").field("genre.keyword"));
ssb.aggregation(AggregationBuilders.avg("averaging").field("longId"));
ssb.size(0);
final SearchRequest request = new SearchRequest(INDEX_NAME_PREFIX + "*");
request.source(ssb);

Creating 3 Indices with 5000 Documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 106 | 1,154,269 | 2,497,744 | 441,920 | 353,223 |
| reader | | 100 | 105 | 906,467 | 1,586,928 | 563,872 | 203,493 |
| reader | ROLE_WITH_NO_MASKING | 100 | 105 | 873,926 | 1,390,368 | 453,184 | 199,539 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 105 | 1,184,290 | 1,857,944 | 840,680 | 218,853 |
| reader | MASKING_RANDOM_LONG | 100 | 106 | 1,184,382 | 1,847,496 | 795,520 | 224,178 |
| reader | MASKING_RANDOM_STRING | 100 | 105 | 1,162,373 | 1,811,936 | 782,312 | 215,824 |

Creating 3 Indices with 50000 Documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 110 | 966,775 | 1,264,696 | 825,536 | 79,341 |
| reader | | 100 | 109 | 971,058 | 1,393,544 | 779,968 | 95,795 |
| reader | ROLE_WITH_NO_MASKING | 100 | 108 | 964,071 | 1,392,504 | 765,104 | 90,630 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 108 | 1,249,295 | 1,707,896 | 1,028,104 | 120,058 |
| reader | MASKING_RANDOM_LONG | 100 | 109 | 1,225,250 | 1,717,792 | 982,744 | 111,090 |
| reader | MASKING_RANDOM_STRING | 100 | 108 | 1,227,667 | 1,667,544 | 991,424 | 116,841 |
peternied commented 8 months ago

Analysis

Based on these results, here are a couple of findings. Overall, I would recommend avoiding string queries on masked fields, as they appear to use a disproportionate amount of memory compared to other search types. There are likely optimizations that could be added for some kinds of fields, such as building a cache of masked values to avoid the allocation tax, but it's unclear whether this would help in high-cardinality scenarios.
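For illustration only, a minimal sketch of what such a masked-value cache might look like; the class name, the size cap, and the wrapped masking function are all hypothetical and not part of the plugin:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical bounded cache of already-masked values, keyed by the original value.
// Helps low-cardinality fields by avoiding repeated allocations, but offers little
// benefit (and costs extra heap) for high-cardinality fields.
public class MaskedValueCache {
    private static final int MAX_ENTRIES = 10_000; // assumption: cap to bound heap usage

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> maskFunction;

    public MaskedValueCache(Function<String, String> maskFunction) {
        this.maskFunction = maskFunction;
    }

    public String mask(String original) {
        // Once the cap is reached, only serve existing entries; otherwise mask directly.
        if (cache.size() >= MAX_ENTRIES) {
            String cached = cache.get(original);
            return cached != null ? cached : maskFunction.apply(original);
        }
        return cache.computeIfAbsent(original, maskFunction);
    }
}
```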

The following observations are the sources for these recommendations.

GC Impact

The Attempts column increases whenever GC ran during a test run, since a GC makes the memory calculation unreliable and the run has to be repeated. A higher Attempts value therefore means more GCs were triggered, i.e. higher JVM pressure. The baseline performs best, term queries trip GC slightly more often, and string queries trip it most frequently.
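The test framework itself isn't shown in this issue, but a measurement of this shape (discarding the sample and counting another attempt whenever a GC fired mid-run) could look roughly like the sketch below; the method names and retry behavior are assumptions, not the actual test code:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class HeapMeasurement {

    // Sums GC invocation counts across all collectors.
    private static long totalGcCount() {
        long count = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            count += Math.max(0, gc.getCollectionCount());
        }
        return count;
    }

    /** Measures heap used by one run of {@code work}, retrying (another "attempt") if a GC happened mid-run. */
    public static long measureHeapUsed(Runnable work) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        while (true) {
            long gcBefore = totalGcCount();
            long heapBefore = memory.getHeapMemoryUsage().getUsed();
            work.run();
            long heapAfter = memory.getHeapMemoryUsage().getUsed();
            // If no GC fired during the run, the delta is usable; otherwise retry.
            if (totalGcCount() == gcBefore) {
                return heapAfter - heapBefore;
            }
        }
    }
}
```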

Masking Value Type

The complexity of the masking operation (masking a LONG vs. a STRING) did not appear to impact usage; however, there is a clear increase of 15-30% in memory usage when a field is masked in either the Baseline or Term scenarios. This overhead is semi-expected, since the masking operations happen as the data is accessed.

Memory expense of StringQueries

When comparing Term vs. StringQueries, there is a 2x increase in memory usage when masked fields are involved, possibly indicating additional rework that doesn't align with the behavior pattern seen for non-masked fields, which are generally the more memory-expensive case.

stephen-crawford commented 7 months ago

[Triage] Hi @peternied, thank you for the extremely detailed issue. You provided a lot of good detail, and we can take a couple of paths forward:

The first task is definitely actionable; the other two tasks can be investigated on an RFC or help-wanted basis, since it may not be clear how to correct this quickly.

cwperks commented 4 months ago

@peternied I noticed that field masking is being applied on search requests where the request size is 0 and the only aggregations requested are count or cardinality. This may be an area that can be optimized to reduce the memory footprint of field masking.

This PR on my fork contains an example: https://github.com/cwperks/security/pull/24 https://github.com/opensearch-project/security/pull/4362

  1. Create an index and index some documents.

     Sample documents:

     ```json
     { "title": "Magnum Opus", "artist": "First artist", "lyrics": "Very deep subject", "stars": 1, "genre": "rock" },
     { "title": "Song 1+1", "artist": "String", "lyrics": "Once upon a time", "stars": 2, "genre": "blues" },
     { "title": "Next song", "artist": "Twins", "lyrics": "giant nonsense", "stars": 3, "genre": "jazz" },
     { "title": "Poison", "artist": "No!", "lyrics": "Much too much", "stars": 4, "genre": "rock" },
     { "title": "ABC", "artist": "First artist", "lyrics": "abcdefghijklmnopqrstuvwxyz", "stars": 7, "genre": "rock" }
     ```

  2. Assign a user to be able to search the index, but anonymize the artist field.

  3. Run a search request of size 0 with a cardinality aggregation (uniqueness count):

     ```
     POST /songs/_search
     {
       "size": 0,
       "aggs": {
         "unique_artists": {
           "cardinality": {
             "field": "artist.keyword"
           }
         }
       }
     }
     ```

For these types of queries, is applying field masking necessary? It looks like extra overhead for a query that never requests the sensitive info. If the size is > 0, or the request has aggregations other than cardinality and count (such as max or min, where raw data can be exposed), then field masking should still be applied.
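A rough sketch of the kind of check being described, for illustration only: the class, method name, and placement are hypothetical and not the code in the linked PR, and aggregation-builder package names may differ between versions. It only inspects top-level aggregations.

```java
import org.opensearch.action.search.SearchRequest;
import org.opensearch.search.aggregations.AggregationBuilder;
import org.opensearch.search.aggregations.metrics.CardinalityAggregationBuilder;
import org.opensearch.search.aggregations.metrics.ValueCountAggregationBuilder;
import org.opensearch.search.builder.SearchSourceBuilder;

public final class FieldMaskingHeuristics {

    /**
     * Returns true if masking could be skipped: the request returns no hits (size == 0)
     * and every requested aggregation is a cardinality or value_count aggregation,
     * neither of which exposes raw field values.
     */
    public static boolean canSkipFieldMasking(SearchRequest request) {
        SearchSourceBuilder source = request.source();
        if (source == null || source.size() != 0) {
            return false;
        }
        if (source.aggregations() == null || source.aggregations().getAggregatorFactories().isEmpty()) {
            return false;
        }
        for (AggregationBuilder agg : source.aggregations().getAggregatorFactories()) {
            if (!(agg instanceof CardinalityAggregationBuilder) && !(agg instanceof ValueCountAggregationBuilder)) {
                return false;
            }
        }
        return true;
    }
}
```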

stephen-crawford commented 3 months ago

Reproduction results:

baseline

3 indices with 5000 documents


| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | | 100 | 108 | 861,267 | 1,333,568 | 697,264 | 126,293 |
| admin | | 100 | 123 | 3,861,272 | 24,316,608 | 632,368 | 4,897,820 |
| reader | ROLE_WITH_NO_MASKING | 100 | 108 | 681,459 | 943,384 | 504,032 | 74,876 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 107 | 663,067 | 835,976 | 524,560 | 63,470 |
| reader | MASKING_RANDOM_LONG | 100 | 107 | 938,830 | 1,236,864 | 646,680 | 156,187 |
| reader | MASKING_RANDOM_STRING | 100 | 107 | 966,501 | 1,299,488 | 601,336 | 138,535 |

3 indices with 50000 documents


| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | | 100 | 106 | 982,778 | 1,271,616 | 717,296 | 134,142 |
| admin | | 100 | 110 | 904,744 | 1,548,496 | 526,032 | 249,730 |
| reader | ROLE_WITH_NO_MASKING | 100 | 107 | 667,823 | 1,063,920 | 446,520 | 74,430 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 107 | 665,148 | 878,424 | 529,000 | 69,972 |
| reader | MASKING_RANDOM_LONG | 100 | 107 | 1,049,116 | 1,654,584 | 642,648 | 153,067 |
| reader | MASKING_RANDOM_STRING | 100 | 107 | 984,706 | 1,248,968 | 566,432 | 145,058 |

aggregateFilterTermQueryScenarios

3 indices with 5000 documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 107 | 1,824,332 | 10,282,096 | 293,264 | 1,807,192 |
| reader | | 100 | 105 | 744,310 | 1,247,896 | 474,384 | 180,697 |
| reader | ROLE_WITH_NO_MASKING | 100 | 105 | 746,171 | 1,337,872 | 464,712 | 180,388 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 106 | 1,564,257 | 2,180,256 | 1,247,528 | 208,667 |
| reader | MASKING_RANDOM_LONG | 100 | 106 | 1,385,117 | 2,133,552 | 848,360 | 283,312 |
| reader | MASKING_RANDOM_STRING | 100 | 105 | 1,225,246 | 2,181,888 | 799,584 | 258,089 |

3 indices with 50000 documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 110 | 847,600 | 1,481,680 | 626,416 | 109,946 |
| reader | | 100 | 108 | 869,619 | 1,576,624 | 621,760 | 180,665 |
| reader | ROLE_WITH_NO_MASKING | 100 | 107 | 763,011 | 1,086,568 | 586,408 | 90,824 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 110 | 2,345,859 | 2,830,104 | 1,167,248 | 216,459 |
| reader | MASKING_RANDOM_LONG | 100 | 110 | 2,249,719 | 2,523,008 | 2,058,280 | 81,800 |
| reader | MASKING_RANDOM_STRING | 100 | 109 | 2,272,825 | 2,807,928 | 1,275,320 | 196,273 |

aggregateFilterStringQueryScenarios

3 indices with 5000 documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 111 | 4,848,198 | 29,831,272 | 812,496 | 6,303,649 |
| reader | | 100 | 106 | 1,043,223 | 1,562,624 | 781,976 | 145,309 |
| reader | ROLE_WITH_NO_MASKING | 100 | 106 | 1,014,046 | 1,419,760 | 718,328 | 143,819 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 110 | 3,269,142 | 3,852,016 | 2,437,888 | 190,123 |
| reader | MASKING_RANDOM_LONG | 100 | 107 | 3,206,087 | 3,532,984 | 2,926,448 | 128,675 |
| reader | MASKING_RANDOM_STRING | 100 | 108 | 3,184,283 | 3,886,000 | 2,569,528 | 239,173 |

3 indices with 50000 documents


| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 110 | 1,167,121 | 1,493,264 | 964,176 | 91,059 |
| reader | | 100 | 107 | 1,064,978 | 1,516,824 | 747,968 | 155,176 |
| reader | ROLE_WITH_NO_MASKING | 100 | 107 | 1,065,616 | 1,606,152 | 697,152 | 176,486 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 115 | 5,272,236 | 5,704,016 | 4,998,000 | 135,232 |
| reader | MASKING_RANDOM_LONG | 100 | 113 | 5,387,294 | 5,730,312 | 5,162,648 | 114,408 |
| reader | MASKING_RANDOM_STRING | 100 | 113 | 5,223,454 | 5,552,800 | 5,001,576 | 100,145 |

With Craig's Changes:

baseline

3 indices with 5000 documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | | 100 | 107 | 683,951 | 1,299,064 | 61,144 | 255,714 |
| reader | MASKING_RANDOM_STRING | 100 | 108 | 724,074 | 1,193,984 | 532,120 | 121,375 |
| reader | MASKING_RANDOM_LONG | 100 | 108 | 728,235 | 1,209,752 | 499,328 | 106,161 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 109 | 685,820 | 896,616 | 437,232 | 69,834 |
| reader | ROLE_WITH_NO_MASKING | 100 | 109 | 692,526 | 943,160 | 512,136 | 72,825 |
| admin | | 100 | 122 | 2,736,726 | 19,272,432 | 622,224 | 3,459,203 |

3 indices with 50000 documents


| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | | 100 | 107 | 706,473 | 1,086,544 | 528,256 | 90,361 |
| reader | MASKING_RANDOM_STRING | 100 | 107 | 715,920 | 1,099,000 | 545,888 | 92,117 |
| reader | MASKING_RANDOM_LONG | 100 | 106 | 706,888 | 1,255,768 | 517,880 | 109,191 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 107 | 797,238 | 1,289,392 | 533,688 | 156,511 |
| reader | ROLE_WITH_NO_MASKING | 100 | 107 | 805,953 | 1,080,208 | 485,448 | 132,880 |
| admin | | 100 | 108 | 702,158 | 1,174,728 | 531,128 | 91,654 |

aggregateFilterTermQueryScenarios

3 indices with 5000 documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | | 100 | 105 | 1,240,216 | 1,834,912 | 895,920 | 182,783 |
| admin | | 100 | 108 | 5,719,315 | 41,066,960 | 636,096 | 8,486,929 |
| reader | ROLE_WITH_NO_MASKING | 100 | 105 | 752,132 | 1,409,144 | 526,120 | 180,726 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 105 | 767,652 | 1,255,280 | 526,224 | 169,239 |
| reader | MASKING_RANDOM_LONG | 100 | 105 | 1,290,737 | 2,092,840 | 916,792 | 219,425 |
| reader | MASKING_RANDOM_STRING | 100 | 106 | 1,268,168 | 1,835,760 | 927,896 | 207,900 |

3 indices with 50000 documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | | 100 | 107 | 1,411,215 | 1,729,784 | 1,167,728 | 108,204 |
| reader | MASKING_RANDOM_STRING | 100 | 107 | 1,458,550 | 1,748,408 | 1,289,344 | 81,905 |
| reader | MASKING_RANDOM_LONG | 100 | 108 | 1,519,651 | 1,792,320 | 1,256,032 | 117,450 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 106 | 890,869 | 1,327,336 | 544,312 | 175,251 |
| reader | ROLE_WITH_NO_MASKING | 100 | 108 | 904,850 | 1,295,392 | 677,272 | 135,768 |
| admin | | 100 | 108 | 924,740 | 1,534,144 | 626,872 | 191,236 |

aggregateFilterStringQueryScenarios

3 indices with 5000 documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | | 100 | 112 | 3,187,505 | 3,573,144 | 2,712,240 | 160,524 |
| reader | MASKING_RANDOM_STRING | 100 | 113 | 3,317,756 | 3,626,848 | 2,635,784 | 112,230 |
| reader | MASKING_RANDOM_LONG | 100 | 115 | 3,341,987 | 3,840,784 | 3,120,224 | 151,827 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 108 | 1,065,400 | 1,433,872 | 763,416 | 148,780 |
| reader | ROLE_WITH_NO_MASKING | 100 | 110 | 1,089,621 | 1,363,248 | 819,416 | 145,270 |
| admin | | 100 | 125 | 2,498,277 | 14,298,976 | 279,928 | 2,153,253 |

3 indices with 50000 documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | | 100 | 113 | 4,825,162 | 5,211,376 | 4,600,384 | 114,015 |
| reader | MASKING_RANDOM_STRING | 100 | 114 | 4,923,908 | 5,256,120 | 4,749,248 | 111,796 |
| reader | MASKING_RANDOM_LONG | 100 | 114 | 4,893,227 | 5,327,032 | 4,687,368 | 129,189 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 108 | 1,160,330 | 1,544,976 | 978,552 | 81,653 |
| reader | ROLE_WITH_NO_MASKING | 100 | 108 | 1,166,856 | 1,414,224 | 883,584 | 91,983 |
| admin | | 100 | 113 | 1,361,717 | 1,961,136 | 1,106,248 | 214,172 |
stephen-crawford commented 3 months ago

In the image below, I have compared the results of the original tables with the results of the tables with Craig's change. Red in the left table means the corresponding value is greater than the value in the equivalent position in the right-hand tables. As shown, the change directly impacts the testing scenarios covered by Peter's test framework for a majority of the measured values.

stephen-crawford commented 3 months ago

(Screenshot 2024-06-25 at 1:43:23 PM: comparison of the original results against the results with Craig's change)

peternied commented 3 months ago

Focusing just on aggregateFilterStringQueryScenarios

Admin

Isn't lower memory usage better? admin seems to be doing better in the original than with the updated change. It looks like it was 1 GB vs. 4 GB, which would be a 400% regression.

Reader with role MASKING_RANDOM_STRING

Looking at the difference between the two runs, focusing just on MASKING_RANDOM_STRING, that seems like only a ~5% improvement. While that isn't nothing, it's pretty close to the standard deviation of 111,796.

Original

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | MASKING_RANDOM_STRING | 100 | 113 | 5,223,454 | 5,552,800 | 5,001,576 | 100,145 |

With Change

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | MASKING_RANDOM_STRING | 100 | 114 | 4,923,908 | 5,256,120 | 4,749,248 | 111,796 |
cwperks commented 3 months ago

@scrawfor99 @peternied The change in https://github.com/cwperks/security/pull/24 is a very narrow change that applies only to search requests that have size = 0 and contain nothing but cardinality or count aggregations. It's probably not representative to run the entire suite of tests when it applies narrowly to those two use cases.

stephen-crawford commented 3 months ago

Hi @cwperks, of course. No worries. I am sure the change accomplishes what you intended. I just reran the tests to see if it would fix everything or not. It is no problem to add some more changes and see where that goes.

stephen-crawford commented 3 months ago

I went through the code and found some possible refactoring that may help improve the heap use based on the implementation.

I also found some lines whose purpose I am not sure of...

First, we should avoid expanding the aggregation builder factories if we can help it. I noticed we iterate over the list at least 3 separate times within DlsFlsValveImpl alone. While this is a linear operation, if we are concerned about heap use, we end up storing additional objects each time we do that, so it's not great.

We can also avoid the frequent conditionals used to identify the AggregationBuilder type via instanceof by adding a setHint method to the AggregationBuilder interface that all of the aggregations implement (this would happen in core).
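As a loose illustration of that idea only: nothing like this exists in core today, and the names and exact shape (setter vs. getter, what the hint encodes) are hypothetical.

```java
// Hypothetical addition to the core AggregationBuilder contract: each builder
// reports a coarse category once, so callers don't need chains of instanceof checks.
public interface AggregationTypeHint {

    enum Kind {
        COUNT_LIKE,   // e.g. cardinality, value_count: never exposes raw values
        METRIC,       // e.g. min, max, avg: computed from raw values
        BUCKETING     // e.g. terms, filters: groups documents
    }

    /** Implemented by each AggregationBuilder; consumers branch on the returned kind. */
    Kind typeHint();
}
```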

We can avoid storing strings on the heap by building log messages only when the appropriate logger level is enabled. This would require a larger logger rework: instead of always constructing every log line and only printing based on the configured level, we would only construct lines that pass the configured logging level. This matters more than it seems, since in some areas we are logging complete requests as JSON, which takes up a fair amount of space.
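For example, a generic sketch (not a quote of the plugin's logging code) of the difference between eagerly building the message and guarding it behind the level check; `toJson` here is a placeholder:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class LoggingExample {
    private static final Logger log = LogManager.getLogger(LoggingExample.class);

    void logRequest(Object request) {
        // Eager: the JSON string is built even when debug logging is disabled.
        // log.debug("evaluating request: " + toJson(request));

        // Guarded: the string is only built if debug is actually enabled.
        if (log.isDebugEnabled()) {
            log.debug("evaluating request: {}", toJson(request));
        }
    }

    private String toJson(Object request) {
        return String.valueOf(request); // placeholder for real serialization
    }
}
```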

We can avoid checking each request for an update or resize request by adding a flag to the interface that lets us fetch whether they are cacheable.

Likewise, we can have a request signal whether it has source.profile set or a min doc count of 0.

We can try to intern strings throughout the code.

We can swap to HashSets etc.

We can replace streams with plain loops, since streams are less memory efficient.

We can condense the setFls code, which duplicates the serialization process done in the DLS logic.

It's also not clear why we deserialize in this code:


    private void setDlsHeaders(EvaluatedDlsFlsConfig dlsFls, ActionRequest request) {
        if (!dlsFls.getDlsQueriesByIndex().isEmpty()) {
            Map<String, Set<String>> dlsQueries = dlsFls.getDlsQueriesByIndex();

            if (request instanceof ClusterSearchShardsRequest && HeaderHelper.isTrustedClusterRequest(threadContext)) {
                threadContext.addResponseHeader(
                    ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER,
                    Base64Helper.serializeObject((Serializable) dlsQueries)
                );
                if (log.isDebugEnabled()) {
                    log.debug("added response header for DLS info: {}", dlsQueries);
                }
            } else {
                if (threadContext.getHeader(ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER) != null) {
                    Object deserializedDlsQueries = Base64Helper.deserializeObject(
                        threadContext.getHeader(ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER),
                        threadContext.getTransient(ConfigConstants.USE_JDK_SERIALIZATION)
                    );
                    if (!dlsQueries.equals(deserializedDlsQueries)) {
                        throw new OpenSearchSecurityException(
                            ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER + " does not match (SG 900D)"
                        );
                    }
                } else {
                    threadContext.putHeader(
                        ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER,
                        Base64Helper.serializeObject((Serializable) dlsQueries)
                    );
                    if (log.isDebugEnabled()) {
                        log.debug("attach DLS info: {}", dlsQueries);
                    }
                }
            }
        }
    }

Avoiding that deserialization at the "does not match" check would save some heap.
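One possible shape of that change, as a sketch only: serialize the local queries once and compare the serialized strings instead of deserializing the header. This assumes the serialized form is deterministic for equal maps, which would need to be verified before adopting anything like it; it reuses the fields and helpers from the method quoted above.

```java
private void checkOrAttachDlsHeader(Map<String, Set<String>> dlsQueries) {
    // Serialize once and reuse the string for both the comparison and, if the
    // header is absent, for attaching it -- no deserialization needed.
    final String localSerialized = Base64Helper.serializeObject((Serializable) dlsQueries);
    final String existingHeader = threadContext.getHeader(ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER);
    if (existingHeader == null) {
        threadContext.putHeader(ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER, localSerialized);
    } else if (!existingHeader.equals(localSerialized)) {
        // Caveat: this only preserves the original semantics if serialization is
        // deterministic for equal maps (entry order, serializer version, etc.).
        throw new OpenSearchSecurityException(ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER + " does not match (SG 900D)");
    }
}
```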

stephen-crawford commented 3 months ago

I will start going through making these changes and rerun the tests. Leaving the interface and logger changes for last.

cwperks commented 3 months ago

Thanks for the deep dive @scrawfor99! How are the performance metrics with the suggested changes implemented?

Should this be split into multiple parts?

  1. Overall optimizations that work for requests where Field Masking is applied
  2. Narrow use-cases where large optimizations can take place?
peternied commented 3 months ago

> Should this be split into multiple parts?

Yes please - ideally each change would be in isolation with clean before and after results that prove we are moving in the right direction.

stephen-crawford commented 3 months ago

I just finished going over the DlsFlsLeafReader in a similar manner to the DlsFlsValveImpl. There were fewer obvious changes to be made there than in the latter.

Ultimately some options I found were:

Now going to break the changes for the DlsFlsValveImpl into the following:

| Change Version | Change |
|---|---|
| 1 | Avoid aggregation builder expansion |
| 2 | Remove logging |
| 3 | Use HashSets |
| 4 | Use loops instead of streams |
| 5 | Condense setFls code |

I will share the test results of each change type after I make them.

I am not going to test the interface-modifying changes at the moment, since that requires more effort. I will see how much improvement these changes make, and if it is not much, I will go forward with the interface changes.

stephen-crawford commented 3 months ago
(Screenshot 2024-06-28 at 12:24:07 PM: test suite results for each change configuration)

These are the results of running the test suite after making the above change configurations.

As we can see, there is not an obvious pattern for what is the most effective way to improve the heap performance of the different query types.

I am not sure of the next steps for improving this, so I am going to reach out to @jainankitk to see if they have any recommendations.