opensearch-project / security

🔐 Secure your cluster with TLS, numerous authentication backends, data masking, audit logging as well as role-based access control on indices, documents, and fields
https://opensearch.org/docs/latest/security-plugin/index/
Apache License 2.0

[BUG] Field masking has inconsistent memory issues with certain queries #4031

Open peternied opened 8 months ago

peternied commented 8 months ago

What is the bug? We've seen reports of heap memory usage spiking with field masking enabled. Because masking is implemented at the leaf level, it's possible that certain types of queries cause the masked fields to be materialized even when they are not used.

How can one reproduce the bug? Steps to reproduce the behavior:

Baseline behavior

Baseline query with the following format:

final SearchSourceBuilder ssb = new SearchSourceBuilder();
ssb.size(0);
final SearchRequest request = new SearchRequest(INDEX_NAME_PREFIX + "*");
request.source(ssb);

Creating 3 Indices with 5000 Documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 106 | 888,814 | 3,621,248 | 191,136 | 480,198 |
| reader | | 100 | 105 | 808,317 | 1,470,368 | 517,640 | 195,083 |
| reader | ROLE_WITH_NO_MASKING | 100 | 105 | 813,600 | 1,602,272 | 536,416 | 206,571 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 105 | 936,948 | 1,631,088 | 618,704 | 195,653 |
| reader | MASKING_RANDOM_LONG | 100 | 105 | 919,187 | 1,593,456 | 548,264 | 236,892 |
| reader | MASKING_RANDOM_STRING | 100 | 105 | 951,826 | 1,656,432 | 564,784 | 191,008 |

Creating 3 Indices with 50000 Documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 110 | 873,379 | 4,872,552 | 575,208 | 508,483 |
| reader | | 100 | 109 | 753,489 | 4,020,832 | 567,304 | 341,413 |
| reader | ROLE_WITH_NO_MASKING | 100 | 108 | 715,624 | 988,688 | 553,504 | 78,209 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 108 | 970,238 | 1,222,008 | 650,520 | 89,596 |
| reader | MASKING_RANDOM_LONG | 100 | 108 | 942,349 | 1,215,448 | 679,768 | 115,690 |
| reader | MASKING_RANDOM_STRING | 100 | 108 | 971,219 | 1,646,856 | 660,856 | 119,863 |

Query with Aggregate Filter

Query with the following format:

SearchSourceBuilder ssb = new SearchSourceBuilder();
ssb.aggregation(AggregationBuilders.filters("my-filter", QueryBuilders.queryStringQuery("last")));
ssb.aggregation(AggregationBuilders.count("counting").field("genre.keyword"));
ssb.aggregation(AggregationBuilders.avg("averaging").field("longId"));
ssb.size(0);
final SearchRequest request = new SearchRequest(INDEX_NAME_PREFIX + "*");
request.source(ssb);

Creating 3 Indices with 5000 Documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 106 | 1,069,939 | 4,905,008 | 288,144 | 679,704 |
| reader | | 100 | 105 | 877,725 | 1,604,048 | 562,032 | 215,923 |
| reader | ROLE_WITH_NO_MASKING | 100 | 106 | 898,739 | 1,860,632 | 354,032 | 260,686 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 106 | 2,441,865 | 3,504,944 | 2,040,768 | 235,363 |
| reader | MASKING_RANDOM_LONG | 100 | 106 | 2,500,135 | 3,215,712 | 1,984,096 | 247,712 |
| reader | MASKING_RANDOM_STRING | 100 | 106 | 2,414,665 | 3,330,960 | 2,075,848 | 232,634 |

Creating 3 Indices with 50000 Documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 113 | 1,265,168 | 2,698,248 | 993,528 | 301,134 |
| reader | | 100 | 112 | 1,308,585 | 5,408,936 | 851,672 | 579,648 |
| reader | ROLE_WITH_NO_MASKING | 100 | 109 | 1,021,555 | 1,504,480 | 798,400 | 136,735 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 114 | 2,420,922 | 7,183,000 | 1,828,456 | 660,701 |
| reader | MASKING_RANDOM_LONG | 100 | 112 | 2,225,070 | 2,815,568 | 1,994,176 | 142,038 |
| reader | MASKING_RANDOM_STRING | 100 | 111 | 2,169,297 | 2,673,376 | 1,964,144 | 128,983 |

Term Match Query:

Query with the following format:

SearchSourceBuilder ssb = new SearchSourceBuilder();
ssb.aggregation(AggregationBuilders.filters("my-filter",  QueryBuilders.termQuery("title","last")));
ssb.aggregation(AggregationBuilders.count("counting").field("genre.keyword"));
ssb.aggregation(AggregationBuilders.avg("averaging").field("longId"));
ssb.size(0);
final SearchRequest request = new SearchRequest(INDEX_NAME_PREFIX + "*");
request.source(ssb);

Creating 3 Indices with 5000 Documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 106 | 1,154,269 | 2,497,744 | 441,920 | 353,223 |
| reader | | 100 | 105 | 906,467 | 1,586,928 | 563,872 | 203,493 |
| reader | ROLE_WITH_NO_MASKING | 100 | 105 | 873,926 | 1,390,368 | 453,184 | 199,539 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 105 | 1,184,290 | 1,857,944 | 840,680 | 218,853 |
| reader | MASKING_RANDOM_LONG | 100 | 106 | 1,184,382 | 1,847,496 | 795,520 | 224,178 |
| reader | MASKING_RANDOM_STRING | 100 | 105 | 1,162,373 | 1,811,936 | 782,312 | 215,824 |

Creating 3 Indices with 50000 Documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 110 | 966,775 | 1,264,696 | 825,536 | 79,341 |
| reader | | 100 | 109 | 971,058 | 1,393,544 | 779,968 | 95,795 |
| reader | ROLE_WITH_NO_MASKING | 100 | 108 | 964,071 | 1,392,504 | 765,104 | 90,630 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 108 | 1,249,295 | 1,707,896 | 1,028,104 | 120,058 |
| reader | MASKING_RANDOM_LONG | 100 | 109 | 1,225,250 | 1,717,792 | 982,744 | 111,090 |
| reader | MASKING_RANDOM_STRING | 100 | 108 | 1,227,667 | 1,667,544 | 991,424 | 116,841 |
peternied commented 8 months ago

Analysis

Based on these results, here are a couple of findings. Overall, I would recommend avoiding string queries on masked fields, as they appear to use a disproportionate amount of memory compared to other search types. There are likely optimizations that could be added for some kinds of fields, such as building a cache of masked values to avoid the allocation tax, but it's unclear whether this would help in high-cardinality scenarios.
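For illustration only, a minimal sketch of what such a masked-value cache might look like; the class name, the size cap, and the wrapped masking function are all hypothetical and not part of the plugin:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical bounded cache of already-masked values, keyed by the original value.
// Helps low-cardinality fields by avoiding repeated allocations, but offers little
// benefit (and costs extra heap) for high-cardinality fields.
public class MaskedValueCache {
    private static final int MAX_ENTRIES = 10_000; // assumption: cap to bound heap usage

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> maskFunction;

    public MaskedValueCache(Function<String, String> maskFunction) {
        this.maskFunction = maskFunction;
    }

    public String mask(String original) {
        // Once the cap is reached, only serve existing entries; otherwise mask directly.
        if (cache.size() >= MAX_ENTRIES) {
            String cached = cache.get(original);
            return cached != null ? cached : maskFunction.apply(original);
        }
        return cache.computeIfAbsent(original, maskFunction);
    }
}
```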

The following observations are the sources for these recommendations.

GC Impact

The Attempts column increases whenever GC ran during a test run, since a GC makes the memory calculation unreliable and the run has to be repeated. A higher Attempts value therefore means more GCs were triggered, i.e. higher JVM pressure. The baseline performs best, term queries trip GC slightly more often, and string queries trip it most frequently.
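The test framework itself isn't shown in this issue, but a measurement of this shape (discarding the sample and counting another attempt whenever a GC fired mid-run) could look roughly like the sketch below; the method names and retry behavior are assumptions, not the actual test code:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class HeapMeasurement {

    // Sums GC invocation counts across all collectors.
    private static long totalGcCount() {
        long count = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            count += Math.max(0, gc.getCollectionCount());
        }
        return count;
    }

    /** Measures heap used by one run of {@code work}, retrying (another "attempt") if a GC happened mid-run. */
    public static long measureHeapUsed(Runnable work) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        while (true) {
            long gcBefore = totalGcCount();
            long heapBefore = memory.getHeapMemoryUsage().getUsed();
            work.run();
            long heapAfter = memory.getHeapMemoryUsage().getUsed();
            // If no GC fired during the run, the delta is usable; otherwise retry.
            if (totalGcCount() == gcBefore) {
                return heapAfter - heapBefore;
            }
        }
    }
}
```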

Masking Value Type

The complexity of the masking operation (masking a LONG vs. a STRING) did not appear to impact usage; however, there is a clear increase of 15-30% in memory usage when a field is masked in either the Baseline or Term scenarios. This overhead is semi-expected, since the masking operations happen as the data is accessed.

Memory expense of StringQueries

When comparing Term vs. StringQueries, there is a 2x increase in memory usage when masked fields are involved, possibly indicating additional rework that doesn't align with the behavior pattern seen for non-masked fields, which are generally the more memory-expensive case.

stephen-crawford commented 7 months ago

[Triage] Hi @peternied, thank you for the extremely detailed issue. You provided a lot of good detail, and we can take a couple of paths forward:

The first task is definitely actionable; the other two tasks can be investigated on an RFC or help-wanted basis, since it may not be clear how to correct this quickly.

cwperks commented 4 months ago

@peternied I noticed that field masking is being applied on search requests where the request size is 0 and the only aggregations requested are count or cardinality. This may be an area that can be optimized to reduce the memory footprint of field masking.

This PR on my fork contains an example: https://github.com/cwperks/security/pull/24 https://github.com/opensearch-project/security/pull/4362

  1. Create an index and index some documents.

     Sample documents:

     ```json
     { "title": "Magnum Opus", "artist": "First artist", "lyrics": "Very deep subject", "stars": 1, "genre": "rock" },
     { "title": "Song 1+1", "artist": "String", "lyrics": "Once upon a time", "stars": 2, "genre": "blues" },
     { "title": "Next song", "artist": "Twins", "lyrics": "giant nonsense", "stars": 3, "genre": "jazz" },
     { "title": "Poison", "artist": "No!", "lyrics": "Much too much", "stars": 4, "genre": "rock" },
     { "title": "ABC", "artist": "First artist", "lyrics": "abcdefghijklmnopqrstuvwxyz", "stars": 7, "genre": "rock" }
     ```

  2. Assign a user to be able to search the index, but anonymize the artist field.

  3. Run a search request of size 0 with a cardinality aggregation (uniqueness count):

     ```
     POST /songs/_search
     {
       "size": 0,
       "aggs": {
         "unique_artists": {
           "cardinality": {
             "field": "artist.keyword"
           }
         }
       }
     }
     ```

For these types of queries, is applying field masking necessary? It looks like extra overhead for a query that never requests the sensitive info. If the size is > 0, or the request has aggregations other than cardinality and count (such as max or min, where raw data can be exposed), then field masking should still be applied.
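A rough sketch of the kind of check being described, for illustration only: the class, method name, and placement are hypothetical and not the code in the linked PR, and aggregation-builder package names may differ between versions. It only inspects top-level aggregations.

```java
import org.opensearch.action.search.SearchRequest;
import org.opensearch.search.aggregations.AggregationBuilder;
import org.opensearch.search.aggregations.metrics.CardinalityAggregationBuilder;
import org.opensearch.search.aggregations.metrics.ValueCountAggregationBuilder;
import org.opensearch.search.builder.SearchSourceBuilder;

public final class FieldMaskingHeuristics {

    /**
     * Returns true if masking could be skipped: the request returns no hits (size == 0)
     * and every requested aggregation is a cardinality or value_count aggregation,
     * neither of which exposes raw field values.
     */
    public static boolean canSkipFieldMasking(SearchRequest request) {
        SearchSourceBuilder source = request.source();
        if (source == null || source.size() != 0) {
            return false;
        }
        if (source.aggregations() == null || source.aggregations().getAggregatorFactories().isEmpty()) {
            return false;
        }
        for (AggregationBuilder agg : source.aggregations().getAggregatorFactories()) {
            if (!(agg instanceof CardinalityAggregationBuilder) && !(agg instanceof ValueCountAggregationBuilder)) {
                return false;
            }
        }
        return true;
    }
}
```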

stephen-crawford commented 3 months ago

Reproduction results:

baseline

3 indices with 5000 documents


| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | | 100 | 108 | 861,267 | 1,333,568 | 697,264 | 126,293 |
| admin | | 100 | 123 | 3,861,272 | 24,316,608 | 632,368 | 4,897,820 |
| reader | ROLE_WITH_NO_MASKING | 100 | 108 | 681,459 | 943,384 | 504,032 | 74,876 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 107 | 663,067 | 835,976 | 524,560 | 63,470 |
| reader | MASKING_RANDOM_LONG | 100 | 107 | 938,830 | 1,236,864 | 646,680 | 156,187 |
| reader | MASKING_RANDOM_STRING | 100 | 107 | 966,501 | 1,299,488 | 601,336 | 138,535 |

3 indices with 50000 documents


| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | | 100 | 106 | 982,778 | 1,271,616 | 717,296 | 134,142 |
| admin | | 100 | 110 | 904,744 | 1,548,496 | 526,032 | 249,730 |
| reader | ROLE_WITH_NO_MASKING | 100 | 107 | 667,823 | 1,063,920 | 446,520 | 74,430 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 107 | 665,148 | 878,424 | 529,000 | 69,972 |
| reader | MASKING_RANDOM_LONG | 100 | 107 | 1,049,116 | 1,654,584 | 642,648 | 153,067 |
| reader | MASKING_RANDOM_STRING | 100 | 107 | 984,706 | 1,248,968 | 566,432 | 145,058 |

aggregateFilterTermQueryScenarios

3 indices with 5000 documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 107 | 1,824,332 | 10,282,096 | 293,264 | 1,807,192 |
| reader | | 100 | 105 | 744,310 | 1,247,896 | 474,384 | 180,697 |
| reader | ROLE_WITH_NO_MASKING | 100 | 105 | 746,171 | 1,337,872 | 464,712 | 180,388 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 106 | 1,564,257 | 2,180,256 | 1,247,528 | 208,667 |
| reader | MASKING_RANDOM_LONG | 100 | 106 | 1,385,117 | 2,133,552 | 848,360 | 283,312 |
| reader | MASKING_RANDOM_STRING | 100 | 105 | 1,225,246 | 2,181,888 | 799,584 | 258,089 |

3 indices with 50000 documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 110 | 847,600 | 1,481,680 | 626,416 | 109,946 |
| reader | | 100 | 108 | 869,619 | 1,576,624 | 621,760 | 180,665 |
| reader | ROLE_WITH_NO_MASKING | 100 | 107 | 763,011 | 1,086,568 | 586,408 | 90,824 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 110 | 2,345,859 | 2,830,104 | 1,167,248 | 216,459 |
| reader | MASKING_RANDOM_LONG | 100 | 110 | 2,249,719 | 2,523,008 | 2,058,280 | 81,800 |
| reader | MASKING_RANDOM_STRING | 100 | 109 | 2,272,825 | 2,807,928 | 1,275,320 | 196,273 |

aggregateFilterStringQueryScenarios

3 indices with 5000 documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 111 | 4,848,198 | 29,831,272 | 812,496 | 6,303,649 |
| reader | | 100 | 106 | 1,043,223 | 1,562,624 | 781,976 | 145,309 |
| reader | ROLE_WITH_NO_MASKING | 100 | 106 | 1,014,046 | 1,419,760 | 718,328 | 143,819 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 110 | 3,269,142 | 3,852,016 | 2,437,888 | 190,123 |
| reader | MASKING_RANDOM_LONG | 100 | 107 | 3,206,087 | 3,532,984 | 2,926,448 | 128,675 |
| reader | MASKING_RANDOM_STRING | 100 | 108 | 3,184,283 | 3,886,000 | 2,569,528 | 239,173 |

3 indices with 50000 documents


| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 110 | 1,167,121 | 1,493,264 | 964,176 | 91,059 |
| reader | | 100 | 107 | 1,064,978 | 1,516,824 | 747,968 | 155,176 |
| reader | ROLE_WITH_NO_MASKING | 100 | 107 | 1,065,616 | 1,606,152 | 697,152 | 176,486 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 115 | 5,272,236 | 5,704,016 | 4,998,000 | 135,232 |
| reader | MASKING_RANDOM_LONG | 100 | 113 | 5,387,294 | 5,730,312 | 5,162,648 | 114,408 |
| reader | MASKING_RANDOM_STRING | 100 | 113 | 5,223,454 | 5,552,800 | 5,001,576 | 100,145 |

With Craig's Changes:

baseline

3 indices with 5000 documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | | 100 | 107 | 683,951 | 1,299,064 | 61,144 | 255,714 |
| reader | MASKING_RANDOM_STRING | 100 | 108 | 724,074 | 1,193,984 | 532,120 | 121,375 |
| reader | MASKING_RANDOM_LONG | 100 | 108 | 728,235 | 1,209,752 | 499,328 | 106,161 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 109 | 685,820 | 896,616 | 437,232 | 69,834 |
| reader | ROLE_WITH_NO_MASKING | 100 | 109 | 692,526 | 943,160 | 512,136 | 72,825 |
| admin | | 100 | 122 | 2,736,726 | 19,272,432 | 622,224 | 3,459,203 |

3 indices with 50000 documents


| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | | 100 | 107 | 706,473 | 1,086,544 | 528,256 | 90,361 |
| reader | MASKING_RANDOM_STRING | 100 | 107 | 715,920 | 1,099,000 | 545,888 | 92,117 |
| reader | MASKING_RANDOM_LONG | 100 | 106 | 706,888 | 1,255,768 | 517,880 | 109,191 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 107 | 797,238 | 1,289,392 | 533,688 | 156,511 |
| reader | ROLE_WITH_NO_MASKING | 100 | 107 | 805,953 | 1,080,208 | 485,448 | 132,880 |
| admin | | 100 | 108 | 702,158 | 1,174,728 | 531,128 | 91,654 |

aggregateFilterTermQueryScenarios

3 indices with 5000 documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | | 100 | 105 | 1,240,216 | 1,834,912 | 895,920 | 182,783 |
| admin | | 100 | 108 | 5,719,315 | 41,066,960 | 636,096 | 8,486,929 |
| reader | ROLE_WITH_NO_MASKING | 100 | 105 | 752,132 | 1,409,144 | 526,120 | 180,726 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 105 | 767,652 | 1,255,280 | 526,224 | 169,239 |
| reader | MASKING_RANDOM_LONG | 100 | 105 | 1,290,737 | 2,092,840 | 916,792 | 219,425 |
| reader | MASKING_RANDOM_STRING | 100 | 106 | 1,268,168 | 1,835,760 | 927,896 | 207,900 |

3 indices with 50000 documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | | 100 | 107 | 1,411,215 | 1,729,784 | 1,167,728 | 108,204 |
| reader | MASKING_RANDOM_STRING | 100 | 107 | 1,458,550 | 1,748,408 | 1,289,344 | 81,905 |
| reader | MASKING_RANDOM_LONG | 100 | 108 | 1,519,651 | 1,792,320 | 1,256,032 | 117,450 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 106 | 890,869 | 1,327,336 | 544,312 | 175,251 |
| reader | ROLE_WITH_NO_MASKING | 100 | 108 | 904,850 | 1,295,392 | 677,272 | 135,768 |
| admin | | 100 | 108 | 924,740 | 1,534,144 | 626,872 | 191,236 |

aggregateFilterStringQueryScenarios

3 indices with 5000 documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | | 100 | 112 | 3,187,505 | 3,573,144 | 2,712,240 | 160,524 |
| reader | MASKING_RANDOM_STRING | 100 | 113 | 3,317,756 | 3,626,848 | 2,635,784 | 112,230 |
| reader | MASKING_RANDOM_LONG | 100 | 115 | 3,341,987 | 3,840,784 | 3,120,224 | 151,827 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 108 | 1,065,400 | 1,433,872 | 763,416 | 148,780 |
| reader | ROLE_WITH_NO_MASKING | 100 | 110 | 1,089,621 | 1,363,248 | 819,416 | 145,270 |
| admin | | 100 | 125 | 2,498,277 | 14,298,976 | 279,928 | 2,153,253 |

3 indices with 50000 documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | | 100 | 113 | 4,825,162 | 5,211,376 | 4,600,384 | 114,015 |
| reader | MASKING_RANDOM_STRING | 100 | 114 | 4,923,908 | 5,256,120 | 4,749,248 | 111,796 |
| reader | MASKING_RANDOM_LONG | 100 | 114 | 4,893,227 | 5,327,032 | 4,687,368 | 129,189 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 108 | 1,160,330 | 1,544,976 | 978,552 | 81,653 |
| reader | ROLE_WITH_NO_MASKING | 100 | 108 | 1,166,856 | 1,414,224 | 883,584 | 91,983 |
| admin | | 100 | 113 | 1,361,717 | 1,961,136 | 1,106,248 | 214,172 |
stephen-crawford commented 3 months ago

In the image below, I have compared the results of the original tables with the results of the tables with Craig's change. Red in the left table means the corresponding value is greater than the value in the equivalent position in the right-hand tables. As shown, the change directly impacts the testing scenarios covered by Peter's test framework for a majority of the measured values.

stephen-crawford commented 3 months ago

(Screenshot 2024-06-25 at 1:43:23 PM: comparison of the original results against the results with Craig's change)

peternied commented 3 months ago

Focusing just on aggregateFilterStringQueryScenarios

Admin

Isn't lower memory usage better? admin seems to be doing better in the original than with the updated change. It looks like it was 1 GB vs. 4 GB, which would be a 400% regression.

Reader with role MASKING_RANDOM_STRING

Looking at the difference between the two runs, focusing just on MASKING_RANDOM_STRING, that seems like only a ~5% improvement. While that isn't nothing, it's pretty close to the standard deviation of 111,796.

Original

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | MASKING_RANDOM_STRING | 100 | 113 | 5,223,454 | 5,552,800 | 5,001,576 | 100,145 |

With Change

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| reader | MASKING_RANDOM_STRING | 100 | 114 | 4,923,908 | 5,256,120 | 4,749,248 | 111,796 |
cwperks commented 3 months ago

@scrawfor99 @peternied The change in https://github.com/cwperks/security/pull/24 is a very narrow change that applies only to search requests that have size = 0 and contain nothing but cardinality or count aggregations. It's probably not representative to run the entire suite of tests when it applies narrowly to those two use cases.

stephen-crawford commented 3 months ago

Hi @cwperks, of course. No worries. I am sure the change accomplishes what you intended. I just reran the tests to see if it would fix everything or not. It is no problem to add some more changes and see where that goes.

stephen-crawford commented 3 months ago

I went through the code and found some possible refactoring that may help improve the heap use based on the implementation.

I also found some lines whose purpose I am not sure of...

First, we should avoid expanding the aggregation builder factories if we can help it. I noticed we iterate over the list at least 3 separate times within DlsFlsValveImpl alone. While this is a linear operation, if we are concerned about heap use, we end up storing additional objects each time we do that, so it's not great.

We can also avoid the frequent conditionals used to identify the AggregationBuilder type via instanceof by adding a setHint method to the AggregationBuilder interface that all of the aggregations implement (this would happen in core).
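As a loose illustration of that idea only: nothing like this exists in core today, and the names and exact shape (setter vs. getter, what the hint encodes) are hypothetical.

```java
// Hypothetical addition to the core AggregationBuilder contract: each builder
// reports a coarse category once, so callers don't need chains of instanceof checks.
public interface AggregationTypeHint {

    enum Kind {
        COUNT_LIKE,   // e.g. cardinality, value_count: never exposes raw values
        METRIC,       // e.g. min, max, avg: computed from raw values
        BUCKETING     // e.g. terms, filters: groups documents
    }

    /** Implemented by each AggregationBuilder; consumers branch on the returned kind. */
    Kind typeHint();
}
```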

We can avoid storing strings on the heap by building log messages only when the appropriate logger level is enabled. This would require a larger logger rework: instead of always constructing every log line and only printing based on the configured level, we would only construct lines that pass the configured logging level. This matters more than it seems, since in some areas we are logging complete requests as JSON, which takes up a fair amount of space.
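For example, a generic sketch (not a quote of the plugin's logging code) of the difference between eagerly building the message and guarding it behind the level check; `toJson` here is a placeholder:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class LoggingExample {
    private static final Logger log = LogManager.getLogger(LoggingExample.class);

    void logRequest(Object request) {
        // Eager: the JSON string is built even when debug logging is disabled.
        // log.debug("evaluating request: " + toJson(request));

        // Guarded: the string is only built if debug is actually enabled.
        if (log.isDebugEnabled()) {
            log.debug("evaluating request: {}", toJson(request));
        }
    }

    private String toJson(Object request) {
        return String.valueOf(request); // placeholder for real serialization
    }
}
```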

We can avoid checking each request for an update or resize request by adding a flag to the interface that lets us fetch whether they are cacheable.

Likewise, we can have a request signal whether it has source.profile set or a min doc count of 0.

We can try to intern strings throughout the code.

We can swap to HashSets etc.

We can replace streams with plain loops, since streams are less memory efficient.

We can condense the setFls code, which duplicates the serialization process done in the DLS logic.

It's also not clear why we deserialize in this code:


    private void setDlsHeaders(EvaluatedDlsFlsConfig dlsFls, ActionRequest request) {
        if (!dlsFls.getDlsQueriesByIndex().isEmpty()) {
            Map<String, Set<String>> dlsQueries = dlsFls.getDlsQueriesByIndex();

            if (request instanceof ClusterSearchShardsRequest && HeaderHelper.isTrustedClusterRequest(threadContext)) {
                threadContext.addResponseHeader(
                    ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER,
                    Base64Helper.serializeObject((Serializable) dlsQueries)
                );
                if (log.isDebugEnabled()) {
                    log.debug("added response header for DLS info: {}", dlsQueries);
                }
            } else {
                if (threadContext.getHeader(ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER) != null) {
                    Object deserializedDlsQueries = Base64Helper.deserializeObject(
                        threadContext.getHeader(ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER),
                        threadContext.getTransient(ConfigConstants.USE_JDK_SERIALIZATION)
                    );
                    if (!dlsQueries.equals(deserializedDlsQueries)) {
                        throw new OpenSearchSecurityException(
                            ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER + " does not match (SG 900D)"
                        );
                    }
                } else {
                    threadContext.putHeader(
                        ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER,
                        Base64Helper.serializeObject((Serializable) dlsQueries)
                    );
                    if (log.isDebugEnabled()) {
                        log.debug("attach DLS info: {}", dlsQueries);
                    }
                }
            }
        }
    }

Avoiding that deserialization at the "does not match" check would save some heap.
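One possible shape of that change, as a sketch only: serialize the local queries once and compare the serialized strings instead of deserializing the header. This assumes the serialized form is deterministic for equal maps, which would need to be verified before adopting anything like it; it reuses the fields and helpers from the method quoted above.

```java
private void checkOrAttachDlsHeader(Map<String, Set<String>> dlsQueries) {
    // Serialize once and reuse the string for both the comparison and, if the
    // header is absent, for attaching it -- no deserialization needed.
    final String localSerialized = Base64Helper.serializeObject((Serializable) dlsQueries);
    final String existingHeader = threadContext.getHeader(ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER);
    if (existingHeader == null) {
        threadContext.putHeader(ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER, localSerialized);
    } else if (!existingHeader.equals(localSerialized)) {
        // Caveat: this only preserves the original semantics if serialization is
        // deterministic for equal maps (entry order, serializer version, etc.).
        throw new OpenSearchSecurityException(ConfigConstants.OPENDISTRO_SECURITY_DLS_QUERY_HEADER + " does not match (SG 900D)");
    }
}
```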

stephen-crawford commented 3 months ago

I will start going through making these changes and rerun the tests. Leaving the interface and logger changes for last.

cwperks commented 3 months ago

Thanks for the deep dive @scrawfor99! How are the performance metrics with the suggested changes implemented?

Should this be split into multiple parts?

  1. Overall optimizations that work for requests where Field Masking is applied
  2. Narrow use-cases where large optimizations can take place?
peternied commented 3 months ago

> Should this be split into multiple parts?

Yes please - ideally each change would be in isolation with clean before and after results that prove we are moving in the right direction.

stephen-crawford commented 3 months ago

I just finished going over the DlsFlsLeafReader in a similar manner to the DlsFlsValveImpl. There were fewer obvious changes to be made there than in the latter.

Ultimately some options I found were:

Now going to break the changes for the DlsFlsValveImpl into the following:

| Change Version | Change |
|---|---|
| 1 | Avoid aggregation builder expansion |
| 2 | Remove logging |
| 3 | Use HashSets |
| 4 | Use loops instead of streams |
| 5 | Condense setFls code |

I will share the test results of each change type after I make them.

I am not going to test the interface-modifying changes at the moment, since that requires more effort. I will see how much improvement these changes make, and if it is not much, I will go forward with the interface changes.

stephen-crawford commented 3 months ago
(Screenshot 2024-06-28 at 12:24:07 PM: test suite results for each change configuration)

These are the results of running the test suite after making the above change configurations.

As we can see, there is not an obvious pattern for what is the most effective way to improve the heap performance of the different query types.

I am not sure of the next steps for improving this, so I am going to reach out to @jainankitk to see if they have any recommendations.