[FEATURE] Search on Pattern-based field masking fields

antoniocascais commented 2 years ago

Describe the bug A document exists with field log and text FOO, but when I try to search for FOO I get 0 results back.

To Reproduce Steps to reproduce the behavior:

Search "FOO" and get nothing back.

Expected behavior I would expect the existing document to be found.

Plugins

Query Workbench
Reporting
- Alerting
- Anomaly Detection
- Observability
- Index Management

Screenshots

The document exists:

When I try to search on it, 0 results are found (also same behavior if I use the Opensearch-Dashboards UI instead of the search api):

Host/Environment (please complete the following information):

OS: ArchLinux
Firefox v103
Opensearch service from AWS

Additional context If I do the exact same search on another index, I get results back. So I guess the issue is related with this index somehow, but no idea what's happening.

rockybean commented 2 years ago

Can you share your index mappings? These unexpected search result is often caused by incorrect tokenization.

You can get detailed tokenized results via below api. https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html

antoniocascais commented 2 years ago

Hi @rockybean , thanks for getting back to me.

The analyze looks ok to me:

GET logs-000040/_analyze
{
  "analyzer": "standard",
  "text": "coinbase"
}

# response
{
  "tokens" : [
    {
      "token" : "coinbase",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

And the mappings on the field I'm searching also look good to me:

GET logs-000040/_mapping/field/log

#response
{
  "logs-000040" : {
    "mappings" : {
      "log" : {
        "full_name" : "log",
        "mapping" : {
          "log" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

Can you see something wrong here?

rockybean commented 2 years ago

Hi, @antoniocascais

I test with below script and all is going well.


POST _analyze
{
  "analyzer": "standard",
  "text": ["2022/10/2508:38:09({name: coinbase,timestamp: 2022-10-25 08:38:09.510236561+0000UTC"]
}

POST logs-test/_doc
{
  "log":"2022/10/2508:38:09({name: coinbase,timestamp: 2022-10-25 08:38:09.510236561+0000UTC"
}

GET logs-test/_mapping

GET logs-test/_search
{
  "query":{
    "match": {
      "log": "coinbase"
    }
  }
}

There should be something wrong with mapping as most of these kind issues are caused by incorrect analyzer setting.

You can also use termvector api to get analyzed tokens from current doc. https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html

Hope this can help.

antoniocascais commented 2 years ago

Hi @rockybean .

Once again, thank you for the help.

I have narrowed down my problem: i created 2 new indices, one for env A and one for env B. Then I send logs from my application to those 2 indices (i mean, application running in env A sends to index A, app running in env B sends to index B). The mappings of these 2 indices are exactly the same. However, env A returns results when I search on it, env B doesn't.

I'm getting out of ideas on how to look into this problem. Do you have any suggestion on how to find out what's going on?

edit: ok, finally figured out what was the source of the problem! We were applying some masking rules to hide ip addresses, like mentioned here: https://opensearch.org/docs/latest/security-plugin/access-control/field-masking/

However, shouldn't the masking fields option just replace one string (in this case, the ip address) with another string and that's it? Why can't we search on the field if some values of it are masked?

rockybean commented 2 years ago

not familiar with field mask functionality and will test on my own env later.

dblock commented 2 years ago

Thanks @rockybean for your help here!

I renamed this issue to a feature request "Search on masked fields", ie. "shouldn't the masking fields option just replace one string (in this case, the ip address) with another string" for now. But isn't that how it works today? When you say "can't search", you mean that you're not finding anything? That's because the text is garbage, field masking today replaces field values with a cryptographic hash.

@antoniocascais Assuming the above works as expected, is there anything you think should change in the current behavior that would have made it easier for you to find the issue, or could be valuable as a feature?

antoniocascais commented 2 years ago

@dblock in the docs it has the following example (https://opensearch.org/docs/latest/security-plugin/access-control/field-masking/#advanced-pattern-based-field-masking):

      _masked_fields_:
      - "title::/./::*"
      - "genres::/^[a-zA-Z]{1,3}/::XXX::/[a-zA-Z]{1,3}$/::YYY"

The genres statement changes the first three characters of the string to XXX and the last three characters to YYY.

By reading this, I would expect I can still search in the genres field. For example, if my genres field has a value of SPOKY FOO MOVIES I'm expecting that it is masked to XXXKY FOO MOVYYY

Then I'm expecting that if I search for FOO I actually get this result back. Is my assumption wrong?

dblock commented 2 years ago

Let's move this into the security repo that implements masked fields. I agree with your expectations. I think next steps would be to write a unit/integration test that reproduces this.

cwperks commented 2 years ago

[Triage] Updating the name of this issue to Search on Pattern-based field masking fields. @scrawfor99 would you please take a look into this issue?

cliu123 commented 2 years ago

When I read the documentation, I see the following in the index permission of the role. @antoniocascais Could you confirm if this is something missing in your configuration?

      '*':
      - "READ"

antoniocascais commented 1 year ago

Hi @cliu123 .

I do have READ permissions on the index. I can search in other fields that exist in the index, just not the masked one.

antoniocascais commented 1 year ago

Any news on this topic?

frodeb commented 1 year ago

Hello @cwperks & @scrawfor99 - any news on this topic? We're facing the same issue as @antoniocascais with the pattern-based field masking.

MaciejMierzwa commented 1 year ago

I'll start working on it

MaciejMierzwa commented 1 year ago

Hi, so after some analysis it looks like it might be pretty challenging to implement this feature. The reason why OpenSearch doesn't allow searching through the masked fields is because of how the search feature is implemented. OpenSearch uses an inverted index for indexing documents and performing full-text search.

The simplified version is: after a document is put into an index longer fields are split, sorted and the engine creates a lookup table with a map containing keywords and pointers to documents in which the keyword occurs. It allows for a quick search. Because of this search mechanism for some specific masks it would be possible to implement this feature, but in most of popular cases it's not easy. During indexing Opensearch is not interested in the order of the words in text. There is no easy way to tell if the string should be hidden by particular masking rule. Some common examples are finding specific phrase and filtering out everything that follows, or filtering out first/last characters of longer text.

For user example, if the genres field has a value of `SPOKY FOO MOVIES` it's expected that it is masked to `XXXKY FOO MOVYYY`. The inverted index will would tokenize the expression into set of strings each pointing to the same document. If we allow search through masked fields an attacker without direct access could create own lookup table and discover what data is in the field.	Token	Document
FOO	1
MOVIES	1
MOVIE (stemming from word MOVIES)	1
SPOOKY	1

No indication which token should be masked is created

I came up with 2 options for how to combat this, both have pretty heavy drawbacks.

1: We could allow to search through masked fields as shown in example, after index returns set of documents we could run masking regex on each of them and double check if string we're searching for is masked or not. This would be very slow solution, If keyword we're searching for exists in most of the database documents it would make whole search n-times slower. With number of documents that are stored search time would also grow. 2: There could be a special index created that would tokenize masked fields. It would allow to search masked fields, but again there are some heavy drawbacks. One is the additional space used by this index. Another is the amount of indexing. Every time a masking field rule is modified or added, all documents in the cluster would need to be re-analyzed. So this idea would also probably introduce performance issues.

Both solutions are far from perfect, I think the best way to manage this issue is to use already existing mechanisms. During document creation tokens that require anonymization could be separated and masked into separate fields. On the other hand parts of messages that need to be searchable could be also separated and then an inverted index would do its job.

peternied commented 1 year ago

@MaciejMierzwa Thanks for the time to write that response up, ultimately I think there is a fundamental problems with the relationship between masked fields and indexed data. Masking is done post indexing process, its dynamically applied to queries. With the masking rules being changeable, this would require a 'hidden' re-index code on every change change to the rules - depending on index size this could create a massive 'surprise' load.

The alternative approach, which is to take data for one index, transform it (apply a masking) onto a second index, and then provide access to those indexes separately is much easier, and I would advise 'established' masking rules are applied via a transformation process that is controlled by the operators.

I don't think there is a way to implement either solution without significant tradeoffs in storage / computational complexity

opensearch-project / security

[FEATURE] Search on Pattern-based field masking fields #2220