opensearch-project / security

🔐 Secure your cluster with TLS, numerous authentication backends, data masking, audit logging as well as role-based access control on indices, documents, and fields
https://opensearch.org/docs/latest/security-plugin/index/
Apache License 2.0

[Decision Doc] Async Operations for Extensions #2574

Open cwperks opened 1 year ago

cwperks commented 1 year ago

This issue presents 4 options for enabling the use-case of running jobs asynchronously for extensions, along with their pros and cons. A full long-term solution for extensions could combine several of the options presented in this document. The 4 options are:

1. Auth Tokens
2. API Keys
3. OAuth Tokens
4. Passing a short-lived token to an extension through Job Scheduler invocation (recommended)

See each respective section for more details of each solution and the appendix for additional discussion on service account tokens for extensions to interact with their own system indices.

Problem Statement

The first milestone for extensions includes the conversion of the Anomaly Detection (AD) backend from a plugin to an extension. AD has a requirement to run jobs on behalf of a user on a schedule to monitor indices in a cluster for a detector. In the plugin model, AD serializes the user (including roles and backend roles) upon detector creation and saves it as part of the detector metadata. When it comes time to run the detector, the AD plugin performs roles injection to evaluate the permissions of a dedicated user (called plugin) with the roles stored in the detector's metadata. This will not work for AD running as an extension, because an extension will be treated as third-party and roles information will not be shared with it unless it runs in a legacy plugin compatibility mode. There will be no extension equivalent to roles injection; instead, the extension will submit requests to the OpenSearch cluster that contain an auth token, which can be used to authenticate and authorize each request.

More generally, outside of the Anomaly Detection use-case, the extensions and security teams need to provide guidance for extension developers on how to implement extensions that interact with the OpenSearch cluster asynchronously. When writing software that utilizes an OpenSearch client, developers typically configure the client with a username and password defined in the internal user list of OpenSearch. This will not work for extensions: the password and other sensitive information will remain internal to OpenSearch and never be shared with an extension. An alternative approach for asynchronous jobs needs to be developed.

Options

Option 1: Auth Tokens

Auth Tokens are tokens that: 1) confer access to the cluster and are associated with a user, 2) have a defined lifetime (an indefinite lifetime is possible but discouraged), 3) carry authorizations that are a subset of the creator's authorizations at the time the token is created, and 4) come with a management suite of APIs to Grant, Revoke, List, and Search.
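The lifetime and subset-of-creator properties above can be sketched as follows. This is a minimal illustration with hypothetical class and field names, not the security plugin's actual API:

```java
import java.time.Instant;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of an Option 1 auth token. All names are illustrative,
// not the security plugin's actual API.
class AuthToken {
    final String creator;                 // user the token is associated with
    final Set<String> grantedPermissions; // capped at creation time
    final Instant expiration;             // defined lifetime

    AuthToken(String creator, Set<String> requested, Set<String> creatorPermissions, Instant expiration) {
        // Property 3: granted permissions are the intersection of what was
        // requested and what the creator held at the time of creation.
        Set<String> granted = new HashSet<>(requested);
        granted.retainAll(creatorPermissions);
        this.creator = creator;
        this.grantedPermissions = granted;
        this.expiration = expiration;
    }

    // Property 2: the token stops working once its lifetime elapses.
    boolean allows(String permission, Instant now) {
        return now.isBefore(expiration) && grantedPermissions.contains(permission);
    }
}
```

A token created this way can never authorize more than its creator could at creation time, even if the requested permission set was broader.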

Pros:

Cons:

Option 2: API Keys

API Keys are: 1) a mechanism for services to authenticate and be authorized, 2) indefinite or limited to a designated lifetime, 3) associated with authorizations that are a subset of the creator's at the time the API Key is created, 4) optionally restricted to specific APIs for which API Key authentication is permissible, and 5) accompanied by a management suite of APIs to Create, Invalidate, List, and Search.
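Property 4, the per-API restriction, can be sketched as a simple allowlist check (hypothetical names; not an actual security plugin API):

```java
import java.util.Set;

// Hypothetical sketch of API Key property 4: a key may restrict which APIs
// accept API Key authentication. Names are illustrative only.
class ApiKeyRestrictions {
    private final Set<String> allowedApis; // empty = no restriction

    ApiKeyRestrictions(Set<String> allowedApis) {
        this.allowedApis = allowedApis;
    }

    boolean permits(String apiName) {
        return allowedApis.isEmpty() || allowedApis.contains(apiName);
    }
}
```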

Pros:

Cons:

Option 3: OAuth Tokens

OAuth Tokens are: 1) short-lived access tokens and longer-lived refresh tokens that can be used to act on behalf of a user asynchronously, 2) an access token is only valid for a short window, and the refresh token can be used to obtain a new access token and a new refresh token, 3) if the user grants the extension the ability to act on their behalf, the extension can store the refresh token and use it asynchronously to obtain access tokens for the user, 4) the user grants the extension the ability to act on their behalf, 5) the user must re-grant that ability when the absolute lifetime of the tokens expires, and 6) an admin has the ability to revoke authorization per extension.
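The rotation described in points 2 and 3 can be sketched with a toy in-memory model (illustrative names only; a real OAuth server persists, signs, and scopes these tokens):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of OAuth-style token rotation (Option 3): redeeming a
// refresh token returns a new access token AND a new refresh token, and the
// old refresh token becomes invalid. Names are illustrative only.
class RefreshTokenStore {
    // refresh token -> user it acts on behalf of
    private final Map<String, String> validRefreshTokens = new HashMap<>();

    String grant(String user) {
        String refreshToken = UUID.randomUUID().toString();
        validRefreshTokens.put(refreshToken, user);
        return refreshToken;
    }

    // Returns {newAccessToken, newRefreshToken}, or null if invalid/already used.
    String[] redeem(String refreshToken) {
        String user = validRefreshTokens.remove(refreshToken); // one-time use
        if (user == null) return null;
        String accessToken = "access-" + UUID.randomUUID();
        String newRefreshToken = grant(user); // rotate
        return new String[] { accessToken, newRefreshToken };
    }
}
```

Rotation limits the blast radius of a leaked refresh token: once it is redeemed (by the extension or an attacker), the old token is dead.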

Pros:

Cons:

Option 4: Passing a short-lived token to an extension through job scheduler invocation (Recommended)

(Needs further analysis) An alternative to the 3 options above would be to generate a token in core and forward it to the extension when the job is invoked. Job Scheduler is moving to core, and it is possible that Job Scheduler will be the mechanism for triggering jobs registered by extensions. If so, a token can be created when the job is invoked and passed to the extension. Just-in-time tokens will be used for handling REST requests, so this follows the same pattern of issuing a token exactly when it is needed.
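The just-in-time pattern can be sketched as follows (hypothetical names; this is not the actual Job Scheduler or security plugin API):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of Option 4: core mints a short-lived token at the
// moment Job Scheduler invokes a job and forwards it to the extension.
class JitTokenIssuer {
    static class JitToken {
        final String subject;  // user the job runs on behalf of
        final Instant expiry;
        JitToken(String subject, Instant expiry) {
            this.subject = subject;
            this.expiry = expiry;
        }
    }

    // The token is created exactly when the job fires, never ahead of time,
    // so there is no long-lived credential for the extension to hold.
    static JitToken issueForJobInvocation(String jobOwner, Instant invokedAt, Duration lifetime) {
        return new JitToken(jobOwner, invokedAt.plus(lifetime));
    }
}
```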

Pros:

Cons:

Appendix

System Index Interaction

One of the challenges with converting plugins to extensions is the use-case of registering and interacting with system indices. Currently, plugins elevate their own privileges by stashing the ThreadContext and creating or writing to their system index in the cleared context. By clearing the context, plugins can in effect act as a superuser and bypass checks in the Security plugin. This will not work for extensions, which will initially run out-of-process.

For extensions, there needs to be a mechanism that allows an extension to interact with the cluster as itself, creating and writing to its own indices if the extension developer chooses to use OpenSearch for persistence. One solution could be Service Accounts and tokens for those accounts, where the extension bears a token that represents itself when making requests to the OpenSearch cluster. This token would grant narrowly scoped privileges, limited to the extension's own system indices and no others.
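That narrowly scoped check could look like the following sketch (hypothetical names; the real enforcement would live inside the security plugin's privilege evaluation):

```java
import java.util.Set;

// Hypothetical sketch of a service-account check: an extension bearing a
// service-account token may touch only its own registered system indices.
class ServiceAccountAuthorizer {
    static boolean mayAccess(Set<String> extensionSystemIndices, String targetIndex) {
        // Narrowly scoped: only the extension's own system indices, nothing else.
        return extensionSystemIndices.contains(targetIndex);
    }
}
```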

Additional Considerations

References

  1. https://github.com/opensearch-project/security/issues/1504 - Request in the Security backlog for API Keys
peternied commented 1 year ago

I'm sold on Just-in-Time tokens

Additionally, choosing Just-in-Time tokens does not preclude other authentication schemes at a future time.

I'd love to see plans for implementation; let me know if I can help facilitate.


Issue feedback: Thanks for writing this up, I would recommend doing some tightening/rearranging

I don't think this title represents what this issue discusses and recommends. If there is a meta issue around support for async tasks, could we rename this issue?

In the Problem Statement, can you slim down this section? The four options you are discussing concern how authentication information is passed to an OpenSearch cluster from an extension; all the extra detail makes this hard to parse. What do you think about moving the extra detail to an appendix/additional context section at the end?

In Option 4, there is a comment (Needs further analysis). Is this still accurate, if so what additional information do you need?

peternied commented 1 year ago

@cwperks From our discussion this afternoon

As per the discussion in https://github.com/opensearch-project/OpenSearch/issues/5310, it looks like the Job Scheduler might become a core feature in OpenSearch. We need to figure out how the Job Scheduler will store the principals of the job that is going to execute. E.g., do we need to encrypt those principals as encrypted principal identifier tokens?

I have a few questions related to Job Scheduler authentication that I hope we can address:

cwperks commented 1 year ago

Jotting down a few notes on how this works in the plugin architecture of OpenSearch. Primarily, jobs running asynchronously that want to simulate a user's privileges use Roles Injection: they create an in-memory user called plugin and evaluate that user's privileges with the roles that were injected. In the case of Anomaly Detection, the roles injected into the ThreadContext are the user's roles (post roles resolution), and the request initiated by the plugin runs in that context.
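A simplified sketch of that flow follows. The real implementation uses OpenSearch's ThreadContext and the security plugin's injected-roles transient; the map, key, and encoding below are stand-ins, not the actual constants:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of roles injection. The real flow writes a transient into
// OpenSearch's ThreadContext; this key and map are stand-ins for illustration.
class RolesInjectionSketch {
    static final String INJECTED_ROLES_KEY = "injected_roles"; // stand-in key

    // The plugin writes user + roles into the context before issuing a request...
    static void injectRoles(Map<String, String> threadContext, String user, String... roles) {
        threadContext.put(INJECTED_ROLES_KEY, user + "|" + String.join(",", roles));
    }

    // ...and the security side later evaluates privileges against those roles
    // instead of resolving roles for a stored user.
    static String[] readInjectedRoles(Map<String, String> threadContext) {
        String value = threadContext.get(INJECTED_ROLES_KEY);
        return value == null ? new String[0] : value.split("\\|")[1].split(",");
    }
}
```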

Relevant areas to look at:

Where does AD store the user info when a detector is created?

A: Anomaly Detection stores this info with the detector metadata in the ANOMALY_DETECTORS_INDEX = .opendistro-anomaly-detectors

See: https://github.com/opensearch-project/anomaly-detection/blob/main/src/main/java/org/opensearch/ad/rest/handler/AbstractAnomalyDetectorActionHandler.java#L795-L823

How does Job Scheduler work in the plugin model and how is it connected to AD?

A: JobScheduler is an ExtensiblePlugin, meaning that other plugins can extend an interface created by JobScheduler. The extension point is called JobSchedulerExtension, and implementers of this extension point must implement 4 methods: 1. getJobType(), 2. getJobIndex(), 3. getJobRunner(), and 4. getJobParser().

See the extensible interface here: https://github.com/opensearch-project/job-scheduler/blob/main/spi/src/main/java/org/opensearch/jobscheduler/spi/JobSchedulerExtension.java

getJobIndex() is important here because this is where JS expects to find information about the schedule of the job. (Note: Job Scheduler does not own this index; it is the index JS is told to read from.)

See how AD extends this extension point here: https://github.com/opensearch-project/anomaly-detection/blob/main/src/main/java/org/opensearch/ad/AnomalyDetectorPlugin.java#L1011-L1027

@Override
public String getJobType() {
    return AD_JOB_TYPE;
}

@Override
public String getJobIndex() {
    return AnomalyDetectorJob.ANOMALY_DETECTOR_JOB_INDEX;
}

@Override
public ScheduledJobRunner getJobRunner() {
    return AnomalyDetectorJobRunner.getJobRunnerInstance();
}

@Override
public ScheduledJobParser getJobParser() {
    return (parser, id, jobDocVersion) -> {
        XContentParserUtils.ensureExpectedToken(XContentParser.Token.START_OBJECT, parser.nextToken(), parser);
        return AnomalyDetectorJob.parse(parser);
    };
}

ANOMALY_DETECTOR_JOB_INDEX is .opendistro-anomaly-detector-jobs - this index is in the list of system indices of the security plugin: https://github.com/opensearch-project/security/blob/main/tools/install_demo_configuration.sh#L386

How does the Job Scheduler job sweeper work?

The sweeper iterates through the list of plugins that have extended JobSchedulerExtension and keeps a registry, indexToProviders: a map from the indices registered via getJobIndex() to a ScheduledJobProvider. On each iteration the sweeper issues a SearchRequest (https://github.com/opensearch-project/job-scheduler/blob/main/src/main/java/org/opensearch/jobscheduler/sweeper/JobSweeper.java#L391-L400) and parses the results with the getJobParser() implementation. (Note: I am not positive how this SearchRequest is authorized since it's searching a system index; I will follow up with more details.)
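The dispatch described above can be sketched as follows (the types are simplified stand-ins for the Job Scheduler SPI classes, kept minimal to show the index-to-provider lookup):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Simplified sketch of the sweeper's dispatch: a registry from each registered
// job index (getJobIndex) to its provider, used to pick the right parser and
// runner for a swept document. Types are stand-ins for the SPI classes.
class SweeperSketch {
    interface JobRunner { String run(String parsedJob); }

    static class Provider {
        final Function<String, String> parser; // stands in for getJobParser()
        final JobRunner runner;                // stands in for getJobRunner()
        Provider(Function<String, String> parser, JobRunner runner) {
            this.parser = parser;
            this.runner = runner;
        }
    }

    private final Map<String, Provider> indexToProviders = new HashMap<>();

    void register(String jobIndex, Provider provider) {
        indexToProviders.put(jobIndex, provider);
    }

    // For a document swept from a job index: parse it, then hand it to the runner.
    String sweep(String jobIndex, String jobDoc) {
        Provider provider = indexToProviders.get(jobIndex);
        return provider.runner.run(provider.parser.apply(jobDoc));
    }
}
```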

This is the line that parses the job: https://github.com/opensearch-project/job-scheduler/blob/main/src/main/java/org/opensearch/jobscheduler/sweeper/JobSweeper.java#L262

ScheduledJobParameter jobParameter = provider.getJobParser().parse(parser, docId, jobDocVersion);

See relevant lines in AnomalyDetectorJob.parse that parses out the schedule: https://github.com/opensearch-project/anomaly-detection/blob/main/src/main/java/org/opensearch/ad/model/AnomalyDetectorJob.java#L192-L194

case SCHEDULE_FIELD: // SCHEDULE_FIELD = schedule
    schedule = ScheduleParser.parse(parser); // ScheduleParser is imported from JS
    break;

Note: The job parser also parses the user from the persistence document for the job in the ad jobs index.

Below is an example from reporting of what this schedule field looks like in a document that represents a job definition: https://github.com/opensearch-project/reporting/blob/main/src/main/kotlin/org/opensearch/reportsscheduler/model/ReportDefinition.kt#L43-L52

 *       "schedule":{ // required when triggerType is CronSchedule or IntervalSchedule
 *           "cron":{ // required when triggerType is CronSchedule
 *               "expression":"0 * * * *",
 *               "timezone":"PST"
 *           },
 *           "interval":{ // required when triggerType is IntervalSchedule
 *               "start_time":1603506908773,
 *               "period":10,
 *               "unit":"Minutes"
 *           }
 *       }  

a few lines down it also calls getJobRunner(): https://github.com/opensearch-project/job-scheduler/blob/main/src/main/java/org/opensearch/jobscheduler/sweeper/JobSweeper.java#L267

ScheduledJobRunner jobRunner = this.indexToProviders.get(shardId.getIndexName()).getJobRunner();

The JobRunner is a singleton class that implements JS' ScheduledJobRunner: https://github.com/opensearch-project/job-scheduler/blob/main/spi/src/main/java/org/opensearch/jobscheduler/spi/ScheduledJobRunner.java

This is a simple interface that has one method to implement: runJob

For getJobType AD simply returns opendistro_anomaly_detector.

cwperks commented 1 year ago

@peternied @opensearch-project/security Thank you for the help with the decision to go with just-in-time tokens for the design. Just-in-time tokens will follow a very similar flow to extension REST handlers, which @RyanL1997 is leading, so there will be a great amount of re-use. I have some follow-up decisions I'd like to make to finish scoping out work items, and here is what I propose to get this accomplished.

tldr; version: Since jobs are created via extension REST handlers, I am proposing that, for handlers responsible for creating jobs, the security plugin provide a refresh token to the extension to store with the job's schedule definition.

On job scheduler invocation, JS will ask security to issue a new access token given this refresh token. The access token will be forwarded to the extension so that the job can run.

The refresh token cannot be used for REST requests; its only use is for JS to request a new access token on behalf of a user.


Longer version

Extensions will have a special index to store job definitions in. Part of each job definition is information about the schedule for JS. As part of this change, the job definition will also store the refresh token as part of the schedule.

 *       "schedule":{ // required when triggerType is CronSchedule or IntervalSchedule
 *           "refresh_token": "<refresh_token>",
 *           "cron":{ // required when triggerType is CronSchedule
 *               "expression":"0 * * * *",
 *               "timezone":"PST"
 *           },
 *           "interval":{ // required when triggerType is IntervalSchedule
 *               "start_time":1603506908773,
 *               "period":10,
 *               "unit":"Minutes"
 *           }
 *       }

JS will parse this schedule, including the refresh token. A new interface will be exposed in core via the IdentityPlugin interface, allowing JS to request the issuance of a new access token to forward to the extension.

Initially, I am thinking that refresh tokens are one-time use JWTs that have a claim that identifies it as a refresh token. The encrypted (mapped) roles and backend roles will be claims in the refresh token so that they can be used to add the claims back into the new access token.
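A sketch of how those claims might be laid out follows. The claim names are illustrative, not a settled format, and a real implementation would sign these as JWTs rather than pass raw maps:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed refresh-token claims: a claim marking
// the token as a refresh token, plus encrypted role claims that are copied
// into the new access token on redemption. Claim names are illustrative only.
class RefreshTokenClaims {
    static Map<String, String> buildRefreshClaims(String subject, String encryptedRoles, String encryptedBackendRoles) {
        Map<String, String> claims = new HashMap<>();
        claims.put("sub", subject);
        claims.put("typ", "refresh");             // identifies this as a refresh token
        claims.put("er", encryptedRoles);         // encrypted (mapped) roles
        claims.put("ebr", encryptedBackendRoles); // encrypted backend roles
        return claims;
    }

    // When the refresh token is redeemed, the role claims carry over into the
    // new access token so privileges can be evaluated without re-resolution.
    static Map<String, String> buildAccessClaims(Map<String, String> refreshClaims) {
        Map<String, String> claims = new HashMap<>(refreshClaims);
        claims.put("typ", "access");
        return claims;
    }
}
```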

One of the challenges I am currently facing is how JS and the security plugin will interact so that JS can request the issuance of a new access token. This is a preliminary design; I will provide more details shortly.

To implement the refresh token as one-time use, I propose storing valid refresh tokens in a special security index, with documents keyed by the job ID:

/opendistro_security_refresh_tokens
{
  "id": 1,
  "jobId": "<jobId>",
  "currentRefreshToken": "<current_refresh_token>"
},
{
  "id": 2,
  "jobId": "<jobId2>",
  "currentRefreshToken": "<current_refresh_token>"
},
...
peternied commented 1 year ago

tldr; version: Since jobs are created via extension REST handlers, I am proposing that, for handlers responsible for creating jobs, the security plugin provide a refresh token to the extension to store with the job's schedule definition.

It sounds like this is a feature along the lines of: "If Job Scheduler triggers an extension to run a job, and the job takes longer than the expiration time on its JIT token, how does the extension communicate the job state to the cluster?"

I think the answer will depend on what controls exist on the JIT expiry value, how administrators of the cluster want to allow late/longer-running jobs to finish (fail/continue?), and how important this scenario is to extension developers. What do you think about recording your design considerations for this feature separately?

cwperks commented 1 year ago

@peternied I will add an issue about how to determine how long a token should be valid. For some jobs there is an expected run time, which is what I am initially thinking of using to determine the token's expiration. The security plugin will enforce a maximum number of minutes that auth tokens can be valid for.
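That expiry rule can be sketched in a single line (illustrative names; the actual cap would be a security plugin setting):

```java
// Minimal sketch of the proposed expiry rule: use the job's expected run time,
// capped by a security-plugin-enforced maximum. Names are illustrative.
class TokenExpiry {
    static long tokenValidityMinutes(long expectedRunTimeMinutes, long maxAllowedMinutes) {
        return Math.min(expectedRunTimeMinutes, maxAllowedMinutes);
    }
}
```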