opensearch-project / security

🔐 Secure your cluster with TLS, numerous authentication backends, data masking, audit logging as well as role-based access control on indices, documents, and fields
https://opensearch.org/docs/latest/security-plugin/index/
Apache License 2.0

[Decision Doc] Async Operations for Extensions #2574

Open cwperks opened 1 year ago

cwperks commented 1 year ago

This issue presents 4 options for enabling the use-case of running jobs asynchronously for extensions, along with their pros and cons. A full long-term solution for extensions could combine several of the options presented in this document. The 4 options are:

1. Auth Tokens
2. API Keys
3. OAuth Tokens
4. Passing a short-lived token to an extension through Job Scheduler invocation (recommended)

See each respective section for more details of each solution and the appendix for additional discussion on service account tokens for extensions to interact with their own system indices.

Problem Statement

The first milestone for extensions includes the conversion of the Anomaly Detection (AD) backend from a plugin to an extension. AD has a requirement to run jobs on behalf of a user on a schedule to monitor indices in a cluster for a detector. In the plugin model, AD serializes the user (including roles and backend roles) upon detector creation and saves it as part of the detector metadata. When it comes time to run the detector, the AD plugin performs roles injection to evaluate the permissions of a dedicated user (called plugin) with the roles stored in the detector's metadata. This will not work for AD running as an extension, because an extension will be treated as third-party and roles information will not be shared with it unless it runs in a legacy plugin compatibility mode. There will be no extension equivalent to roles injection; instead, the extension will submit requests to the OpenSearch cluster that contain an auth token, which can be used to authenticate and authorize each request.

More generally, outside of the Anomaly Detection use-case, the extensions and security teams need to provide guidance for extension developers on how to implement extensions that interact with the OpenSearch cluster asynchronously. When writing software that utilizes an OpenSearch client, developers typically configure the client with a username and password defined in the internal user list of OpenSearch. This will not work for extensions: the password and other sensitive information will remain internal to OpenSearch and never be shared with an extension. An alternative approach for asynchronous jobs needs to be developed.

Options

Option 1: Auth Tokens

Auth Tokens are tokens that: 1) confer access to the cluster and are associated with a user, 2) have a defined lifetime (an indefinite lifetime is possible but discouraged), 3) carry authorizations that are a subset of the creator's authorizations at the time the token is created, and 4) come with a management suite of APIs to Grant, Revoke, List, and Search.
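The lifetime and subset-of-creator properties above can be sketched as follows. This is a minimal illustration with hypothetical class and field names, not the security plugin's actual API:

```java
import java.time.Instant;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of an Option 1 auth token. All names are illustrative,
// not the security plugin's actual API.
class AuthToken {
    final String creator;                 // user the token is associated with
    final Set<String> grantedPermissions; // capped at creation time
    final Instant expiration;             // defined lifetime

    AuthToken(String creator, Set<String> requested, Set<String> creatorPermissions, Instant expiration) {
        // Property 3: granted permissions are the intersection of what was
        // requested and what the creator held at the time of creation.
        Set<String> granted = new HashSet<>(requested);
        granted.retainAll(creatorPermissions);
        this.creator = creator;
        this.grantedPermissions = granted;
        this.expiration = expiration;
    }

    // Property 2: the token stops working once its lifetime elapses.
    boolean allows(String permission, Instant now) {
        return now.isBefore(expiration) && grantedPermissions.contains(permission);
    }
}
```

A token created this way can never authorize more than its creator could at creation time, even if the requested permission set was broader.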

Pros:

Cons:

Option 2: API Keys

API Keys are: 1) a mechanism for services to authenticate and be authorized, 2) indefinite or limited to a designated lifetime, 3) associated with authorizations that are a subset of the creator's at the time the API Key is created, 4) optionally restricted to specific APIs for which API Key authentication is permissible, and 5) accompanied by a management suite of APIs to Create, Invalidate, List, and Search.
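Property 4, the per-API restriction, can be sketched as a simple allowlist check (hypothetical names; not an actual security plugin API):

```java
import java.util.Set;

// Hypothetical sketch of API Key property 4: a key may restrict which APIs
// accept API Key authentication. Names are illustrative only.
class ApiKeyRestrictions {
    private final Set<String> allowedApis; // empty = no restriction

    ApiKeyRestrictions(Set<String> allowedApis) {
        this.allowedApis = allowedApis;
    }

    boolean permits(String apiName) {
        return allowedApis.isEmpty() || allowedApis.contains(apiName);
    }
}
```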

Pros:

Cons:

Option 3: OAuth Tokens

OAuth Tokens are: 1) short-lived access tokens and longer-lived refresh tokens that can be used to act on behalf of a user asynchronously, 2) an access token is only valid for a short window, and the refresh token can be used to obtain a new access token and a new refresh token, 3) if the user grants the extension the ability to act on their behalf, the extension can store the refresh token and use it asynchronously to obtain access tokens for the user, 4) the user grants the extension the ability to act on their behalf, 5) the user must re-grant that ability when the absolute lifetime of the tokens expires, and 6) an admin has the ability to revoke authorization per extension.
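The rotation described in points 2 and 3 can be sketched with a toy in-memory model (illustrative names only; a real OAuth server persists, signs, and scopes these tokens):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of OAuth-style token rotation (Option 3): redeeming a
// refresh token returns a new access token AND a new refresh token, and the
// old refresh token becomes invalid. Names are illustrative only.
class RefreshTokenStore {
    // refresh token -> user it acts on behalf of
    private final Map<String, String> validRefreshTokens = new HashMap<>();

    String grant(String user) {
        String refreshToken = UUID.randomUUID().toString();
        validRefreshTokens.put(refreshToken, user);
        return refreshToken;
    }

    // Returns {newAccessToken, newRefreshToken}, or null if invalid/already used.
    String[] redeem(String refreshToken) {
        String user = validRefreshTokens.remove(refreshToken); // one-time use
        if (user == null) return null;
        String accessToken = "access-" + UUID.randomUUID();
        String newRefreshToken = grant(user); // rotate
        return new String[] { accessToken, newRefreshToken };
    }
}
```

Rotation limits the blast radius of a leaked refresh token: once it is redeemed (by the extension or an attacker), the old token is dead.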

Pros:

Cons:

Option 4: Passing a short-lived token to an extension through job scheduler invocation (Recommended)

(Needs further analysis) An alternative to the 3 options above would be to generate a token in core and forward it to the extension when the job is invoked. Job Scheduler is moving to core, and it is possible that Job Scheduler will be the mechanism for triggering jobs registered by extensions. If so, a token can be created when the job is invoked and passed to the extension. Just-in-time tokens will be used for handling REST requests, so this follows the same pattern of issuing a token exactly when it is needed.
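The just-in-time pattern can be sketched as follows (hypothetical names; this is not the actual Job Scheduler or security plugin API):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of Option 4: core mints a short-lived token at the
// moment Job Scheduler invokes a job and forwards it to the extension.
class JitTokenIssuer {
    static class JitToken {
        final String subject;  // user the job runs on behalf of
        final Instant expiry;
        JitToken(String subject, Instant expiry) {
            this.subject = subject;
            this.expiry = expiry;
        }
    }

    // The token is created exactly when the job fires, never ahead of time,
    // so there is no long-lived credential for the extension to hold.
    static JitToken issueForJobInvocation(String jobOwner, Instant invokedAt, Duration lifetime) {
        return new JitToken(jobOwner, invokedAt.plus(lifetime));
    }
}
```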

Pros:

Cons:

Appendix

System Index Interaction

One of the challenges with converting plugins to extensions is the use-case of registering and interacting with system indices. Currently, plugins elevate their own privileges by stashing the ThreadContext and creating or writing to their system index in the cleared context. By clearing the context, plugins can in effect act as a superuser and bypass checks in the Security plugin. This will not work for extensions, which will initially run out-of-process.

For extensions, there needs to be a mechanism that allows an extension to interact with the cluster as itself, creating and writing to its own indices if the extension developer chooses to use OpenSearch for persistence. One solution could be Service Accounts and tokens for those accounts, where the extension bears a token that represents itself when making requests to the OpenSearch cluster. This token would grant narrowly scoped privileges, limited to the extension's own system indices and no others.
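That narrowly scoped check could look like the following sketch (hypothetical names; the real enforcement would live inside the security plugin's privilege evaluation):

```java
import java.util.Set;

// Hypothetical sketch of a service-account check: an extension bearing a
// service-account token may touch only its own registered system indices.
class ServiceAccountAuthorizer {
    static boolean mayAccess(Set<String> extensionSystemIndices, String targetIndex) {
        // Narrowly scoped: only the extension's own system indices, nothing else.
        return extensionSystemIndices.contains(targetIndex);
    }
}
```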

Additional Considerations

References

  1. https://github.com/opensearch-project/security/issues/1504 - Request in the Security backlog for API Keys
peternied commented 1 year ago

I'm sold on Just-in-Time tokens

Additionally, choosing Just-in-Time tokens does not preclude other authentication schemes at a future time.

I'd love to see plans for implementation; let me know if I can help facilitate.


Issue feedback: Thanks for writing this up, I would recommend doing some tightening/rearranging

I don't think this title represents what this issue discusses and recommends. If there is a meta issue around support for async tasks, could we rename this issue?

In the Problem Statement, can you slim down this section? The four options you are discussing concern how authentication information is passed to an OpenSearch cluster from an extension; all the extra detail makes this hard to parse. What do you think about moving the extra detail to an appendix/additional context section at the end?

In Option 4, there is a comment (Needs further analysis). Is this still accurate, if so what additional information do you need?

peternied commented 1 year ago

@cwperks From our discussion this afternoon

As per the discussion in https://github.com/opensearch-project/OpenSearch/issues/5310, it looks like the Job Scheduler might become a core feature in OpenSearch. We need to figure out how the Job Scheduler will store the principals of the job that is going to execute. E.g., do we need to encrypt those principals as encrypted principal identifier tokens?

I have a few questions related to Job Scheduler authentication that I hope we can address:

cwperks commented 1 year ago

Jotting down a few notes on how this works in the plugin architecture of OpenSearch. Primarily, jobs running asynchronously that want to simulate a user's privileges use Roles Injection: they create an in-memory user called plugin and evaluate that user's privileges with the roles that were injected. In the case of Anomaly Detection, the roles injected into the ThreadContext are the user's roles (post roles resolution), and the request initiated by the plugin runs in that context.
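A simplified sketch of that flow follows. The real implementation uses OpenSearch's ThreadContext and the security plugin's injected-roles transient; the map, key, and encoding below are stand-ins, not the actual constants:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of roles injection. The real flow writes a transient into
// OpenSearch's ThreadContext; this key and map are stand-ins for illustration.
class RolesInjectionSketch {
    static final String INJECTED_ROLES_KEY = "injected_roles"; // stand-in key

    // The plugin writes user + roles into the context before issuing a request...
    static void injectRoles(Map<String, String> threadContext, String user, String... roles) {
        threadContext.put(INJECTED_ROLES_KEY, user + "|" + String.join(",", roles));
    }

    // ...and the security side later evaluates privileges against those roles
    // instead of resolving roles for a stored user.
    static String[] readInjectedRoles(Map<String, String> threadContext) {
        String value = threadContext.get(INJECTED_ROLES_KEY);
        return value == null ? new String[0] : value.split("\\|")[1].split(",");
    }
}
```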

Relevant areas to look at:

Where does AD store the user info when a detector is created?

A: Anomaly Detection stores this info with the detector metadata in the ANOMALY_DETECTORS_INDEX = .opendistro-anomaly-detectors

See: https://github.com/opensearch-project/anomaly-detection/blob/main/src/main/java/org/opensearch/ad/rest/handler/AbstractAnomalyDetectorActionHandler.java#L795-L823

How does Job Scheduler work in the plugin model and how is it connected to AD?

A: JobScheduler is an ExtensiblePlugin, meaning that other plugins can extend an interface created by JobScheduler. The extension point is called JobSchedulerExtension, and implementers of this extension point must implement 4 methods: 1. getJobType(), 2. getJobIndex(), 3. getJobRunner(), and 4. getJobParser().

See the extensible interface here: https://github.com/opensearch-project/job-scheduler/blob/main/spi/src/main/java/org/opensearch/jobscheduler/spi/JobSchedulerExtension.java

getJobIndex() is important here because this is where JS expects to find information about the schedule of the job. (Note: Job Scheduler does not own this index; it is the index JS is told to read from.)

See how AD extends this extension point here: https://github.com/opensearch-project/anomaly-detection/blob/main/src/main/java/org/opensearch/ad/AnomalyDetectorPlugin.java#L1011-L1027

@Override
public String getJobType() {
    return AD_JOB_TYPE;
}

@Override
public String getJobIndex() {
    return AnomalyDetectorJob.ANOMALY_DETECTOR_JOB_INDEX;
}

@Override
public ScheduledJobRunner getJobRunner() {
    return AnomalyDetectorJobRunner.getJobRunnerInstance();
}

@Override
public ScheduledJobParser getJobParser() {
    return (parser, id, jobDocVersion) -> {
        XContentParserUtils.ensureExpectedToken(XContentParser.Token.START_OBJECT, parser.nextToken(), parser);
        return AnomalyDetectorJob.parse(parser);
    };
}

ANOMALY_DETECTOR_JOB_INDEX is .opendistro-anomaly-detector-jobs - this index is in the list of system indices of the security plugin: https://github.com/opensearch-project/security/blob/main/tools/install_demo_configuration.sh#L386

How does the Job Scheduler job sweeper work?

The sweeper iterates through the list of plugins that have extended JobSchedulerExtension and keeps a registry, indexToProviders: a map from the indices registered via getJobIndex() to a ScheduledJobProvider. On each iteration the sweeper issues a SearchRequest (https://github.com/opensearch-project/job-scheduler/blob/main/src/main/java/org/opensearch/jobscheduler/sweeper/JobSweeper.java#L391-L400) and parses the results with the getJobParser() implementation. (Note: I am not positive how this SearchRequest is authorized since it's searching a system index; I will follow up with more details.)
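The dispatch described above can be sketched as follows (the types are simplified stand-ins for the Job Scheduler SPI classes, kept minimal to show the index-to-provider lookup):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Simplified sketch of the sweeper's dispatch: a registry from each registered
// job index (getJobIndex) to its provider, used to pick the right parser and
// runner for a swept document. Types are stand-ins for the SPI classes.
class SweeperSketch {
    interface JobRunner { String run(String parsedJob); }

    static class Provider {
        final Function<String, String> parser; // stands in for getJobParser()
        final JobRunner runner;                // stands in for getJobRunner()
        Provider(Function<String, String> parser, JobRunner runner) {
            this.parser = parser;
            this.runner = runner;
        }
    }

    private final Map<String, Provider> indexToProviders = new HashMap<>();

    void register(String jobIndex, Provider provider) {
        indexToProviders.put(jobIndex, provider);
    }

    // For a document swept from a job index: parse it, then hand it to the runner.
    String sweep(String jobIndex, String jobDoc) {
        Provider provider = indexToProviders.get(jobIndex);
        return provider.runner.run(provider.parser.apply(jobDoc));
    }
}
```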

This is the line that parses the job: https://github.com/opensearch-project/job-scheduler/blob/main/src/main/java/org/opensearch/jobscheduler/sweeper/JobSweeper.java#L262

ScheduledJobParameter jobParameter = provider.getJobParser().parse(parser, docId, jobDocVersion);

See relevant lines in AnomalyDetectorJob.parse that parses out the schedule: https://github.com/opensearch-project/anomaly-detection/blob/main/src/main/java/org/opensearch/ad/model/AnomalyDetectorJob.java#L192-L194

case SCHEDULE_FIELD: // SCHEDULE_FIELD = schedule
    schedule = ScheduleParser.parse(parser); // ScheduleParser is imported from JS
    break;

Note: The job parser also parses the user from the persistence document for the job in the ad jobs index.

Below is an example from reporting of what this schedule field looks like in a document that represents a job definition: https://github.com/opensearch-project/reporting/blob/main/src/main/kotlin/org/opensearch/reportsscheduler/model/ReportDefinition.kt#L43-L52

 *       "schedule":{ // required when triggerType is CronSchedule or IntervalSchedule
 *           "cron":{ // required when triggerType is CronSchedule
 *               "expression":"0 * * * *",
 *               "timezone":"PST"
 *           },
 *           "interval":{ // required when triggerType is IntervalSchedule
 *               "start_time":1603506908773,
 *               "period":10,
 *               "unit":"Minutes"
 *           }
 *       }  

a few lines down it also calls getJobRunner(): https://github.com/opensearch-project/job-scheduler/blob/main/src/main/java/org/opensearch/jobscheduler/sweeper/JobSweeper.java#L267

ScheduledJobRunner jobRunner = this.indexToProviders.get(shardId.getIndexName()).getJobRunner();

The JobRunner is a singleton class that implements JS' ScheduledJobRunner: https://github.com/opensearch-project/job-scheduler/blob/main/spi/src/main/java/org/opensearch/jobscheduler/spi/ScheduledJobRunner.java

This is a simple interface that has one method to implement: runJob

For getJobType AD simply returns opendistro_anomaly_detector.

cwperks commented 1 year ago

@peternied @opensearch-project/security Thank you for the help with the decision to go with just-in-time tokens for the design. Just-in-time tokens will follow a very similar flow to extension REST handlers, which @RyanL1997 is leading, so there will be a great amount of re-use. I have some follow-up decisions I'd like to make to finish scoping out work items, and here is what I propose to get this accomplished.

tldr; version: Since jobs are created via extension REST handlers, I am proposing that, for handlers responsible for creating jobs, the security plugin provide a refresh token to the extension to store with the job's schedule definition.

On job scheduler invocation, JS will ask security to issue a new access token given this refresh token. The access token will be forwarded to the extension so that the job can run.

The refresh token cannot be used for REST requests; its only use is for JS to request a new access token on behalf of a user.


Longer version

Extensions will have a special index to store job definitions in. Part of each job definition is information about the schedule for JS. As part of this change, the job definition will also store the refresh token as part of the schedule.

 *       "schedule":{ // required when triggerType is CronSchedule or IntervalSchedule
 *           "refresh_token": "<refresh_token>",
 *           "cron":{ // required when triggerType is CronSchedule
 *               "expression":"0 * * * *",
 *               "timezone":"PST"
 *           },
 *           "interval":{ // required when triggerType is IntervalSchedule
 *               "start_time":1603506908773,
 *               "period":10,
 *               "unit":"Minutes"
 *           }
 *       }

JS will parse this schedule, including the refresh token. A new interface will be exposed in core via the IdentityPlugin interface, allowing JS to request the issuance of a new access token to forward to the extension.

Initially, I am thinking that refresh tokens are one-time use JWTs that have a claim that identifies it as a refresh token. The encrypted (mapped) roles and backend roles will be claims in the refresh token so that they can be used to add the claims back into the new access token.
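A sketch of how those claims might be laid out follows. The claim names are illustrative, not a settled format, and a real implementation would sign these as JWTs rather than pass raw maps:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed refresh-token claims: a claim marking
// the token as a refresh token, plus encrypted role claims that are copied
// into the new access token on redemption. Claim names are illustrative only.
class RefreshTokenClaims {
    static Map<String, String> buildRefreshClaims(String subject, String encryptedRoles, String encryptedBackendRoles) {
        Map<String, String> claims = new HashMap<>();
        claims.put("sub", subject);
        claims.put("typ", "refresh");             // identifies this as a refresh token
        claims.put("er", encryptedRoles);         // encrypted (mapped) roles
        claims.put("ebr", encryptedBackendRoles); // encrypted backend roles
        return claims;
    }

    // When the refresh token is redeemed, the role claims carry over into the
    // new access token so privileges can be evaluated without re-resolution.
    static Map<String, String> buildAccessClaims(Map<String, String> refreshClaims) {
        Map<String, String> claims = new HashMap<>(refreshClaims);
        claims.put("typ", "access");
        return claims;
    }
}
```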

One of the challenges I am currently facing is how JS and the security plugin will interact so that JS can request the issuance of a new access token. This is a preliminary design; I will provide more details shortly.

To implement the refresh token as one-time use, I propose storing valid refresh tokens in a special security index, with documents keyed by the job ID:

/opendistro_security_refresh_tokens
{
  "id": 1,
  "jobId": "<jobId>",
  "currentRefreshToken": "<current_refresh_token>"
},
{
  "id": 2,
  "jobId": "<jobId2>",
  "currentRefreshToken": "<current_refresh_token>"
},
...
peternied commented 1 year ago

tldr; version: Since jobs are created via extension REST handlers, I am proposing that, for handlers responsible for creating jobs, the security plugin provide a refresh token to the extension to store with the job's schedule definition.

It sounds like this is a feature along the lines of: "If Job Scheduler triggers an extension to run a job, and the job takes longer than the expiration time on its JIT token, how does the extension communicate the job state to the cluster?"

I think the answer will depend on what controls exist on the JIT expiry value, how administrators of the cluster want to allow late/longer-running jobs to finish (fail/continue?), and how important this scenario is to extension developers. What do you think about recording your design considerations for this feature separately?

cwperks commented 1 year ago

@peternied I will add an issue about how to determine how long a token should be valid. For some jobs there is an expected run time, which is what I am initially thinking of using to determine the token's expiration. The security plugin will enforce a maximum number of minutes that auth tokens can be valid for.
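That expiry rule can be sketched in a single line (illustrative names; the actual cap would be a security plugin setting):

```java
// Minimal sketch of the proposed expiry rule: use the job's expected run time,
// capped by a security-plugin-enforced maximum. Names are illustrative.
class TokenExpiry {
    static long tokenValidityMinutes(long expectedRunTimeMinutes, long maxAllowedMinutes) {
        return Math.min(expectedRunTimeMinutes, maxAllowedMinutes);
    }
}
```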