project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

Research - Implementing My Collections with MarkLogic #322

Open gigamorph opened 2 months ago

gigamorph commented 2 months ago

Problem Description

The team has decided to implement LUX-specific "My Collections" with MarkLogic and its API, instead of having a standalone application.

Expected Behavior/Solution

Requirements

Questions

Problems/Possible Blockers

Related links

MarkLogic Documentation

brent-hartwig commented 2 months ago

@gigamorph,

Is there any interest in leveraging the authentication work you already did within LUX's middle tier and then continuing to use a service account into MarkLogic?

Are you planning to store this data in a separate database from LUX's content (JSON-LD)? There are advantages to storing them in the same database, but we'd have to be careful not to lose this new data when we reload the JSON-LD.

cc: @clarkepeterf

gigamorph commented 2 months ago

@brent-hartwig,

The direction we are taking after the team meeting, where @azaroth42 and @clarkepeterf were present among others, is to use MarkLogic as both the data store and the API provider for My Collections, using AWS Cognito authentication, which is essentially OIDC/OAuth2.

Apart from whether MarkLogic can support this flow (if it can't, we can fall back on the middle tier as you suggested), @clarkepeterf and I have identified another problem. We need this PRD "My Collections" database/API up and running constantly (with minimal downtime and appropriate notification to users), with real-time currency of user-initiated updates. That doesn't jibe at all with our current blue/green deployment scheme, where we load up a complete set of non-PRD data that is not affected by user actions at all and then send it into the PRD environment.

Before discussing this "currency" or "synchronization" issue in the team meeting, we did want to tap you for any insights you may have regarding it, too.

brent-hartwig commented 2 months ago

@gigamorph, there are several MarkLogic features that can keep a couple of databases in sync, including database replication, flexible replication, scheduled tasks, and backup/restore.

We'll want to keep a few things in mind while comparing them and any others that come up:

  1. Should the target database become unavailable for a period of time, will the solution automatically play catch up once it can?
  2. What about edits made immediately before, during, and after the blue-green switch? Taking backup for example, users would be none too happy if the last backup excluded their edit or the restore overwrote their edit.
  3. Which solution would be least error-prone and most straightforward to implement and maintain? We could implement a standalone process to perform the sync but what advantages would it offer?
  4. What system monitoring tests should we employ to ensure the solution is functioning as expected?

Here's a comparison of ML's two replication types: https://docs.marklogic.com/guide/database-replication/dbrep_intro#id_92346. Spoiler alert: If you want the new docs in the existing LUX content database, database replication is out.

With regard to automatically synchronizing after the target database comes back online, I didn't quickly find documentation on how database replication handles this scenario, but I expect it would. For flexible replication, your content processing framework (CPF) pipeline would have to account for it. A scheduled task could wake up every minute or so, maintain a last-sync timestamp, and play catch-up when needed. Backup and restore could also be employed, whereby full backups are created frequently but only restored during the switch.
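To make the scheduled-task option concrete, here is a minimal sketch of one catch-up pass driven by a last-sync timestamp. Everything in it is an assumption for illustration: `changedBetween`, `write`, `makeMemoryStore`, and the `state` object are hypothetical stand-ins, not MarkLogic APIs, and a real task would read and write My Collections documents in the two environments.

```javascript
// One pass of a hypothetical scheduled sync task.
// `source.changedBetween` and `target.write` are illustrative stand-ins.
function syncOnce(source, target, state, now) {
  // Everything modified after the last successful pass, up to `now`.
  const changed = source.changedBetween(state.lastSync, now);
  for (const doc of changed) {
    target.write(doc); // throws if the target is unavailable
  }
  // Advance the timestamp only after every write succeeded, so a failed
  // pass is retried in full the next time the task wakes up.
  state.lastSync = now;
  return changed.length;
}

// In-memory store used only to exercise the sketch.
function makeMemoryStore() {
  const docs = new Map();
  return {
    available: true,
    docs,
    write(doc) {
      if (!this.available) throw new Error("target unavailable");
      docs.set(doc.uri, doc);
    },
    changedBetween(after, upTo) {
      return [...docs.values()].filter(
        (d) => d.modified > after && d.modified <= upTo
      );
    },
  };
}
```

Because the timestamp only moves forward after a complete pass, a target outage simply widens the window covered by the next successful pass, which is the "play catch-up" behavior asked about in question 1.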

With regard to edits in close proximity to the switch, it could be tricky for both replication types and the backup/restore route. The scheduled task approach may support this scenario best. Let's say Green just became PROD. After all manual switching is otherwise complete, we can tell Blue to stop sending its docs to Green and tell Green to start sending its docs to Blue. I'd recommend a script or Gradle task that ensures the scheduled task fires one last time before it is disabled in one environment and enabled in the other.
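The ordering of that handover can be sketched as follows. The `tasks` object and its `runNow`, `disable`, and `enable` methods are hypothetical; they stand in for however the scheduled tasks end up being managed (a script, a Gradle task, or something else).

```javascript
// Hypothetical handover after a blue/green switch.
// oldProd: the environment that just left PROD (e.g. "blue").
// newProd: the environment that just became PROD (e.g. "green").
function handOver(tasks, oldProd, newProd) {
  // 1. Fire the old direction one last time so late edits are not lost.
  //    If this throws, nothing below runs and the old task stays enabled.
  tasks.runNow(oldProd, newProd);
  // 2. Stop the old PROD from sending its docs to the new PROD.
  tasks.disable(oldProd, newProd);
  // 3. Start the new PROD sending its docs to the old PROD.
  tasks.enable(newProd, oldProd);
}
```

Running the final pass before disabling anything means a failure leaves the previous direction intact, so the handover can be retried.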

Glancing at Yale's ML license, the license allows for all of the above-mentioned features.

I'm happy to run this by home base for validation and/or other options.

gigamorph commented 2 months ago

Opened a ML support ticket for Cognito auth flow: https://help.marklogic.com/Tickets/Ticket/View/37337

gigamorph commented 1 month ago

Key points about authentication:

gigamorph commented 1 month ago

Some key settings for OAuth in MarkLogic Admin:

External Security:

REST App Server:

Role (lux-endpoint-consumer):

gigamorph commented 1 month ago

Sample request with curl:

curl -i -H "Authorization: Bearer ${TOKEN}" "${URL}"

where ${TOKEN} is the access token obtained from Cognito after login, and ${URL} is the MarkLogic endpoint, e.g. http://localhost:8003/ds/lux/advancedSearchConfig.mjs
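For callers that are not using curl, the same request can be sketched in JavaScript. This assumes a runtime with a global `fetch` (for example Node.js 18 or later); `buildRequest` and `callEndpoint` are illustrative names, not part of LUX or MarkLogic.

```javascript
// Build the same request the curl example sends: a GET carrying the
// Cognito access token as a Bearer token.
function buildRequest(url, token) {
  if (!token) throw new Error("an access token is required");
  return {
    url,
    options: {
      method: "GET",
      headers: { Authorization: `Bearer ${token}` },
    },
  };
}

// Send the request. `fetchImpl` defaults to the global fetch and can be
// swapped out for testing.
async function callEndpoint(url, token, fetchImpl = fetch) {
  const { options } = buildRequest(url, token);
  const response = await fetchImpl(url, options);
  return { status: response.status, body: await response.text() };
}
```

Usage would look like `callEndpoint("http://localhost:8003/ds/lux/advancedSearchConfig.mjs", token)`, with `token` obtained from Cognito after login.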

brent-hartwig commented 1 month ago

My notes from our 30 Sep meeting plus subsequent thoughts and requirement clarification:

  1. What does search in My Collection mean? Per requirement clarification on 3 Oct, it is limited to the individual documents referenced by the collection and excludes searches saved in the collection. Ideally, however, the implementation will not preclude later support for constraining a collection search to its saved searches (the criteria thereof).
  2. Want to avoid an implementation that could enable a single user's action to update many documents, such as adding the documents from a large search result set into a MarkLogic collection (or adding what would boil down to user-specific permissions). Not only could this introduce merge activity in the background, it would also complicate overlaying this information onto an updated dataset prior to the blue/green switch (more on this below).
  3. We are to support multiple collections per user.
  4. We are to support a user adding or removing users from collections. TBD who would have these permissions (e.g., would the permissions be limited to the user that initially created the collection?).

  5. For user-specific authentication, the current line of thought is that the middle tier would need to route those requests (login + those with an authentication token) to a MarkLogic app server that is configured to Yale's authentication server.
  6. Only requests through the external authentication app server would be allowed to modify My Collections. We may be able to do this by checking details on the app server the request is being processed through.
  7. The MarkLogic code would be responsible for enforcing which My Collection data the authenticated user is allowed to modify.
  8. As some insurance from coding errors, we could create a role per user and grant the write permission to the user's documents. I'd want to have a better idea of how many users we anticipate before recommending this. My default is that it is not necessary.
  9. Yale's authentication server may assign a common group to everyone within, which would allow us to map that group to a role not given to service accounts. Restricting My Collection data to that role would further protect the data from being modified by the service accounts, although presently the service accounts do not have the ability to insert or update any documents anyway.
  10. We brought up the idea to consolidate the two existing ML app servers into one. We introduced the second in the context of the now invalidated performance test. The net change could be:
    • All user-authentication requests go through an app server configured with external security.
    • All other requests go through one of the remaining two application servers, which are configured for internal security.

  11. A separate database could be used to store the My Collection data. Database-level replication could then be employed to be compatible with blue/green switches. Similar to scheduled jobs, the process would need to tell one database to stop pushing its edits and the other to start.
  12. Endpoint consumers do not presently have the execute privilege required to jump databases; however, this could be allowed using amps.
  13. How to ensure the user data is compatible with dataset updates. For example, LUX URIs and thus IRIs can change. Equivalent IDs may be the answer. Are there any other scenarios?

  14. Seong brought up caching. We'll want to look into ignoring read-only requests that are identical less the authentication token. It'll likely need to be more selective than that.
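As a starting point for the caching note in item 14, here is a minimal sketch of a cache key that treats read-only requests as identical when they differ only in the authentication token. The request shape (`method`, `url`, `headers`, `body`, `readOnly`) is an assumption for illustration, and as noted above a real implementation will likely need to be more selective, for example by excluding requests whose results depend on the user.

```javascript
// Derive a cache key that ignores the Authorization header.
// Returns null for requests that should never be served from cache.
function cacheKey(req) {
  // Only requests explicitly marked read-only are candidates.
  if (req.readOnly !== true) return null;
  const headers = {};
  for (const [name, value] of Object.entries(req.headers || {})) {
    const lower = name.toLowerCase();
    if (lower === "authorization") continue; // the token must not affect the key
    headers[lower] = value;
  }
  // Sort header names so that header order does not change the key.
  const sortedHeaders = Object.keys(headers)
    .sort()
    .map((name) => [name, headers[name]]);
  return JSON.stringify([req.method, req.url, sortedHeaders, req.body ?? null]);
}
```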

brent-hartwig commented 1 month ago

@clarkepeterf and @gigamorph, I changed the status of this ticket from Forming to In Progress because it is labeled as a research ticket (and research is underway). What do you consider necessary to complete this ticket? I propose closing it once we deem the feature technically feasible (no known obstacles) and have a draft list of backend implementation tasks that could become implementation tickets. I'd also like to introduce a label for this feature, such as "my collections".

cc: @prowns, @jffcamp, @roamye

gigamorph commented 1 month ago

@brent-hartwig Submitted the "idea" for the JWKS URI feature at https://progressdataplatform.ideas.aha.io/ideas/ML-I-75

brent-hartwig commented 1 month ago

Action items from a meeting with @gigamorph:

While waiting for the JWKS URI feature to be implemented, we may need to employ a workaround that automatically keeps the JWKS public key configuration current in ML.
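One piece of such a workaround is detecting when the published keys no longer match what is configured. The sketch below compares key IDs (`kid`) from a JWKS document, which per the JWK Set format is an object with a `keys` array, against the key IDs currently configured. How the configured key IDs are read from MarkLogic and how the configuration is updated are deliberately left out, since those details are not settled in this thread; `diffJwks` is a hypothetical helper name.

```javascript
// Compare the key IDs currently configured against a freshly fetched JWKS.
// configuredKids: array of key ID strings known to the configuration.
// jwks: parsed JWKS document, e.g. { keys: [{ kid: "...", kty: "RSA", ... }] }.
function diffJwks(configuredKids, jwks) {
  const published = new Set((jwks.keys || []).map((key) => key.kid));
  const configured = new Set(configuredKids);
  const added = [...published].filter((kid) => !configured.has(kid)).sort();
  const removed = [...configured].filter((kid) => !published.has(kid)).sort();
  return { added, removed, changed: added.length > 0 || removed.length > 0 };
}
```

A periodic job could fetch the user pool's JWKS, call `diffJwks`, and only touch the MarkLogic configuration when `changed` is true.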

gigamorph commented 1 month ago

All requests from the middle tier currently share a single MarkLogicClient to access an ML port. Under the OAuth scheme, it seems we need to create a new client instance for every request. Since it is all HTTP REST calls at the lower level anyway, I don't expect a significant performance hit, but that remains to be confirmed.

brent-hartwig commented 1 month ago

@gigamorph, I don't know how much overhead there is in creating DatabaseClient instances either but we may be able to call setAuthToken on an existing DatabaseClient instance when the middle tier request includes such a token. Here's the DatabaseClient's API documentation: https://docs.marklogic.com/jsdoc/DatabaseClient.html.

gigamorph commented 5 days ago

Created an app server in DEV at port 8007, named "lux-request-group-oauth-experiment".

Config: https://lux-ml-dev.collections.yale.edu:8000/manage/v2/servers/lux-request-group-oauth-experiment/properties?format=json&group-id=Default -> I don't think there's any sensitive information in the config.

External Security was added, named "chit-cognito-experiment", and external roles are mapped to the "lux-endpoint-consumer" role: "lux-service" represents the service account that the middle tier uses, and "lux-users" represents all other users.

You can access the frontend at https://lux-front-exp.collections.yale.edu. If you're signed in, MarkLogic will see the signed-in user. If not, MarkLogic will see the username of the service account.

Overall, the current PoC is using the Cognito service that I have created in our own (CHIT) account. Thus you need to ask me to create an account if you want to experiment with logging in to the frontend.

The Library Cognito service, it turns out, is not returning groups for non-CAS users, so it doesn't work with the current PoC code.

brent-hartwig commented 4 days ago

Thanks for the update, @gigamorph. Are there any current blockers? I think we're okay so long as Cognito authenticates the user and provides the token that ML can then validate and allow the request in. If that's the case and we go by a service account naming convention, the backend code can then decide if the user can partake in My Collections functionality.

gigamorph commented 4 days ago

@brent-hartwig - no blockers come to mind currently, but then I have no idea how MarkLogic will utilize the token it receives, and what information it will require (username, groups, and/or ?).
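To see what a given token actually carries, its payload can be decoded and inspected. The sketch below is for inspection only: it does not verify the signature, so it must not be used to make access decisions. It assumes Node.js (for `Buffer` with `base64url` support) and the claim names Cognito documents for its tokens (`username` or `cognito:username`, `cognito:groups`, `sub`); which of these MarkLogic relies on is exactly the open question above.

```javascript
// Decode the payload (middle segment) of a JWT without verifying it.
function decodeJwtPayload(token) {
  const parts = token.split(".");
  if (parts.length !== 3) throw new Error("not a JWT: expected three segments");
  return JSON.parse(Buffer.from(parts[1], "base64url").toString("utf8"));
}

// Pull out the claims most relevant to the username/groups question.
function summarizeClaims(token) {
  const payload = decodeJwtPayload(token);
  return {
    subject: payload.sub ?? null,
    username: payload.username ?? payload["cognito:username"] ?? null,
    groups: payload["cognito:groups"] ?? [],
  };
}
```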

brent-hartwig commented 4 days ago

@gigamorph, would you configure a Cognito account for me and let me know what group I'm in? This should boil down to assigning an external store's group to a MarkLogic role.

brent-hartwig commented 3 days ago

@gigamorph, it looks like all the pieces are in place. When I log into https://lux-front-exp.collections.yale.edu/ using the account you set up for me, these entries appear in 8007_AccessLog.txt:

External User(4498b4b8-00b1-7094-c28a-c6436671ce2a) is Mapped to Temp User(4498b4b8-00b1-7094-c28a-c6436671ce2a) with Role(s): lux-endpoint-consumer
10.5.157.158 - 4498b4b8-00b1-7094-c28a-c6436671ce2a [22/Nov/2024:16:27:46 +0000] "POST /ds/lux/stats.mjs HTTP/1.1" 200 251 - -

Pieces/steps:

  1. Configuring the "chit-cognito-experiment" external security profile.
  2. Defining an app server configured to the above profile.
  3. Updating the definition of the lux-endpoint-consumer MarkLogic role to be associated to the lux-service and lux-users external security groups.
  4. Configuring a user account in Cognito that belongs to the lux-service and lux-users (Cognito) groups.

Because the xdmp.getRequestUser* functions return values such as the following, our user naming convention idea won't work. Instead, we could create two roles that extend lux-endpoint-consumer, associating one with lux-service and one with lux-users, and then only support My Collection functionality on the latter.

2024-11-22 16:45:17.501 Info: User name: 4498b4b8-00b1-7094-c28a-c6436671ce2a
2024-11-22 16:45:17.501 Info: User id: 9053464274652740159

A way to restrict My Collection functionality by role is to grant one of the roles a new execute privilege and then require that privilege using https://docs.marklogic.com/xdmp.hasPrivilege or https://docs.marklogic.com/xdmp.securityAssert within the My Collection entry points / code base.
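A minimal sketch of such a guard follows. The privilege action URI is made up for illustration, and the `xdmp` object is passed in as a parameter so the sketch can run outside MarkLogic; inside a MarkLogic module the built-in `xdmp` global would be passed. It relies only on `xdmp.securityAssert(privileges, kind)` and `xdmp.hasPrivilege(privileges, kind)` with kind `"execute"`.

```javascript
// Hypothetical privilege action; the real URI would be defined when the
// execute privilege is created.
const MY_COLLECTIONS_PRIVILEGE =
  "http://example.org/lux/privileges/my-collections";

// Hard guard for entry points: securityAssert throws when the current user
// lacks the privilege, so the request fails before any work is done.
function requireMyCollections(xdmpApi) {
  xdmpApi.securityAssert(MY_COLLECTIONS_PRIVILEGE, "execute");
}

// Soft check for code that only needs to branch on the answer.
function canUseMyCollections(xdmpApi) {
  return xdmpApi.hasPrivilege(MY_COLLECTIONS_PRIVILEGE, "execute") === true;
}
```

With two roles extending lux-endpoint-consumer, only the role mapped to lux-users would be granted the privilege, so requests authenticated as the service account would fail the guard.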

We may find it necessary or desirable to grant the lux-endpoint-consumer additional privileges. For instance, using xdmp.userRoles requires the http://marklogic.com/xdmp/privileges/xdmp-user-roles privilege, which the lux-endpoint-consumer role does not presently have.