project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

Research - Implementing My Collections with MarkLogic #322

Open gigamorph opened 1 month ago

gigamorph commented 1 month ago

Problem Description

The team has decided to implement LUX-specific "My Collections" with MarkLogic and its API, instead of having a standalone application.

Expected Behavior/Solution

Requirements

Questions

Problems/Possible Blockers

Related links

MarkLogic Documentation

brent-hartwig commented 1 month ago

@gigamorph,

Is there any interest in leveraging the authentication work you already did within LUX's middle tier and then continuing to use a service account into MarkLogic?

Are you planning to store this data in a separate database from LUX's content (JSON-LD)? There are advantages to storing it in the same database, but we'd have to be careful not to lose this new data when we reload the JSON-LD.

cc: @clarkepeterf

gigamorph commented 1 month ago

@brent-hartwig,

The direction we are taking, after the team meeting where @azaroth42 and @clarkepeterf were present among others, is to use MarkLogic as both the data store and the API provider for My Collections, with AWS Cognito authentication, which is essentially OIDC/OAuth2.

Besides the question of whether MarkLogic can support this flow (if it can't, we can fall back on the middle tier as you suggested), @clarkepeterf and I have identified another problem. We need this PRD "My Collections" database/API up and running constantly (with minimal downtime and appropriate notification to users), with user-initiated updates reflected in real time. That doesn't jibe at all with our current blue/green deployment scheme, where we load up a complete set of non-PRD data that is not affected by user actions at all, and then send it into the PRD environment.

Before discussing this "currency" or "synchronization" issue in the team meeting, we did want to tap you for any insights you may have regarding it, too.

brent-hartwig commented 1 month ago

@gigamorph, there are several MarkLogic features to keep a couple databases in sync, including database replication, flexible replication, scheduled tasks, and backup/restore.

We'll want to keep a few things in mind while comparing them and any others that come up:

  1. Should the target database become unavailable for a period of time, will the solution automatically play catch up once it can?
  2. What about edits made immediately before, during, and after the blue-green switch? Taking backup for example, users would be none too happy if the last backup excluded their edit or the restore overwrote their edit.
  3. Which solution would be least error-prone and most straightforward to implement and maintain? We could implement a standalone process to perform the sync but what advantages would it offer?
  4. What system monitoring tests should we employ to ensure the solution is functioning as expected?

Here's a comparison of ML's two replication types: https://docs.marklogic.com/guide/database-replication/dbrep_intro#id_92346. Spoiler alert: If you want the new docs in the existing LUX content database, database replication is out.

With regard to automatically synchronizing after the target database comes back online, I didn't quickly find documentation on how database replication handles this scenario, but I expect it would. For flexible replication, your content processing framework (CPF) pipeline would have to account for it. A scheduled task could wake up every minute or so, maintain a last-sync timestamp, and play catch up when needed. Backup and restore could also be employed, whereby full backups could be created frequently but only restored during the switch.

With regard to edits in close proximity to the switch, it could be tricky for both replication types and the backup/restore route. The scheduled task approach may support this scenario best. Let's say Green just became PROD. After all manual switching is otherwise complete, we can tell Blue to stop sending its docs to Green and tell Green to start sending its docs to Blue. I'd recommend a script or Gradle task that ensures the scheduled task fires one last time before it is disabled in one environment and enabled in the other.
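
The scheduled-task idea above can be sketched roughly as follows. This is only an in-memory simulation of the logic, not MarkLogic code: the `Map`-based "databases", the `lastModified` field, and the `syncOnce` and `switchDirection` names are all assumptions made for illustration. It also assumes both environments share a comparable clock, and it ignores deletions and conflicting edits.

```javascript
// In-memory stand-in for a My Collections database (hypothetical):
// maps a document URI to { content, lastModified }.
function makeDb() {
  return new Map();
}

// Copy every document modified after state.lastSync from source to target,
// then advance lastSync to the newest timestamp that was copied.
// Returns the number of documents copied.
function syncOnce(source, target, state) {
  let copied = 0;
  let watermark = state.lastSync;
  for (const [uri, doc] of source) {
    if (doc.lastModified > state.lastSync) {
      target.set(uri, { ...doc });
      copied += 1;
      if (doc.lastModified > watermark) watermark = doc.lastModified;
    }
  }
  state.lastSync = watermark;
  return copied;
}

// Blue/green switch: fire the sync one last time in the old direction,
// then return a link that pushes edits the other way.
function switchDirection(link) {
  syncOnce(link.source, link.target, link.state);
  return {
    source: link.target,
    target: link.source,
    // Everything at or before this timestamp already exists on both sides.
    state: { lastSync: link.state.lastSync },
  };
}
```

Because the timestamp only advances to what was actually copied, a run that finds the target unavailable could simply be skipped and the next run would pick up the backlog.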

Glancing at Yale's ML license, the license allows for all of the above-mentioned features.

I'm happy to run this by home base for validation and/or other options.

gigamorph commented 1 month ago

Opened a ML support ticket for Cognito auth flow: https://help.marklogic.com/Tickets/Ticket/View/37337

gigamorph commented 1 month ago

Key points about authentication:

gigamorph commented 1 month ago

Some key settings for OAuth in MarkLogic Admin:

External Security:

REST App Server:

Role (lux-endpoint-consumer):

gigamorph commented 1 month ago

Sample request with curl:

curl -i -H "Authorization: Bearer ${TOKEN}" "${URL}"

where ${TOKEN} is the access token obtained from Cognito after login, and ${URL} is the MarkLogic endpoint, e.g. http://localhost:8003/ds/lux/advancedSearchConfig.mjs
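
For reference, a rough JavaScript equivalent of the curl request, as the middle tier might issue it. The `buildAuthorizedRequest` helper is a hypothetical name, and the usage lines assume Node 18+ for the global `fetch`.

```javascript
// Build the URL and fetch() options for a MarkLogic request that carries
// the Cognito access token as a bearer token (hypothetical helper).
function buildAuthorizedRequest(url, token) {
  if (!token) throw new Error('An access token is required');
  return {
    url,
    options: {
      method: 'GET',
      headers: { Authorization: `Bearer ${token}` },
    },
  };
}

// Usage (not executed here):
// const { url, options } = buildAuthorizedRequest(process.env.URL, process.env.TOKEN);
// const response = await fetch(url, options);
// console.log(response.status);
```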

brent-hartwig commented 1 month ago

My notes from our 30 Sep meeting plus subsequent thoughts and requirement clarification:

  1. What does search in My Collection mean? Per requirement clarification on 3 Oct, it is limited to the individual documents referenced by the collection and excludes searches saved in the collection. Ideally, however, the implementation will not preclude the ability to support restricting a collection search to its saved searches (criteria thereof).
  2. Want to avoid an implementation that could enable a single user's action to update many documents, such as adding the documents from a large search result set into a MarkLogic collection (or adding what would boil down to user-specific permissions). Not only could this introduce merge activity in the background, it would also complicate overlaying this information onto an updated dataset prior to the blue/green switch (more on this below).
  3. We are to support multiple collections per user.
  4. We are to support a user adding or removing users from collections. TBD who would have these permissions (e.g., would the permissions be limited to the user that initially created the collection?).

  5. For user-specific authentication, the current line of thought is that the middle tier would need to route those requests (login + those with an authentication token) to a MarkLogic app server that is configured to Yale's authentication server.
  6. Only requests through the external authentication app server would be allowed to modify My Collections. We may be able to do this by checking details on the app server the request is being processed through.
  7. The MarkLogic code would be responsible for enforcing which My Collection data the authenticated user is allowed to modify.
  8. As some insurance from coding errors, we could create a role per user and grant the write permission to the user's documents. I'd want to have a better idea of how many users we anticipate before recommending this. My default is that it is not necessary.
  9. Yale's authentication server may assign a common group to everyone within, which would allow us to map that group to a role not given to service accounts. Restricting My Collection data to that role would further protect the data from being modified by the service accounts --presently the service accounts do not have the ability to insert or update any documents anyway.
  10. We brought up the idea to consolidate the two existing ML app servers into one. We introduced the second in the context of the now invalidated performance test. The net change could be:
    • All user-authentication requests go through an app server configured with external security.
    • All other requests go through one of the remaining two application servers --configured to internal security.

  11. A separate database could be used to store the My Collection data. Database-level replication could then be employed to be compatible with blue/green switches. Similar to scheduled jobs, the process would need to tell one database to stop pushing its edits and the other to start.
  12. Endpoint consumers do not presently have the execute privilege required to jump databases; however, this could be allowed using amps.
  13. How to ensure the user data is compatible with dataset updates. For example, LUX URIs and thus IRIs can change. Equivalent IDs may be the answer. Are there any other scenarios?

  14. Seong brought up caching. We'll want to look into ignoring read-only requests that are identical less the authentication token. It'll likely need to be more selective than that.
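
On the caching point (item 14), one possible shape for a cache key that leaves out the authentication token while staying selective is sketched below. The `cacheKeyFor` function and the `USER_INDEPENDENT_PATHS` allow-list are assumptions for illustration; which endpoints are truly safe to share across users would have to be decided per endpoint.

```javascript
// Hypothetical allow-list of endpoint paths whose responses are assumed
// not to depend on which user is asking.
const USER_INDEPENDENT_PATHS = new Set(['/ds/lux/advancedSearchConfig.mjs']);

// Return a cache key for a request, or null when the response should not
// be shared. The Authorization header is deliberately not part of the key.
function cacheKeyFor(request) {
  const isReadOnly = request.method === 'GET';
  const { pathname, searchParams } = new URL(request.url);
  if (!isReadOnly || !USER_INDEPENDENT_PATHS.has(pathname)) return null;
  // Sort parameters by name so equivalent requests share one key.
  const params = [...searchParams.entries()].sort(([a], [b]) => a.localeCompare(b));
  return `${request.method} ${pathname}?${new URLSearchParams(params).toString()}`;
}
```

Two requests that differ only in their bearer token produce the same key, while writes and unlisted endpoints bypass the cache entirely.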

brent-hartwig commented 3 weeks ago

@clarkepeterf and @gigamorph, I changed the status of this ticket from Forming to In Progress because it is labeled as a research ticket (and research is underway). What do you consider necessary to complete this ticket? I propose we close it once we deem the feature technically feasible (no known obstacles) and have a draft list of backend implementation tasks that could become implementation tickets. I'd also like to introduce a label for this feature, such as "my collections".

cc: @prowns, @jffcamp, @roamye

gigamorph commented 3 weeks ago

@brent-hartwig Submitted the "idea" for the JWKS URI feature at https://progressdataplatform.ideas.aha.io/ideas/ML-I-75

brent-hartwig commented 3 weeks ago

Action items from a meeting with @gigamorph:

While waiting for the JWKS URI feature to be implemented, we may need to employ a workaround, to automatically keep the JWKS public key configuration current in ML.
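
One piece of such a workaround might be a periodic comparison between the key IDs configured in ML and the keys Cognito currently publishes. A minimal sketch follows, assuming the standard JWKS shape (`{ keys: [{ kid, ... }] }`); `diffJwks` is a hypothetical helper, and how the resulting changes would be applied to MarkLogic's external security configuration is deliberately left out.

```javascript
// Compare the key IDs currently configured in MarkLogic with a freshly
// fetched JWKS document and report what would need to change.
function diffJwks(configuredKids, jwks) {
  const fetched = new Map(jwks.keys.map((key) => [key.kid, key]));
  const configured = new Set(configuredKids);
  return {
    // Keys published by the identity provider but not yet configured.
    toAdd: [...fetched.values()].filter((key) => !configured.has(key.kid)),
    // Configured key IDs the identity provider no longer publishes.
    toRemove: configuredKids.filter((kid) => !fetched.has(kid)),
  };
}

// Usage sketch (not executed here; Node 18+). region, userPoolId, and
// kidsCurrentlyInMarkLogic are placeholders:
// const url = `https://cognito-idp.${region}.amazonaws.com/${userPoolId}/.well-known/jwks.json`;
// const jwks = await (await fetch(url)).json();
// const { toAdd, toRemove } = diffJwks(kidsCurrentlyInMarkLogic, jwks);
```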

gigamorph commented 3 weeks ago

All requests from the middle tier currently share a single MarkLogicClient to access an ML port. Under the OAuth scheme, it seems we need to create a new client instance for every request. Since it is all HTTP REST calls at the lower level anyway, I don't think it should have any significant impact on performance. Hopefully that is the case.

brent-hartwig commented 2 weeks ago

@gigamorph, I don't know how much overhead there is in creating DatabaseClient instances either but we may be able to call setAuthToken on an existing DatabaseClient instance when the middle tier request includes such a token. Here's the DatabaseClient's API documentation: https://docs.marklogic.com/jsdoc/DatabaseClient.html.
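
If creating a client per request proves costly, and changing the token on one shared instance is a concern because it would presumably affect every in-flight request using that instance, a possible middle ground is to reuse one client per token. The sketch below is an assumption-laden illustration: `makeClientPool` is a hypothetical helper, and `createClient` is a caller-supplied factory standing in for however the middle tier actually builds its client, so no real MarkLogic API is called here.

```javascript
// Keep at most maxSize clients, one per access token, evicting the least
// recently used. createClient(token) is a hypothetical caller-supplied factory.
function makeClientPool(createClient, maxSize = 100) {
  const clients = new Map(); // token -> client, oldest entry first
  return {
    get(token) {
      if (clients.has(token)) {
        // Re-insert to mark this token as most recently used.
        const existing = clients.get(token);
        clients.delete(token);
        clients.set(token, existing);
        return existing;
      }
      const created = createClient(token);
      clients.set(token, created);
      if (clients.size > maxSize) {
        // A Map iterates in insertion order, so the first key is the oldest.
        clients.delete(clients.keys().next().value);
      }
      return created;
    },
    size() {
      return clients.size;
    },
  };
}
```

Tokens expire, so a real version would probably also drop entries once their token is no longer valid.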