project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

Research ability to "slice" data in support of unit portals #73

Open brent-hartwig opened 3 months ago

brent-hartwig commented 3 months ago

Problem Description: Individual Yale libraries and museums would like to use LUX functionality in branded web portals backed by LUX's MarkLogic database, yet LUX does not presently have a mechanism to restrict its backend endpoints to a single unit's data (plus the data it shares with other units).

Expected Behavior/Solution: Backend endpoints are able to:

  1. Continue operating as we do today (i.e., all data: https://lux.collections.yale.edu/); or
  2. Only interact with data associated with a single unit (which may include internal documents, external documents, and documents shared with multiple units).

For no. 2, there may be some endpoints that need to execute in no. 1's mode. Search is not expected to be one.

Requirements:

  1. Support the above-listed expected behaviors.
  2. Backend is able to determine which unit(s) a document is associated with.
  3. Triple store and indexes also honor unit restrictions.
  4. Solution does not impose a performance penalty when consuming backend endpoints whether it be for a single unit or all units (LUX).
  5. There will be several follow-up requirements / TBDs once the feasibility study is further along; e.g., should the advanced search configuration endpoint be restricted to search terms that can return results for the requesting unit?
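As a way to picture requirements 1 and 2, here is a minimal sketch in plain JavaScript that runs outside MarkLogic. It assumes visibility is decided by read roles, where an account can see a document if it holds at least one role with read access to it. The `units` property, the `<unit>-reader` role naming, the `lux-reader` role, and the sample URIs are all hypothetical and are not taken from the LUX codebase.

```javascript
// Hypothetical role that can read every document (the "all data" mode).
const LUX_READER = 'lux-reader';

// Hypothetical rule: one reader role per associated unit, plus the LUX-wide role.
function readRolesFor(doc) {
  const unitRoles = (doc.units || []).map((unit) => `${unit}-reader`);
  return [LUX_READER, ...unitRoles];
}

// An account can read a document when it holds at least one of its read roles.
function canRead(accountRoles, doc) {
  const readRoles = new Set(readRolesFor(doc));
  return accountRoles.some((role) => readRoles.has(role));
}

// Invented sample documents.
const docs = [
  { uri: '/doc/1.json', units: ['ycba'] },
  { uri: '/doc/2.json', units: ['yuag'] },
  { uri: '/doc/3.json', units: ['ycba', 'yuag'] }, // shared by two units
  { uri: '/doc/4.json', units: [] }, // not tied to either unit
];

const visibleUris = (accountRoles) =>
  docs.filter((doc) => canRead(accountRoles, doc)).map((doc) => doc.uri);

console.log(visibleUris(['ycba-reader'])); // single-unit mode
console.log(visibleUris([LUX_READER])); // all-data mode
```

In this model the single-unit account sees its own and shared documents, while the LUX-wide account sees everything. Whether triples and indexes follow the same rule (requirement 3) is a separate question that this sketch does not address.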

Needed for promotion: If an item on the list is not needed, it should be crossed off but not removed.

UAT/LUX Examples: Search limited to a unit's records. Same search performed in LUX would return the same results plus those from other units that also meet the search criteria.

Dependencies/Blocks: A proof-of-concept may become blocked by a data change request.

Related Github Issues: None at this time.

Related links: None

Wireframe/Mockup: Outside the scope of this ticket.

brent-hartwig commented 3 months ago

Created the 73-research-data-slices branch off of main. Initially, everything is isolated within the research/data-slices directory (i.e., no runtime changes yet).

brent-hartwig commented 3 months ago

As of 14 Mar 24, one may set up the data slice PoC locally and prove out search. See https://github.com/project-lux/lux-marklogic/blob/73-research-data-slices/README-Data-Slice-PoC.md for details.

brent-hartwig commented 2 months ago

Status and Findings Updates

LUX by Unit tenant deployed within ML DEV on 1 May. The tenant has its own ML resources, including roles, and no forest-level replication. By 3 May, the YCBA and YUAG dataset was loaded and the entire stack was operational, passing all smoke tests but not yet thoroughly vetted. The middle tier was configured to use the YCBA service account. A next step for the team is to stand up at least one additional frontend, and either stand up another middle tier or extend the existing one with additional connection pools, so that two frontends are available at the same time, each configured to a different service account.

Made some edits to facilitate additional ML tenants. Removed portions of the PoC that became obsolete given the better dataset. Updated and trimmed down the PoC documentation. Updated checkTripleAndDocVisibilityByUser.js to align with the new dataset and added controls to deal with the larger dataset. Per the team's decision on approach, still intending to fork this repo, move the data slice edits there, then abandon the 73-research-data-slices branch. Frontend and middle tier repos to follow suit, should unit portal changes be made.

While the team begins its testing, starting in the frontends, I plan to double back to surface what other code changes may be necessary and highlight aspects to test. We're also a step closer to performance testing, although we may need to wait for more units to be folded in. YCBA and YUAG dataset stats:

// Docs not associated to the YCBA or YUAG units, yet available to LUX.
const uris = cts
  .uris(
    '',
    null,
    cts.andNotQuery(
      cts.collectionQuery('lux-by-unit'),
      cts.orQuery([cts.collectionQuery('yuag'), cts.collectionQuery('ycba')])
    )
  )
  .toArray();
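To make the selection logic of the query above explicit, here is a small plain JavaScript model of it that runs outside MarkLogic. It relies on cts.andNotQuery keeping what matches its first sub-query and not its second, and on cts.orQuery matching any of its sub-queries. The helper functions and sample URIs are invented for illustration and only imitate the cts functions of the same names.

```javascript
// In-memory stand-ins for the cts query constructors used above.
const collectionQuery = (name) => (doc) => doc.collections.includes(name);
const orQuery = (queries) => (doc) => queries.some((query) => query(doc));
const andNotQuery = (positive, negative) => (doc) =>
  positive(doc) && !negative(doc);

// Invented sample documents with their collections.
const sampleDocs = [
  { uri: '/a.json', collections: ['lux-by-unit', 'ycba'] },
  { uri: '/b.json', collections: ['lux-by-unit', 'yuag'] },
  { uri: '/c.json', collections: ['lux-by-unit', 'ycba', 'yuag'] },
  { uri: '/d.json', collections: ['lux-by-unit'] },
];

// Same shape as the query in the snippet above.
const query = andNotQuery(
  collectionQuery('lux-by-unit'),
  orQuery([collectionQuery('yuag'), collectionQuery('ycba')])
);

const unitlessUris = sampleDocs.filter(query).map((doc) => doc.uri);
console.log(unitlessUris.length, unitlessUris);
```

Only the document that sits in lux-by-unit without being in either unit collection is selected, which matches the comment at the top of the original snippet.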

The 593K docs loaded in 196 seconds (via MLCP with 64 threads), which extrapolates to 4.5 hours to load the full dataset of 41m docs. That would be about half the normal load time. Dropping 1x forest-level replication explains why the load was faster; this was confirmed when loading the full (non-sliced) dataset into SBX after dropping its forest-level replication. As for data slices, it should be noted that we introduced a transform that reaches into the documents to determine which roles to grant read permission to. We wondered if this would have a significant performance impact. Apparently not. Should this initiative reach TST, we'll be able to compare, as we intend to keep forest-level replication in TST and PRD.
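For anyone wanting to rerun the extrapolation with different inputs, the arithmetic can be sketched as below. It assumes throughput scales linearly and uses the rounded figures quoted above (593K docs, 196 seconds, 41m docs); the variable names are illustrative only.

```javascript
// Rounded figures from the comment above.
const loadedDocs = 593000;
const loadSeconds = 196;
const fullDatasetDocs = 41000000;

// Observed throughput, then a straight-line projection to the full dataset.
const docsPerSecond = loadedDocs / loadSeconds;
const projectedSeconds = fullDatasetDocs / docsPerSecond;
const projectedHours = projectedSeconds / 3600;

console.log(Math.round(docsPerSecond)); // about 3026 docs per second
console.log(projectedHours.toFixed(1)); // about 3.8 hours
```

Straight-line scaling of these rounded numbers lands near 3.8 hours, somewhat under the 4.5 hours quoted, so that estimate may reflect unrounded counts or overhead not captured here.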

More dataset stats! Actual counts via SPARQL:

Findings and/or draft conclusions via backend endpoints using different service accounts and scripts:

Lower-level functional findings derived from the hand-crafted mini dataset, some of which may be repetitive of the above:

brent-hartwig commented 6 days ago

Met w/ Jeff on 26 Jun 24. Received the green light to work the following into current priorities:

  1. Test and document the behaviors of the Hop Inverse search pattern.
  2. Reconcile differences between #108 (risk limited to events) and #119 (risk is more widespread), in the context of trimming dead-end triples from related lists.
  3. Investigate whether dead-end triples result in false positive semantic facet values.
  4. Unit-specific configuration files:
    • Identify search terms that are irrelevant to YPM.
    • Introduce means to restrict search terms (if not also search scopes*) by unit.
    • Modify the remaining search term generator to apply unit-specific settings and produce unit-specific versions.
    • Given some search terms are generated from facetsConfig.mjs, TBD whether we also need to restrict those by unit. This may be the tipping point to fold facetsConfig.mjs into searchTermConfig.mjs.
    • Modify remaining generators to create one output per unit-specific search term configuration.
    • Modify endpoints to use the correct configuration.

* Anticipated to be straightforward within the search term configuration file and generated outputs. TBD if searchScope.mjs needs to become unit-aware.
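As a rough illustration of item 4, restricting search terms by unit could look something like the sketch below. Every name in it is hypothetical: the configuration shape (scope, then term name, then settings), the `excludeTerms` setting, and the placeholder term names are invented and do not reflect the actual contents of searchTermConfig.mjs.

```javascript
// Hypothetical base configuration: scope name -> term name -> term settings.
const baseSearchTerms = {
  item: { name: {}, placeholderTermA: {} },
  work: { name: {}, placeholderTermB: {} },
};

// Hypothetical per-unit settings: terms to drop, listed per scope.
const unitSettings = {
  ypm: { excludeTerms: { item: ['placeholderTermA'] } },
};

// Produce a unit-specific copy of the configuration without mutating the base.
function searchTermsForUnit(base, settings, unit) {
  const exclude = (settings[unit] && settings[unit].excludeTerms) || {};
  const result = {};
  for (const [scope, terms] of Object.entries(base)) {
    const dropped = new Set(exclude[scope] || []);
    result[scope] = Object.fromEntries(
      Object.entries(terms).filter(([termName]) => !dropped.has(termName))
    );
  }
  return result;
}

const ypmTerms = searchTermsForUnit(baseSearchTerms, unitSettings, 'ypm');
const luxTerms = searchTermsForUnit(baseSearchTerms, unitSettings, 'lux');

console.log(Object.keys(ypmTerms.item)); // [ 'name' ]
console.log(Object.keys(luxTerms.item)); // [ 'name', 'placeholderTermA' ]
```

A unit without an entry in the settings falls back to the full configuration, which would keep today's LUX behavior unchanged. A generator could call such a function once per unit to emit one output per unit-specific configuration.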