Research ability to "slice" data in support of unit portals #73

brent-hartwig commented 3 months ago

Problem Description: Individual Yale libraries and museums would like to utilize LUX functionality in branded web portals backed by LUX's MarkLogic database, yet LUX does not presently have a mechanism to restrict its backend endpoints to a single unit's data (plus the data it shares with other units).

Expected Behavior/Solution: Backend endpoints are able to:

1. Continue operating as we do today (i.e., all data: https://lux.collections.yale.edu/); or
2. Only interact with data associated to a single unit (which may include internal documents, external documents, and documents shared with multiple units).

For no. 2, there may be some endpoints that need to execute in no. 1's mode. Search is not expected to be one.

Requirements:

Support the above-listed expected behaviors.
Backend is able to determine which unit(s) a document is associated with.
Triple store and indexes also honor unit restrictions.
Solution does not impose a performance penalty when consuming backend endpoints whether it be for a single unit or all units (LUX).
There will be several follow-up requirements / TBDs once the feasibility study is further along; e.g., should the advanced search configuration endpoint be restricted to search terms that can return results for the requesting unit?

Needed for promotion: If an item on the list is not needed, it should be crossed off but not removed.

[x] Wireframe/Mockup - Outside the scope of this ticket. Individual units would be responsible for their frontend (and possible middle tier) but are encouraged to utilize what they can from LUX's frontend (and, if applicable, middle tier).
[x] Committee discussions - Already engaged with first candidate.
[x] Feasibility/Team discussion - Scope of this ticket.
[x] Backend requirements - See above
[x] Frontend requirements - Outside the scope of this ticket.
[x] Questions
~List of questions for discussions. Answers should be documented within the issue.~

UAT/LUX Examples: Search limited to a unit's records. Same search performed in LUX would return the same results plus those from other units that also meet the search criteria.

Dependencies/Blocks: A proof-of-concept may become blocked by a data change request.

Related Github Issues: None at this time.

Related links: None

Wireframe/Mockup: Outside the scope of this ticket.

brent-hartwig commented 3 months ago

Created the 73-research-data-slices branch off of main. Initially, everything is isolated within the research/data-slices directory (i.e., no runtime changes yet).

brent-hartwig commented 3 months ago

As of 14 Mar 24, one may set up the data slice PoC locally and prove out search. See https://github.com/project-lux/lux-marklogic/blob/73-research-data-slices/README-Data-Slice-PoC.md for details.

brent-hartwig commented 2 months ago

Status and Findings Updates

LUX by Unit tenant deployed within ML DEV on 1 May. Tenant has its own ML resources, including roles. No forest-level replication. By 3 May, the YCBA and YUAG dataset was loaded and the entire stack was operational, passing all smoke tests, though not yet thoroughly vetted. The middle tier was configured to the YCBA service account. A next step for the team is to stand up at least one additional frontend and either stand up another middle tier or extend the existing middle tier with additional connection pools, so that two frontends are available at the same time but configured to different service accounts.

Made some edits to facilitate additional ML tenants. Removed portions of the PoC that became obsolete given the better dataset. Updated / trimmed down PoC documentation. Updated checkTripleAndDocVisibilityByUser.js to align with new dataset and added controls to deal with the larger dataset. Per team decision on approach, still intending to fork this repo, move the data slice edits there, then abandon the 73-research-data-slices branch. Frontend and middle tier repos to follow suit, should unit portal changes be made.

In addition to the team beginning their testing in the frontends, I plan to double back to surface what other code changes may be necessary and highlight aspects to test. We're also a step closer to performance testing. We may need to wait for more units to be folded in. YCBA and YUAG dataset stats:

The lux-by-unit-reader role received the read permission to 592,983 documents.
The ycba-by-unit-reader role received the read permission to 164,840 documents.
The yuag-by-unit-reader role received the read permission to 446,161 documents.
18,104 documents are accessible to both YCBA and YUAG.
There are 86 documents that are incorrectly associated to the 'create' unit, 'update' unit, or no unit at all. Only the lux-by-unit-reader role was granted permission to them. In case they can assist testing, they may be identified using the following query.

// Docs not associated to the YCBA or YUAG units, yet available to LUX.
const uris = cts
  .uris(
    '',
    null,
    cts.andNotQuery(
      cts.collectionQuery('lux-by-unit'),
      cts.orQuery([cts.collectionQuery('yuag'), cts.collectionQuery('ycba')])
    )
  )
  .toArray();

// Make the URIs the script's result.
uris;
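
The stats above are internally consistent, which can be checked with inclusion-exclusion: the documents readable by YCBA or YUAG, subtracted from the documents readable by LUX, should leave the 86 documents that only the lux-by-unit-reader role can read. A quick arithmetic check in plain JavaScript, using only the counts reported above (the variable names are mine):

```javascript
// Counts reported above for the YCBA and YUAG dataset.
const luxReadable = 592983;  // lux-by-unit-reader
const ycbaReadable = 164840; // ycba-by-unit-reader
const yuagReadable = 446161; // yuag-by-unit-reader
const sharedByBoth = 18104;  // readable by both YCBA and YUAG

// Inclusion-exclusion: documents readable by at least one of the two units.
const readableByEitherUnit = ycbaReadable + yuagReadable - sharedByBoth;

// Documents readable only through the LUX-wide role.
const luxOnly = luxReadable - readableByEitherUnit;

console.log(readableByEitherUnit); // 592897
console.log(luxOnly); // 86
```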

The 593K docs loaded in 196 seconds (via MLCP with 64 threads), which extrapolates to 4.5 hours to load the full dataset of 41M docs. That would be about half the normal load time. Dropping 1x forest-level replication explains why the load was faster. This proved out when loading the full (non-sliced) dataset into SBX after dropping its forest-level replication. As for data slices, it should be noted we introduced a transform that reaches into the documents to determine which roles to grant read permission. We wondered if this would have a significant performance impact; apparently not. Should this initiative reach TST, we'll be able to compare, as we intend to keep forest-level replication in TST and PRD.
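
The transform itself is not shown in this thread. As a rough illustration only, the role-derivation step could look something like the sketch below. It is plain JavaScript with no MarkLogic calls. Assumptions: the `units` property and the `KNOWN_UNITS` allowlist are hypothetical (the real documents may record unit association elsewhere), and the role names simply follow the `<unit>-by-unit-reader` pattern seen in the stats above.

```javascript
// Role that can read every document in the tenant (name taken from the stats above).
const ALL_UNITS_ROLE = 'lux-by-unit-reader';

// HYPOTHETICAL allowlist: only these units have a dedicated reader role.
const KNOWN_UNITS = new Set(['ycba', 'yuag']);

// Derive the roles that should receive read permission for one document.
// ASSUMPTION: the document lists its units in a `units` array.
function readerRolesFor(doc) {
  const units = Array.isArray(doc.units) ? doc.units : [];
  const unitRoles = units
    .map((unit) => String(unit).trim().toLowerCase())
    .filter((unit) => KNOWN_UNITS.has(unit)) // e.g., drops 'create' and 'update'
    .map((unit) => `${unit}-by-unit-reader`);
  // A Set removes duplicates while keeping insertion order.
  return [...new Set([ALL_UNITS_ROLE, ...unitRoles])];
}

console.log(readerRolesFor({ units: ['ycba', 'yuag'] }));
// [ 'lux-by-unit-reader', 'ycba-by-unit-reader', 'yuag-by-unit-reader' ]
console.log(readerRolesFor({ units: ['create'] }));
// [ 'lux-by-unit-reader' ]
```

In an actual MLCP transform, the returned role names would presumably be converted into read permissions on the document being loaded; that wiring is omitted here.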

More dataset stats! Actual counts via SPARQL:

17,145,290 triples
592,983 unique subjects
58 unique predicates
1,151,493 unique objects

Findings and/or draft conclusions via backend endpoints using different service accounts and scripts:

Dead-end triples exist. There's been a concern this can have a negative impact on related lists and possibly semantic facets. #108's scripted deep dive into related lists eased the concern but those findings (internal link) became suspect after implementing #119's script. That script surfaced additional dead-end triples which may be part of related list triple searches. Stay tuned.
Given a) the requesting unit has access to the search result document and b) the search result document defines triples, then search criteria can include any data in the triples, even if the unit does not also have access to the document identified in a triple. Example: Unit A searches for Objects produced by Agent A and finds Object A; Unit A may access Object A but not Agent A ({"_scope":"item","producedBy":{"id":"https://lux.collections.yale.edu/data/agent/a"}}). This has direct implications for the Keyword and Hop with Field search patterns. Generically, the requesting unit must have access to the documents in which the search criteria are resolved. Depending on the search criteria, they may need to be resolved in the search result document or a related document.
When all search criteria are resolved in related documents, the requesting unit will still need access to the search result document for it to appear in the search results. Inverting the above example illustrates this: if Unit A searched for Agents that produced Object A (Hop Inverse search pattern), the search results would not include Agent A despite Object A declaring Agent A produced it ({"_scope":"agent","produced":{"id":"https://lux.collections.yale.edu/data/object/a"}}).
An important one for YPM: criteria are not resolved in another unit’s documents. Example: {"AND":[{"_lang":"en","name":"\"A Castle Tower, Caernarvon Castle\""},{"aboutPlace":{"name":"\"Caernarfonshire and Merionethshire\""}}]}
The stats endpoint uses cts.jsonPropertyValueQuery which has proven to honor document permissions, meaning responses vary by unit-specific service account.
Proved the document endpoint only returns documents the service account has access to and --separately-- that an amp can enable the endpoint to return any requested document (regardless of service account). Full expectation that we could do the equivalent for search yet still only return the requesting unit’s documents. Did not investigate a way to vary this behavior by unit (e.g., Unit A wants the ability to request any document while Unit B wants the endpoint to be restricted to its documents).
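
To make the Object A / Agent A examples above concrete, here is a toy in-memory model of the two behaviors. It is not MarkLogic code and does not reflect how LUX resolves searches internally; the document shapes, role names, and function names are all made up for illustration.

```javascript
// Toy model: each document has the roles allowed to read it and the triples it defines.
const toyDocs = [
  {
    uri: 'object/a',
    type: 'object',
    readers: ['unit-a', 'lux'],
    triples: [{ predicate: 'producedBy', object: 'agent/a' }],
  },
  { uri: 'agent/a', type: 'agent', readers: ['lux'], triples: [] },
];

const canRead = (role, doc) => doc.readers.includes(role);

// Criteria resolved in the search result document: find readable objects
// whose own triples say they were produced by the given agent.
function objectsProducedBy(role, agentUri) {
  return toyDocs
    .filter((doc) => doc.type === 'object' && canRead(role, doc))
    .filter((doc) =>
      doc.triples.some((t) => t.predicate === 'producedBy' && t.object === agentUri)
    )
    .map((doc) => doc.uri);
}

// Inverse direction: find agents named by a readable object's producedBy triple.
// The agent document must itself be readable to appear in the results.
function agentsThatProduced(role, objectUri) {
  const agentUris = toyDocs
    .filter((doc) => doc.uri === objectUri && canRead(role, doc))
    .flatMap((doc) => doc.triples)
    .filter((t) => t.predicate === 'producedBy')
    .map((t) => t.object);
  return toyDocs
    .filter((doc) => doc.type === 'agent' && agentUris.includes(doc.uri))
    .filter((doc) => canRead(role, doc))
    .map((doc) => doc.uri);
}

console.log(objectsProducedBy('unit-a', 'agent/a')); // [ 'object/a' ]
console.log(agentsThatProduced('unit-a', 'object/a')); // []
console.log(agentsThatProduced('lux', 'object/a')); // [ 'agent/a' ]
```

Unit A finds Object A through a triple that points at a document it cannot read, but the inverse search returns nothing for Unit A because the would-be result document (Agent A) is not readable by it.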

Lower-level functional findings derived from the hand-crafted mini dataset, some of which may be repetitive of the above:

The LUX search endpoint only returned documents the user has permission to read.
- All searches used the Keyword search pattern.
- The Keyword search pattern ORs non-semantic and semantic search criteria.
- The semantic search criteria portion matches the Hop with Field search pattern.
- Keywords are resolved against field indexes; however, a semantic hop to related documents is first required for the semantic search criteria.
- Using unique keywords, MarkLogic proved to apply document permissions when resolving values against field ranges and finding related documents via triple store.
Bolstering search’s findings via script:
- cts.fieldValues only returns values from documents the requesting user may access.
- cts.triples only returns triples defined in documents the requesting user may access.
- fn.docAvailable returns false for documents identified by a triple’s IRI that the requesting user does not have access to, thereby validating “dead-end” triple paths exist.
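
The dead-end check described in the last bullet boils down to filtering triples by whether the document named in the object position is available to the caller. A minimal sketch of that logic in plain JavaScript, with the MarkLogic pieces stubbed out: `findDeadEndTriples`, the triple shape, and the IRI prefix test are my own assumptions, not code from the PoC scripts.

```javascript
// ASSUMPTION: objects that start with this prefix identify documents;
// anything else (e.g., a literal value) is ignored.
const DATA_IRI_PREFIX = 'https://lux.collections.yale.edu/data/';

// Return the triples whose object names a document the caller cannot read.
// `docAvailable` is a function reporting whether a URI is readable by the caller.
function findDeadEndTriples(triples, docAvailable) {
  return triples.filter(
    (triple) =>
      typeof triple.object === 'string' &&
      triple.object.startsWith(DATA_IRI_PREFIX) &&
      !docAvailable(triple.object)
  );
}

// Stand-in for a document availability check: only these URIs are readable.
const readableUris = new Set([`${DATA_IRI_PREFIX}object/a`]);

const sampleTriples = [
  {
    subject: `${DATA_IRI_PREFIX}object/a`,
    predicate: 'producedBy',
    object: `${DATA_IRI_PREFIX}agent/a`,
  },
  { subject: `${DATA_IRI_PREFIX}object/a`, predicate: 'name', object: 'Object A' },
];

const deadEnds = findDeadEndTriples(sampleTriples, (uri) => readableUris.has(uri));
console.log(deadEnds.map((t) => t.object));
// [ 'https://lux.collections.yale.edu/data/agent/a' ]
```

Inside MarkLogic, the triples would come from something like cts.triples and the availability check from fn.docAvailable, both run as the unit's service account; converting their results into the plain objects used here is left out.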

brent-hartwig commented 6 days ago

Met w/ Jeff on 26 Jun 24. Received the green light to work the following into current priorities:

Test and document the behaviors of the Hop Inverse search pattern.
Reconcile differences between #108 (risk limited to events) and #119 (risk is more widespread), in the context of trimming dead-end triples from related lists.
Investigate whether dead-end triples result in false positive semantic facet values.
Unit-specific configuration files:
- Identify search terms that are irrelevant to YPM.
- Introduce means to restrict search terms (if not also search scopes*) by unit.
- Modify the remaining search term generator to apply unit-specific settings and produce unit-specific versions.
- Given some search terms are generated from facetsConfig.mjs, TBD whether we also need to restrict those by unit. This may be the tipping point to fold facetsConfig.mjs into searchTermConfig.mjs.
- Modify remaining generators to create one output per unit-specific search term configuration.
- Modify endpoints to use the correct configuration.

* Anticipated to be straightforward within the search term configuration file and generated outputs. TBD if searchScope.mjs needs to become unit-aware.
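
One possible shape for the "restrict search terms by unit" step, sketched in plain JavaScript. Everything here is hypothetical: the structure of the base configuration, the `excludedScopes` / `excludedTerms` settings, and the unit name are invented for illustration and do not reflect the actual contents of searchTermConfig.mjs.

```javascript
// HYPOTHETICAL base configuration: search scope -> search term name -> term settings.
const baseSearchTerms = {
  item: { name: {}, producedBy: {}, aboutPlace: {} },
  agent: { name: {}, produced: {} },
};

// HYPOTHETICAL per-unit restrictions.
const unitRestrictions = {
  'unit-a': { excludedScopes: ['agent'], excludedTerms: { item: ['aboutPlace'] } },
};

// Produce a unit-specific copy of the search term configuration.
// Units without restrictions receive a copy of the full configuration.
function searchTermsForUnit(baseTerms, restrictions, unitName) {
  const rule = restrictions[unitName] || {};
  const excludedScopes = new Set(rule.excludedScopes || []);
  const excludedTerms = rule.excludedTerms || {};
  const result = {};
  for (const [scope, terms] of Object.entries(baseTerms)) {
    if (excludedScopes.has(scope)) continue;
    const dropped = new Set(excludedTerms[scope] || []);
    result[scope] = Object.fromEntries(
      Object.entries(terms).filter(([termName]) => !dropped.has(termName))
    );
  }
  return result;
}

console.log(JSON.stringify(searchTermsForUnit(baseSearchTerms, unitRestrictions, 'unit-a')));
// {"item":{"name":{},"producedBy":{}}}
```

A generator could call such a function once per unit to emit one output per unit-specific configuration, leaving the base configuration as the single source of truth.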
