Open brent-hartwig opened 3 months ago
Created the 73-research-data-slices branch off of main. Initially, everything is isolated within the research/data-slices directory (i.e., no runtime changes yet).
As of 14 Mar 24, one may set up the data slice PoC locally and prove out search. See https://github.com/project-lux/lux-marklogic/blob/73-research-data-slices/README-Data-Slice-PoC.md for details.
LUX by Unit tenant deployed within ML DEV on 1 May. Tenant has its own ML resources, including roles. No forest-level replication. By 3 May, the YCBA and YUAG dataset was loaded and the entire stack was operational, passing all smoke tests though not yet thoroughly vetted. The middle tier was configured to use the YCBA service account. A next step for the team is to stand up at least one additional frontend and either stand up another middle tier or extend the existing middle tier with additional connection pools, so that two frontends are available at the same time but configured to different service accounts.
Made some edits to facilitate additional ML tenants. Removed portions of the PoC that became obsolete given the better dataset. Updated / trimmed down PoC documentation. Updated checkTripleAndDocVisibilityByUser.js to align with new dataset and added controls to deal with the larger dataset. Per team decision on approach, still intending to fork this repo, move the data slice edits there, then abandon the 73-research-data-slices branch. Frontend and middle tier repos to follow suit, should unit portal changes be made.
In addition to the team beginning their testing in the frontends, I plan to double back to surface what other code changes may be necessary and highlight aspects to test. We're also a step closer to performance testing. We may need to wait for more units to be folded in. YCBA and YUAG dataset stats:
// Docs not associated to the YCBA or YUAG units, yet available to LUX.
const uris = cts
  .uris(
    '',
    null,
    cts.andNotQuery(
      cts.collectionQuery('lux-by-unit'),
      cts.orQuery([cts.collectionQuery('yuag'), cts.collectionQuery('ycba')])
    )
  )
  .toArray();
The 593K docs loaded in 196 seconds (via MLCP with 64 threads), which extrapolates to 4.5 hours to load the full dataset of 41m docs. That would be about half the normal load time. Dropping 1x forest-level replication explains why the load was faster. This proved out when loading the full (non-sliced) dataset into SBX after dropping its forest-level replication. As for data slices, it should be noted we introduced a transform that reaches into the documents to determine which roles to grant read permission; we wondered if this would have a significant performance impact. Apparently not. Should this initiative reach TST, we'll be able to compare, as we intend to keep forest-level replication in TST and PRD.
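For anyone wanting to re-run the load-time estimate as more units are folded in, here is a minimal sketch of a straight-line extrapolation. It assumes throughput stays constant as volume grows, and `extrapolateLoadHours` is a hypothetical helper name, not something in the repo. Using only the rounded figures quoted above, it comes out nearer 3.8 hours than 4.5; the difference may reflect unrounded counts or overhead not stated here.

```javascript
// Straight-line extrapolation of MLCP load time from the sample load.
// Inputs are the rounded figures quoted in this thread.
const sampleDocs = 593_000; // docs in the YCBA + YUAG slice
const sampleSeconds = 196; // observed load time for the slice
const fullDocs = 41_000_000; // approximate size of the full dataset

// Hypothetical helper: assumes throughput stays constant as volume grows.
function extrapolateLoadHours(docs, seconds, targetDocs) {
  const docsPerSecond = docs / seconds;
  return targetDocs / docsPerSecond / 3600;
}

const docsPerSecond = sampleDocs / sampleSeconds;
const estimatedHours = extrapolateLoadHours(sampleDocs, sampleSeconds, fullDocs);

console.log(`${Math.round(docsPerSecond)} docs/sec`);
console.log(`${estimatedHours.toFixed(2)} hours for the full dataset`);
```

Swapping in exact document counts and the measured seconds from a later load should give a tighter estimate.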
More dataset stats! Actual counts via SPARQL:
Findings and/or draft conclusions via backend endpoints using different service accounts and scripts:
* ({"_scope":"item","producedBy":{"id":"https://lux.collections.yale.edu/data/agent/a"}}). This has direct implications on the Keyword and Hop with Field search patterns. Generically, the requesting unit must have access to the documents that the search criteria are resolved within. Depending on the search criteria, they may need to be resolved in the search result document or a related document.
* ({"_scope":"agent","produced":{"id":"https://lux.collections.yale.edu/data/object/a"}}).
* {"AND":[{"_lang":"en","name":"\"A Castle Tower, Caernarvon Castle\""},{"aboutPlace":{"name":"\"Caernarfonshire and Merionethshire\""}}]}
* The stats endpoint uses cts.jsonPropertyValueQuery, which has proven to honor document permissions, meaning responses vary by unit-specific service account.
* The document endpoint only returns documents the service account has access to; separately, an amp can enable the endpoint to return any requested document (regardless of service account). Full expectation that we could do the equivalent for search yet still only return the requesting unit’s documents. Did not investigate a way to vary this behavior by unit (e.g., Unit A wants the ability to request any document while Unit B wants the endpoint to be restricted to its documents).
Lower-level functional findings derived from the hand-crafted mini dataset, some of which may be repetitive of the above:
* cts.fieldValues only returns values from documents the requesting user may access.
* cts.triples only returns triples defined in documents the requesting user may access.
* fn.docAvailable returns false for documents identified by a triple’s IRI that the requesting user does not have access to, thereby validating “dead-end” triple paths exist.
Met w/ Jeff on 26 Jun 24. Received the green light to work the following into current priorities:
* Anticipated to be straightforward within the search term configuration file and generated outputs. TBD if searchScope.mjs needs to become unit-aware.
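To make the earlier lower-level findings (cts.triples, fn.docAvailable, and the "dead-end" triple paths) easier to reason about, here is a toy in-memory model in plain JavaScript. It is not MarkLogic code: the URIs, role names, and the functions `docAvailable`, `visibleTriples`, and `deadEndTriples` are all hypothetical stand-ins, and treating any string starting with `/` as a document IRI is a simplifying assumption.

```javascript
// Toy in-memory model (not MarkLogic code). Documents carry read roles;
// triples live inside documents and may point at other documents by IRI.
const docs = {
  '/object/1': { readRoles: ['ycba'], triples: [['/object/1', 'producedBy', '/agent/a']] },
  '/object/2': { readRoles: ['yuag'], triples: [['/object/2', 'producedBy', '/agent/a']] },
  '/agent/a': { readRoles: ['yuag'], triples: [['/agent/a', 'name', 'Agent A']] },
};

// Stand-in for fn.docAvailable: true only if the doc exists and the role may read it.
function docAvailable(uri, role) {
  const doc = docs[uri];
  return Boolean(doc) && doc.readRoles.includes(role);
}

// Stand-in for cts.triples: only triples from readable documents are returned.
function visibleTriples(role) {
  return Object.keys(docs)
    .filter((uri) => docAvailable(uri, role))
    .flatMap((uri) => docs[uri].triples);
}

// A dead end is a visible triple whose object IRI names a doc the role cannot read.
// Assumption: any object starting with '/' is a document IRI; anything else is a literal.
function deadEndTriples(role) {
  return visibleTriples(role).filter(
    ([, , object]) => object.startsWith('/') && !docAvailable(object, role)
  );
}

const ycbaTriples = visibleTriples('ycba');
const ycbaDeadEnds = deadEndTriples('ycba');
const yuagDeadEnds = deadEndTriples('yuag');
console.log({ ycbaTriples, ycbaDeadEnds, yuagDeadEnds });
```

In this model the ycba role can see that its object was produced by an agent but cannot read the agent document, which mirrors why a search whose criteria resolve in a related document depends on the requesting unit having access to that document.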
Problem Description: Individual Yale libraries and museums would like to utilize LUX functionality in branded web portals backed by LUX's MarkLogic database yet LUX does not presently have a mechanism to restrict its backend endpoints to a single unit's data (plus the data it shares with other units).
Expected Behavior/Solution: Backend endpoints are able to:
For no. 2, there may be some endpoints that need to execute in no. 1's mode. Search is not expected to be one.
Requirements:
Needed for promotion: If an item on the list is not needed, it should be crossed off but not removed.
UAT/LUX Examples: Search limited to a unit's records. Same search performed in LUX would return the same results plus those from other units that also meet the search criteria.
Dependencies/Blocks: A proof-of-concept may become blocked by a data change request.
Related GitHub Issues: None at this time.
Related links: None
Wireframe/Mockup: Outside the scope of this ticket.