project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

Research Sorting By Property With A Specific Sibling Property (from 1067) #38

Closed gigamorph closed 3 months ago

gigamorph commented 4 months ago

Problem Description: Currently we cannot sort by the sort_id that matches a specific set ID.

Per this Teams thread, an object can have a sort key for each set it's in. Figure out how to sort by [SORT_ID which matches specific set ID].

Expected Behavior/Solution: Be able to sort by the sort_id that matches a specific set ID.

Requirements:

Needed for promotion: If an item on the list is not needed, it should be crossed off but not removed.

- [ ] Wireframe/Mockup - Mike

UAT/LUX Examples:

Dependencies/Blocks:

Related Github Issues: resources:

Related links:

Wireframe/Mockup: Place wireframe/mockup for the proposed solution at end of ticket.

clarkepeterf commented 4 months ago

This is the issue that was discussed as going to @brent-hartwig

kkdavis14 commented 4 months ago

vote to prioritize this if at all possible, thanks!

brent-hartwig commented 3 months ago

@jffcamp, @prowns, @azaroth42, @kkdavis14, and @clarkepeterf, three implementation choices come to mind:

  1. Optic. In Optic, it is relatively trivial to select an additional piece of data from a document as a column, then sort by that column. The issue with this option is that we are still assessing whether to switch to Optic.
  2. CTS and range indexes. When asking https://docs.marklogic.com/cts.search to order the search results by a piece of data in the documents, that data must be in a range index. As such, and based on https://git.yale.edu/lux-its/marklogic/issues/1067#issuecomment-25481, we would need to index /json/identified_by/content[../assigned_by/motivated_by/id] for each unique motivated_by ID. The search request would then need to provide enough information for the backend to figure out which range index to use. I'm not sure the preceding XPath is supported for indexing content, meaning a data change might be required to support this option.
  3. Search with CTS and sort with custom code. This option is not expected to scale well. How many search results might we encounter?

kkdavis14 commented 3 months ago

I've been wondering why we're accounting for multiple archival sorting numbers, when I don't see that possibility in this use case (i.e. why in no.2 you would need to index each unique motivated_by ID). I checked the data and for the 2.2 million recs with this pattern, there is only ever one assigned_by/motivated_by per rec. That's because anything in an archive is only in its one hierarchy; it can't be in multiple (e.g. Peter's example from the original ticket can't ever occur for Archival things).

In current data only Archival things are getting a sorting identifier, and they only ever have one motivating assignor at a time (the thing that's directly above them):

https://linked-art.library.yale.edu/node/d899a9c6-d814-4058-9b58-6ef7c68b536f has sorting id assigned_by Magazine advertisements for cigarettes by brand

https://linked-art.library.yale.edu/node/342d6b38-fc7d-4683-a076-dbec43ad5e73 has sorting id assigned_by Series I: Regular Size

https://linked-art.library.yale.edu/node/849284f0-9d36-4d6e-9723-0d296559202c has sorting id assigned_by William Van Duyn Tobacco Advertisement Collection

https://linked-art.library.yale.edu/node/523cebc7-5a18-4fa9-ae65-3c6ec6c72048

@azaroth42 Is this to account for some future where something else would be leveraging this pattern? Right now it's solving a need that doesn't exist.

prowns commented 3 months ago

@kkdavis14 - RS mentioned this use case in a different convo about this topic yesterday: the only time it would ever matter is if the same thing were in two different sets for the purposes of sorting by different sort identifiers; you'd need to know which one to use. That is possible in the future (e.g. sort within archive vs. sort within personal collection vs. sort within exhibition), but for today, only archives need the explicit sort id.

kkdavis14 commented 3 months ago

ok, I would just want to not delay fixing this for the existing use case while we figure out how it works for future use cases we don't have yet. I don't know if that's the situation or not.

that being said, to answer no.3 of Brent's question of "how many search results", with the theoretical situations from Sarah's comment, the answer could be infinite.

roamye commented 3 months ago

No longer blocked since we have a SOW from @brent-hartwig

brent-hartwig commented 3 months ago

The following was written under the impression that an item could be part of multiple archives rather than multiple collections (a hierarchy of collections within a single archive).

Two approaches have been functionally proven out.

Opening notes:

Post-Search Sorting

Via XPath, custom JavaScript is able to retrieve the archive-specific value to sort by. This approach requires the documents to be pulled from disk; however, LUX's CTS search implementation already does so. As such, the additional time may be limited to a) executing the XPath in each search result, b) sorting those values, and c) selecting the subset of results per pagination parameters.

This approach is not expected to scale as well as the Optic approach; however, the archive with the most items directly associated with it has about 6,000 items. While the system was otherwise quiet, a search for all items in this archive plus sorting by this archive's sort values took just under 1.5 seconds.
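
To make the three steps concrete, here is a minimal sketch in plain JavaScript rather than MarkLogic server-side code; it is not the attached script. It assumes each search result is already loaded as an object whose shape mirrors the path /json/identified_by/content[../assigned_by/motivated_by/id] mentioned earlier in this thread, and that identified_by, assigned_by, and motivated_by are arrays. The function names sortValueFor and sortAfterSearch are hypothetical.

```javascript
// Hypothetical helper: return the sort value a document carries for one set,
// or null when the document has no sort value assigned for that set.
// Assumed shape: doc.json.identified_by[].assigned_by[].motivated_by[].id
function sortValueFor(doc, setId) {
  const identifiers = (doc.json && doc.json.identified_by) || [];
  for (const identifier of identifiers) {
    const assignments = identifier.assigned_by || [];
    const matchesSet = assignments.some((assignment) =>
      (assignment.motivated_by || []).some((motive) => motive.id === setId)
    );
    if (matchesSet) return identifier.content;
  }
  return null;
}

// (a) extract the set-specific value from every result,
// (b) sort by that value, placing results without a value last,
// (c) return only the requested page (page numbers start at 1).
function sortAfterSearch(results, setId, page, pageLength) {
  const keyed = results.map((doc) => ({ doc, key: sortValueFor(doc, setId) }));
  keyed.sort((a, b) => {
    if (a.key === null && b.key === null) return 0;
    if (a.key === null) return 1;
    if (b.key === null) return -1;
    return a.key < b.key ? -1 : a.key > b.key ? 1 : 0;
  });
  const start = (page - 1) * pageLength;
  return keyed.slice(start, start + pageLength).map((entry) => entry.doc);
}
```

Because every result has to be keyed and sorted before the page can be sliced, the cost grows with the total number of search results rather than with the page size, which is why this approach is sensitive to archive size.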

luxSortAfterSearch.js.txt

Sort via Optic

LUX implements search via CTS; however, the generated CTS query can be given to Optic's op.fromSearch. Combined with one additional triple per item and archive pair, the results can be sorted by the search result's sort value for a specified archive.

The code requires triples whereby the subject is the search result IRI/URI, the predicate is the archive ID, and the object is the value to sort by. The predicate may justifiably be questioned, potentially resulting in one or more different triple patterns.
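
The triple pattern can be illustrated with a small plain-JavaScript model of the join. This does not call the Optic API and is not the attached script; it only shows the logic of matching search result URIs against triples for one archive and ordering by the object value. The function name sortBySetTriples and the { subject, predicate, object } record shape are assumptions made for the illustration.

```javascript
// Plain-JavaScript model of the join described above; it does not use Optic.
// Each triple is assumed to look like { subject, predicate, object }, where
// subject = search result URI, predicate = archive ID, object = sort value.
function sortBySetTriples(resultUris, triples, archiveId) {
  // Index the sort values that belong to the requested archive.
  const sortValueByUri = new Map();
  for (const triple of triples) {
    if (triple.predicate === archiveId) {
      sortValueByUri.set(triple.subject, triple.object);
    }
  }
  // Inner join: keep only results that have a triple for this archive,
  // then order them by the joined sort value.
  return resultUris
    .filter((uri) => sortValueByUri.has(uri))
    .sort((a, b) => {
      const x = sortValueByUri.get(a);
      const y = sortValueByUri.get(b);
      return x < y ? -1 : x > y ? 1 : 0;
    });
}
```

Note that an inner join drops any search result lacking a triple for the requested archive. Whether such results should instead be kept and sorted last is a design choice this sketch does not settle.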

This approach was only functionally tested. If this approach is pursued, it will need to be tested at scale after the triples are added to the full dataset.

Optic search results are not filtered, which can lead to false positives when the search criteria cannot be resolved via indexes alone. An example is punctuation. While LUX's CTS search implementation presently filters search results, its unfiltered contexts (estimates and facets) have the same limitations.

luxSortWithOptic.js.txt

brent-hartwig commented 3 months ago

Research complete. Team decided #90 is the next step for this.