gigamorph closed this issue 2 months ago
Corrected the title. Upon cleaning up the tickets leading up to the migration, the title of this ticket was inadvertently swapped with ticket 989.
@jffcamp, @prowns, @azaroth42, @kkdavis14, and @clarkepeterf,
An internal inquiry asked which of the following approaches would perform better, where ‘ab’ is a predicate shared by many triples that could be split into two subsets with predicates ‘a’ and ‘b’.
The belief is that no. 1 would be marginally faster than the other two. Between nos. 2 and 3, one may well be faster than the other (though probably neither faster than no. 1).
Replacing cts.tripleRangeQuery with cts.triples, no. 1 is believed to be faster as well.
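The numbered options themselves are not reproduced in this thread, so the following is a guess at their shape: a minimal in-memory sketch (plain Node.js, not MarkLogic SJS; the triple data and predicate names are invented) assuming no. 1 is one lookup against the merged predicate ‘ab’, no. 2 is one lookup per split predicate, and no. 3 is one lookup listing both split predicates.

```javascript
// Illustrative only: an in-memory stand-in for the triple index.
// Predicates 'ab', 'a', and 'b' mirror the hypothetical split above.
const mergedStore = [
  { s: 'item1', p: 'ab', o: 'x' },
  { s: 'item2', p: 'ab', o: 'y' },
  { s: 'item3', p: 'ab', o: 'z' },
];

// The same triples after splitting predicate 'ab' into 'a' and 'b'.
const splitStore = [
  { s: 'item1', p: 'a', o: 'x' },
  { s: 'item2', p: 'b', o: 'y' },
  { s: 'item3', p: 'a', o: 'z' },
];

// Stand-in for a cts.triples-style lookup: filter by predicate list.
function triples(store, predicates) {
  return store.filter((t) => predicates.includes(t.p));
}

// No. 1 (assumed): one call against the merged predicate.
const option1 = triples(mergedStore, ['ab']);
// No. 2 (assumed): one call per split predicate.
const option2 = [...triples(splitStore, ['a']), ...triples(splitStore, ['b'])];
// No. 3 (assumed): one call listing both split predicates.
const option3 = triples(splitStore, ['a', 'b']);

// All three process the same number of triples; only the number of
// predicates and calls differs -- which is the performance question.
console.log(option1.length, option2.length, option3.length); // 3 3 3
```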
The sort order of the triple results could be different. When relevant, the difference would become more noticeable as the number of processed triples goes up, with operations such as sort-merge join, sort-based grouping, and order by. No. 1 explicitly offers a sort performance advantage: “Accessing a single predicate gives more options for the query to find the sort it needs directly from the index.”
No. 1 is what we have now, at least for lux('carries_or_shows'). I'm not sure how many other predicates we would want to merge should that approach prove faster, as merging costs granularity. I see more potential if no. 2 or 3 proves faster than no. 1.
If there is interest in pursuing this, we have a few options to drive towards a definitive answer, such as splitting lux('itemAny') by adding lux('digitalObjectAny') and lux('humanMadeObjectAny'). We could also revisit this if we stick with CTS and need more speed.
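A toy sketch (plain Node.js; the triples and the type-to-predicate mapping are invented for illustration) of what splitting lux('itemAny') into lux('digitalObjectAny') and lux('humanMadeObjectAny') would mean for the data:

```javascript
// Hypothetical itemAny triples; 'type' is invented metadata used here
// to decide which split predicate each triple would move to.
const itemAnyTriples = [
  { s: 'obj1', p: 'itemAny', o: 'v1', type: 'DigitalObject' },
  { s: 'obj2', p: 'itemAny', o: 'v2', type: 'HumanMadeObject' },
  { s: 'obj3', p: 'itemAny', o: 'v3', type: 'HumanMadeObject' },
];

// Re-tag each triple with the more granular predicate.
const splitTriples = itemAnyTriples.map((t) => ({
  ...t,
  p: t.type === 'DigitalObject' ? 'digitalObjectAny' : 'humanMadeObjectAny',
}));

const digital = splitTriples.filter((t) => t.p === 'digitalObjectAny');
const humanMade = splitTriples.filter((t) => t.p === 'humanMadeObjectAny');

// The split preserves the triple count: a query using both new
// predicates processes exactly the triples itemAny covered before.
console.log(digital.length + humanMade.length === itemAnyTriples.length); // true
```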
Given that no. 1 is what we have, I propose we close this as done; no change is needed (until Optic, anyway).
Closing per UAT
The text search pattern for collection items includes one call to cts.triples, whose CTS query parameter is the standard text search pattern for objects. As documented in https://git.yale.edu/lux-its/marklogic/issues/986#issuecomment-21704, we noticed that this call to cts.triples alone could take between 1,100 ms and 9,000 ms to return 242K or 540K triples.
The query below is the call to cts.triples(objectTextQuery).toArray.map.length, where objectTextQuery is the standard text search pattern for 'history'. It was clocked at 9 seconds with cold caches and 3.5 seconds with warm caches. These times may be pegged to the following group-level cache settings. Findings from #920 may influence future settings.
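The cold-versus-warm gap can be illustrated with a toy cache (plain Node.js; the memo below is only a stand-in, and MarkLogic's group-level caches are far more involved): the first run pays the expensive read, repeat runs are served from cache.

```javascript
// Toy illustration of why the same query clocks in faster on the
// second run: results fetched expensively the first time are served
// from a cache afterwards. 'diskReads' counts the expensive path.
const cache = new Map();
let diskReads = 0;

function runQuery(key) {
  if (cache.has(key)) return cache.get(key); // warm: served from cache
  diskReads += 1;                            // cold: pays the expensive read
  const result = `results-for-${key}`;
  cache.set(key, result);
  return result;
}

runQuery('history'); // cold run (9 s in the real measurement)
runQuery('history'); // warm run (3.5 s in the real measurement)
console.log(diskReads); // 1
```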
The query uses the lux('carries_or_shows') and lux('itemAny') predicates, as well as the itemAnyText and referencePrimaryName range indexes. The scope of this ticket is to ask whether it is possible to speed up the query by splitting the lux('carries_or_shows') and/or lux('itemAny') sets of triples. The hope is that a single query can access more of the triple store in less time by increasing the number of calls to cts.triples and/or cts.tripleRangeQuery. The same number of triples would be processed; we would just be using more predicates.

The dataset does not presently have the more granular triples, specifically for lux('itemAny'). Our approach is to first make this a theoretical discussion with MarkLogic Support / Engineering, then decide if there is sufficient potential to test, at which point new triples would be needed. As the scope of #989, we are also investigating the possibility of replacing all uses of triples in search with range indexes.