Determine if more predicates can improve semantic search performance (from 988)

gigamorph commented 4 months ago

The text search pattern for collection items includes one call to cts.triples. Its CTS query parameter is the standard text search pattern, for objects. As documented in https://git.yale.edu/lux-its/marklogic/issues/986#issuecomment-21704, we noticed the call to cts.triples alone could take between 1,100 ms and 9,000 ms to return 242K or 540K triples.

The query below is, in shorthand, cts.triples(objectTextQuery).toArray().map(...).length, where objectTextQuery is the standard text search pattern for 'history'. It took 9 seconds with cold caches and 3.5 seconds with warm caches. These times may depend on the following group-level cache settings. Findings from #920 may influence future settings.

    "list-cache-size": 16384,
    "list-cache-partitions": 6,
    "compressed-tree-cache-size": 8192,
    "compressed-tree-cache-partitions": 11,
    "expanded-tree-cache-size": 16384,
    "expanded-tree-cache-partitions": 11,
    "triple-cache-size": 16384,
    "triple-cache-partitions": 16,
    "triple-value-cache-size": 32768,
    "triple-value-cache-partitions": 32,
    "compressed-tree-read-size": 32,
    "triple-cache-timeout": 86400,
    "triple-value-cache-timeout": 86400,

The query uses the lux('carries_or_shows') and lux('itemAny') predicates, as well as the itemAnyText and referencePrimaryName range indexes. The scope of this ticket is to determine whether the query can be sped up by splitting the lux('carries_or_shows') and/or lux('itemAny') sets of triples. The hope is that a single query can access more of the triple store in less time by increasing the number of calls to cts.triples and/or cts.tripleRangeQuery. The same number of triples would be processed; we would just be using more predicates.

The dataset does not presently have the more granular triples, specifically for lux('itemAny'). Our approach is to first treat this as a theoretical discussion with MarkLogic Support / Engineering, then decide whether there is sufficient potential to test, at which point new triples would be needed. Under #989, we are also investigating the possibility of replacing all uses of triples in search with range indexes.

const op = require('/MarkLogic/optic');
const lux = op.prefixer('https://lux.collections.yale.edu/ns/');

const term = 'history';

cts
  .triples(
    [],
    [lux('carries_or_shows')],
    [],
    '=',
    ['eager', 'concurrent'],
    cts.orQuery([
      cts.fieldWordQuery(
        ['itemAnyText'],
        [term],
        [
          'case-insensitive',
          'diacritic-insensitive',
          'punctuation-insensitive',
          'whitespace-insensitive',
          'stemmed',
          'wildcarded',
        ],
        1
      ),
      cts.tripleRangeQuery(
        [],
        [lux('itemAny')],
        fn.insertBefore(
          cts.values(
            cts.iriReference(),
            '',
            ['eager', 'concurrent'],
            cts.fieldWordQuery(
              ['referencePrimaryName'],
              [term],
              [
                'case-insensitive',
                'diacritic-insensitive',
                'punctuation-insensitive',
                'whitespace-insensitive',
                'stemmed',
                'wildcarded',
              ],
              1
            )
          ),
          0,
          sem.iri('/does/not/exist')
        ),
        '=',
        [],
        1
      ),
    ])
  )
  .toArray()
  .map((x) => sem.tripleObject(x))
  .concat(sem.iri('/does/not/exist'))
  .length;
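To make the cold versus warm comparison for a query like the one above repeatable, the same call can be timed several times in a row, reporting the first run separately from the rest. The following is a minimal sketch in plain JavaScript, not code from this ticket: timeRuns and summarize are hypothetical helper names, and the stand-in workload would be replaced by a function wrapping the cts.triples call. Treating the first run as "cold" assumes the caches were actually cleared before the script ran.

```javascript
// Hypothetical helper: run `fn` `runs` times and record each duration in ms.
function timeRuns(fn, runs) {
  const durations = [];
  for (let i = 0; i < runs; i++) {
    const start = Date.now();
    fn();
    durations.push(Date.now() - start);
  }
  return durations;
}

// Report the first run (cold, if caches were cleared beforehand)
// separately from the average of the later (warm) runs.
function summarize(durations) {
  const [first, ...rest] = durations;
  const warmAvg = rest.length
    ? rest.reduce((sum, d) => sum + d, 0) / rest.length
    : null;
  return { firstRunMs: first, warmAvgMs: warmAvg, runs: durations.length };
}

// Stand-in workload; replace with a function wrapping the query under test.
const standIn = () => {
  let total = 0;
  for (let i = 0; i < 100000; i++) total += i;
  return total;
};

const summary = summarize(timeRuns(standIn, 5));
```

Averaging only the later runs keeps a single slow first run from hiding the steady-state time, which is the number most affected by the cache settings listed earlier.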

brent-hartwig commented 3 months ago

Corrected the title. While cleaning up the tickets ahead of the migration, the title of this ticket was inadvertently swapped with that of ticket 989.

brent-hartwig commented 2 months ago

@jffcamp, @prowns, @azaroth42, @kkdavis14, and @clarkepeterf,

An internal inquiry asked which of the following would perform best, where 'ab' is a predicate shared by many triples that could be split into two subsets with predicates 'a' and 'b'.

1. cts.tripleRangeQuery([], ['ab'], objects)
2. cts.tripleRangeQuery([], ['a', 'b'], objects)
3. cts.orQuery([cts.tripleRangeQuery([], ['a'], objects), cts.tripleRangeQuery([], ['b'], objects)])

The belief is that no. 1 would be marginally faster than the other two. Between nos. 2 and 3, it is possible for one of those to be faster than the other (but probably not faster than no. 1).

If cts.tripleRangeQuery is replaced with cts.triples, no. 1 is still believed to be the fastest.
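The three forms are meant to select the same triples; only the number of predicates differs. The toy model below illustrates that in plain JavaScript. It is a sketch under loud assumptions: the data is invented, match is a hypothetical stand-in that filters an in-memory array, and nothing here reflects how MarkLogic actually resolves the queries or how long they take.

```javascript
// Invented data: the same four triples, once under the combined predicate
// 'ab' and once split across the granular predicates 'a' and 'b'.
const combined = [
  { s: 'item1', p: 'ab', o: 'x' },
  { s: 'item2', p: 'ab', o: 'y' },
  { s: 'item3', p: 'ab', o: 'x' },
  { s: 'item4', p: 'ab', o: 'z' },
];
const split = [
  { s: 'item1', p: 'a', o: 'x' },
  { s: 'item2', p: 'b', o: 'y' },
  { s: 'item3', p: 'a', o: 'x' },
  { s: 'item4', p: 'b', o: 'z' },
];

// Hypothetical stand-in for a triple lookup: keep triples whose predicate
// and object both appear in the given lists.
function match(triples, predicates, objects) {
  return triples.filter(
    (t) => predicates.includes(t.p) && objects.includes(t.o)
  );
}

const objects = ['x', 'y'];

// Form 1: one combined predicate.
const form1 = match(combined, ['ab'], objects);
// Form 2: both granular predicates in a single call.
const form2 = match(split, ['a', 'b'], objects);
// Form 3: one call per granular predicate, results unioned (the OR).
const form3 = [
  ...match(split, ['a'], objects),
  ...match(split, ['b'], objects),
];

// Compare by subject/object pair, ignoring predicate and result order.
const key = (ts) => ts.map((t) => `${t.s}|${t.o}`).sort().join(',');
const sameMatches = key(form1) === key(form2) && key(form2) === key(form3);
```

In this model all three forms match the same subject/object pairs, but form 3 returns them grouped by predicate rather than in the original order.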

The sort order of the triple results could be different. When relevant, the difference would be more noticeable as the number of processed triples goes up, with operations such as sort-merge join, sort-based grouping, and order by. No. 1 explicitly offers a sort performance advantage: “Accessing a single predicate gives more options for the query to find the sort it needs directly from the index.”
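The sort-order point can be illustrated with a toy index in plain JavaScript. This is only a sketch: it assumes an index ordered by predicate first and subject second, which is an assumption made for illustration and not a statement about which orderings MarkLogic chooses. The data and the scan and isSorted helpers are hypothetical.

```javascript
// Toy index ordered by predicate, then subject (an assumed ordering).
const splitIndex = [
  { p: 'a', s: 'item1' },
  { p: 'a', s: 'item4' },
  { p: 'b', s: 'item2' },
  { p: 'b', s: 'item3' },
];

// The same four subjects under one combined predicate, same assumed ordering.
const combinedIndex = [
  { p: 'ab', s: 'item1' },
  { p: 'ab', s: 'item2' },
  { p: 'ab', s: 'item3' },
  { p: 'ab', s: 'item4' },
];

// Read subjects in index order for the requested predicates.
const scan = (idx, predicates) =>
  idx.filter((t) => predicates.includes(t.p)).map((t) => t.s);

// True when every element is >= the one before it.
const isSorted = (xs) => xs.every((x, i) => i === 0 || xs[i - 1] <= x);

const singleScan = scan(combinedIndex, ['ab']);
const multiScan = scan(splitIndex, ['a', 'b']);

const singleSorted = isSorted(singleScan);
const multiSorted = isSorted(multiScan);
```

Under that assumption, the single-predicate scan comes back already sorted by subject, while the two-predicate scan is sorted only within each predicate and would need a merge or sort step before a sort-dependent operation.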

No. 1 is what we have now, at least for lux('carries_or_shows'). If that approach were proven faster, I'm not sure how many other predicates we would want to merge, as merging would reduce granularity. I anticipate more potential if no. 2 or no. 3 is proven faster than no. 1.

If there is interest in pursuing this, we have a few options to drive towards a definitive answer:

Ask ML Support.
Ask the ML Community.
Try it ourselves. This option would require a dataset that includes both the combined and the separated triples (distinguished by predicate) that we would want to test. For example, this could include keeping lux('itemAny') and adding lux('digitalObjectAny') and lux('humanMadeObjectAny').
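For the third option, a test dataset could keep each itemAny triple and add a granular copy alongside it. The sketch below shows the idea in plain JavaScript on invented data. Everything in it is hypothetical: the type names, the type-to-predicate mapping, and the addGranularTriples helper are assumptions for illustration, and the real dataset would be produced by the data pipeline rather than by code like this.

```javascript
// Hypothetical mapping from an item's type to its granular predicate.
const granularPredicateByType = {
  DigitalObject: 'digitalObjectAny',
  HumanMadeObject: 'humanMadeObjectAny',
};

// Keep every original triple and, for each itemAny triple whose subject
// has a mapped type, add a copy that uses the granular predicate.
function addGranularTriples(triples, typeBySubject) {
  const added = [];
  for (const t of triples) {
    if (t.p !== 'itemAny') continue;
    const granular = granularPredicateByType[typeBySubject[t.s]];
    if (granular) added.push({ s: t.s, p: granular, o: t.o });
  }
  return triples.concat(added);
}

// Invented sample data.
const sample = [
  { s: 'obj1', p: 'itemAny', o: 'concept1' },
  { s: 'obj2', p: 'itemAny', o: 'concept1' },
];
const types = { obj1: 'DigitalObject', obj2: 'HumanMadeObject' };

const testSet = addGranularTriples(sample, types);
```

Keeping the combined triples next to the granular ones lets the single-predicate and multi-predicate forms be compared against the same data.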

We could also revisit this if we stick with CTS and need more speed.

azaroth42 commented 2 months ago

Given that 1 is what we have, propose close, done, no change needed (until Optic, anyway)

roamye commented 2 months ago

Closing per UAT

project-lux / lux-marklogic

Determine if more predicates can improve semantic search performance (from 988) #36