project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

Research means to resolve keyword search criteria against additional content (from 583) #11

Open gigamorph opened 4 months ago

gigamorph commented 4 months ago

Problem Description: Spawned from #537...

For Beta 1, we elected to go with a workaround that includes a new field named referencePrimaryName. That field has about half the values of the anyPrimaryName field.

When using the referencePrimaryName field on the three-term knossos ancient city search, we stopped getting the XDMP-XDQPINVREQ error and the search returned in under/around two seconds.

But, it omits some edge-case results. Examples:

  1. Unable to find books about paintings (work about work)
  2. TOC

Expected Behavior/Solution: After Beta 1, we'd like to re-engage with ML Engineering to see if there is a way to optimize, not omit results, and not get extra results. If that doesn't pan out, Rob had a 'TOC statement' denormalization idea the individual units could add to their data such that the edge case results are picked up by the search.

Requirements: TBD

Needed for promotion: If an item on the list is not needed, it should be crossed off but not removed.

- [ ] Wireframe/Mockup - Heather
- [ ] Committee discussions - Sarah

UAT/LUX Examples:

Dependencies/Blocks:

Related Github Issues:

Related links:

Wireframe/Mockup:

prowns commented 3 months ago

Scope of ticket - this is about the inability to search for works about objects (e.g., books about Dort). Next step is engagement with engineering.

jffcamp commented 3 months ago

@brent-hartwig, while we are engaged with engineering, we should discuss if this is something we should include as part of our current engagement. And, if you have the time, is this something you can take on?

brent-hartwig commented 3 months ago

@jffcamp, I just tried a five AND'd keyword search and a five OR'd keyword search in both Optic and CTS. Results are interesting. The main takeaway appears to be that CTS generally handles larger field indexes better than Optic: Optic either fails or takes longer. That said, CTS isn't necessarily fast enough to meet our target response times. To answer your question, yes, I believe we should add this to batch 03 of the comparison effort. And a question for you and @azaroth42: is primary name the extent of our ambition, or should we also seek improved performance with even larger fields, specifically alternative/equivalent names? We would need to add those indexes if we wanted to see how they perform.

Notes

The findings are a bit cryptic. These notes may help.

  1. I tested in DEV as SBX's dataset changed. As such, the following is pegged to ML 11.0.3 (DEV) as opposed to ML 11.1.0 (SBX).
  2. "06" and "07" reference specific queries from the Optic and CTS performance comparison.
  3. "ref" means only the referencePrimaryName field was used.
  4. "all" means that the primary name fields of all six search scopes were used.
  5. Caches were not cleared between queries.
  6. Queries are available in the attached Query Console workspace: primary-name-perf-comp-qc-workspace.xml.txt. After dropping the txt file extension, it may be imported into Query Console. Each query includes the fullTextRelatedDocsIndexes variable. Set it to either referencePrimaryName or allPrimaryNames.

Findings

  1. OPT-06: XDMP-XDQPINVREQ
  2. CTS-06 w/ ref: <300 ms
  3. CTS-06 w/ all: ~1100 ms
  4. OPT-06: SVC-MEMCANCELED: Canceled because of memory usage on host ip-10-5-156-154.ec2.internal, requestMemory=11549235840, totalMemory=11549235840, memoryLimit=11537481728, opID=10404615080181537886, opMem=14447245568
  5. OPT-06 w/ ref: 308ms, 125ms
  6. CTS-06 w/ ref: 157ms, 133ms
  7. CTS-06 w/ all: 619ms
  8. CTS-07 w/ all --1st 08 in DEV: 4700ms, 2650ms, 2450ms
  9. CTS-07 w/ ref --warmest caches: 637ms, 590ms
  10. OPT-07 w/ ref: 6100ms, 6090ms, 5985ms
  11. OPT-07 w/ all: 7860ms, 11731ms, 11400ms, 11375ms (oddly, the first attempt was faster than subsequent attempts)
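
For a rough side-by-side, the exact timings in findings 5 through 11 can be reduced to a median per configuration. The sketch below is only a reading aid, not part of the attached workspace: the dictionary keys reuse the labels from the list, the `slowdown` helper is a hypothetical name, and findings 1 through 4 are left out because they are errors or approximate values. Since caches were not cleared between queries, the ratios should be treated as indicative at best.

```python
from statistics import median

# Timings in milliseconds, copied from findings 5 through 11 above.
# Findings 1 through 4 are omitted: they are errors or approximate values.
timings_ms = {
    "OPT-06 w/ ref": [308, 125],
    "CTS-06 w/ ref": [157, 133],
    "CTS-06 w/ all": [619],
    "CTS-07 w/ all": [4700, 2650, 2450],
    "CTS-07 w/ ref": [637, 590],
    "OPT-07 w/ ref": [6100, 6090, 5985],
    "OPT-07 w/ all": [7860, 11731, 11400, 11375],
}

# Median is used rather than mean so a single cold-cache run skews less.
medians = {label: median(runs) for label, runs in timings_ms.items()}


def slowdown(optic_label, cts_label):
    """Hypothetical helper: ratio of the Optic median to the CTS median."""
    return medians[optic_label] / medians[cts_label]


for label, value in medians.items():
    print(f"{label}: median {value} ms over {len(timings_ms[label])} run(s)")

for scope in ("ref", "all"):
    ratio = slowdown(f"OPT-07 w/ {scope}", f"CTS-07 w/ {scope}")
    print(f"07 w/ {scope}: Optic median is {ratio:.1f}x the CTS median")
```

With only one to four runs per configuration, this is a summary of the numbers above rather than a benchmark.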

cc: @clarkepeterf, @prowns

brent-hartwig commented 1 month ago

#100 and #132 are related. As best elaborated in #132's description, their shared objective is to resolve name search criteria against both primary and alternative names. They are not necessarily duplicates of this ticket, as this ticket is more along the lines of engaging with Support / Engineering and goes beyond the subset of record types that populate the "reference*" fields.

Batch 3 of the CTS and Optic search API comparison is to include this ticket.