project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

Investigate - Search Times Out #319

Open clarkepeterf opened 2 months ago

clarkepeterf commented 2 months ago

Problem Description: The following search times out:

{
    "AND": [
        {
            "created": {
                "classification": {
                    "name": "visual work"
                }
            }
        },
        {
            "occupation": {
                "name": "botanist"
            }
        }
    ]
}

Reported in https://www.bugherd.com/projects/284041/tasks/2556

Expected Behavior/Solution: Determine why this search times out and come up with potential solutions

Requirements: List of details required for the completion of the issue or requirements for the feature/bug. This can also include requirements that lie outside of the teams such as new design docs or clarification from an outside source.

Needed for promotion: If an item on the list is not needed, it should be crossed off but not removed.

- [ ] Wireframe/Mockup - Mike
- [ ] Committee discussions - Sarah
- [ ] Feasibility/Team discussion - Sarah
- [ ] Backend requirements - TBD
- [ ] Frontend requirements - TBD
- [ ] Are new regression tests required for QA - Amy
- [ ] Questions - List of questions for discussions. Answers should be documented within the issue.

UAT/LUX Examples:

Dependencies/Blocks:

- Blocked By: Issues that are blocking the completion of the current issue.
- Blocking: Issues being blocked by the completion of the current issue.

Related Github Issues:

- Issues that contain similar work but are not blocking or being blocked by the current issue.

Related links:

Wireframe/Mockup: Place wireframe/mockup for the proposed solution at end of ticket.

brent-hartwig commented 2 months ago

@clarkepeterf, I looked at this enough to want to include it in the Optic/CTS comparison. It is to become query 16. CTS findings are below, ordered from slowest to fastest. All times are in milliseconds.

| Query Description | Filtered? | First Run | Warm Min | Warm Max | Warm Avg | Std Dev | Total Items Read |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Original | Yes | 26610 | 26414 | 27519 | 26875 | 426 | 7106 |
| Original | No | 5291 | 5012 | 6330 | 5473 | 426 | 7106 |
| #113's optimizations with custom de-dup | Yes | 2175 | 2145 | 2202 | 2171 | 18 | 7106 |
| #113's optimizations with cts.search de-dup | Yes | 1946 | 1924 | 2159 | 1976 | 63 | 7106 |
| #113's optimizations with custom de-dup | No | 716 | 677 | 730 | 700 | 14 | 7106 |
| #113's optimizations with cts.search de-dup | No | 408 | 388 | 420 | 403 | 12 | 7106 |

Notes:

  1. All six queries returned 646 results. A good plug for #223.
  2. I realized #113's second optimization can return duplicates. This is because it returns the objects from the triples returned by cts.triples, and those objects can be associated with multiple subjects. I wrote a way to de-dup them, then compared the performance. For this query, cts.search is faster at de-duplicating the results. But note this search's cts.triples call only returns 1,437 results (2024-09-04 dataset).
  3. The original query took over 40 seconds the first time I ran it in a couple of environments but never again, even after running xdmp.programCacheClear. I was initially unsure why, but I have since determined that queries that include one or more calls to values functions benefit from a populated list cache. The list cache (and other group-level caches) may be cleared using xdmp.groupCacheClear.
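
The custom de-dup script is not included in this thread, so the following is only a minimal sketch of the idea behind note 2. It removes repeated object IRIs that appear when one object is associated with multiple subjects. It is plain JavaScript and calls no MarkLogic APIs. The helper name `dedupeIris` and the sample IRIs are hypothetical, not taken from the LUX code or dataset.

```javascript
// Hypothetical helper (not the script from this issue): keep the first
// occurrence of each IRI and preserve the original order.
function dedupeIris(iris) {
  const seen = new Set();
  const unique = [];
  for (const iri of iris) {
    const key = String(iri); // normalize to a string so Set comparison is by value
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(key);
    }
  }
  return unique;
}

// Made-up sample: two different subjects point at the same object IRI,
// so that object shows up twice in the list of triple objects.
const objects = [
  "https://example.org/person/1",
  "https://example.org/person/2",
  "https://example.org/person/1",
];
const uniqueObjects = dedupeIris(objects);
```

A Set-based pass like this is linear in the number of objects. Whether it would beat letting cts.search de-duplicate depends on the result size, and the timings in the table above only cover this one query.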

Scripts:

I have not yet created an Optic version.

cc: @jffcamp, @prowns, @azaroth42