Research: Performance Enhancements for Time out Issues

roamye commented 1 month ago

Problem Description: There have been several timeout errors in Advanced Search queries which only have one multiple field (see example below). This issue serves as a research ticket for all performance enhancements for timeout issues not related to #113.

Another timeout issue occurs when users select the last page of a search result (example below).

Expected Behavior/Solution: Research how to fix this timeout issue. Solution is TBD. Possible solution: increase our timeout.

Requirements: TBD

Needed for promotion: If an item on the list is not needed, it should be crossed off but not removed.

[ ] Wireframe/Mockup - Mike
[ ] Committee discussions - Sarah
[ ] Feasibility/Team discussion - Sarah
[ ] Backend requirements - TBD
[ ] Frontend requirements - TBD
[ ] Questions
List of questions for discussions. Answers should be documented within the issue.

UAT/LUX Examples:

AS query timeout
selecting last page on SS for colors: https://lux-front-tst.collections.yale.edu/view/results/objects?q=%7B%22text%22%3A%22colors%22%2C%22_lang%22%3A%22en%22%7D&sq=colors&view=list&is=anySortName%3Aasc&ip=5457

Dependencies/Blocks:

Blocked By: Issues that are blocking the completion of the current issue.
Blocking: Issues being blocked by the completion of the current issue.

Related Github Issues:

Issues that contain similar work but are not blocking or being blocked by the current issue.

Related links:

Bugherd: https://www.bugherd.com/projects/284041/tasks/2180

Wireframe/Mockup: Place wireframe/mockup for the proposed solution at end of ticket.

brent-hartwig commented 1 month ago

This ticket's scope may be too broad. It's typically better to focus on one timeout / perf issue at a time.

In the case of requesting the last page of a search's results, the frontend specifies pagination parameters that cause the current backend implementation to perform a search that has to filter page * page length results. Filtering is the process of pulling each document from disk, through the d-node and then e-node caches, to validate that it meets all search criteria. The process may end up dropping some results when the indexes alone (the unfiltered search) were unable to apply all of the criteria.

When searching for objects containing "colors", the estimate (unfiltered) comes back as 109,127. With cold caches, it took 22.5 seconds to get to the last seven. With warm caches, it still took 16.1 seconds. An unfiltered search consistently returns the same last seven results in 117 milliseconds, illustrating that this search can be accurately resolved by indexes alone and that the filtering process is adding 16 or so seconds to double-check the 109K results.
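The numbers above can be reproduced with a small cost model. This is only a sketch: the page length of 20 is an assumption inferred from `ip=5457` in the example URL and the estimate of 109,127 (it is consistent with a last page of seven results), and `filteringCost` and `msPerResult` are hypothetical helper names, not part of the LUX backend.

```javascript
// Rough cost model for deep pagination with filtered searches.
// ASSUMPTION: a page length of 20, inferred from ip=5457 and the
// estimate of 109,127; the thread does not state it.
const PAGE_LENGTH = 20;

// Upper bound on the results the backend has to filter to serve a page,
// plus how many results land on that page.
function filteringCost(page, pageLength, estimate) {
  const firstOnPage = (page - 1) * pageLength + 1;
  const toFilter = Math.min(page * pageLength, estimate);
  const onPage = Math.max(0, Math.min(pageLength, estimate - firstOnPage + 1));
  return { toFilter, onPage };
}

// Average filtering time per result, in milliseconds.
function msPerResult(seconds, resultCount) {
  return (seconds * 1000) / resultCount;
}

const lastPage = filteringCost(5457, PAGE_LENGTH, 109127);
console.log(lastPage); // { toFilter: 109127, onPage: 7 }
console.log(msPerResult(16.1, lastPage.toFilter).toFixed(3)); // 0.148
```

Under these assumptions, the warm-cache run works out to roughly 0.15 milliseconds per filtered result, so the cost comes from the number of results filtered rather than from any single slow document.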

Things we may want to discuss:

1. Switching to unfiltered searches and seeking out instances where false positives show up. On this project, there have been legitimate and illegitimate instances attributed to differences between unfiltered and filtered results. It would be helpful to isolate the legitimate instances and split that list between those that can be eliminated via index changes and those that will persist, then decide if we can switch to unfiltered searches and reap the performance benefits.
2. When the requested page is more than 50% into cts.estimate, reverse the sort and paginate/filter the other way. For more details and ideas, see https://git.yale.edu/lux-its/marklogic/issues/1010 (internal).
3. Do not offer a Last page link in the frontend. What value does it offer? Users can reverse the sort when interested in those results.
4. Is it reasonable/expected for a relatively quiet cluster of three 16 vCPU 128 GB memory nodes to take 16 seconds to filter 109K results? Below is this search's generated CTS query, specifying what the filtering process would have had to validate in each unfiltered result. I posted an internal inquiry.

cts.andQuery([
  cts.jsonPropertyValueQuery(
    'dataType',
    ['DigitalObject', 'HumanMadeObject'],
    ['exact']
  ),
  cts.orQuery([
    cts.fieldWordQuery(
      ['itemAnyText'],
      'colors',
      [
        'case-insensitive',
        'diacritic-insensitive',
        'punctuation-insensitive',
        'whitespace-insensitive',
        'stemmed',
        'wildcarded',
      ],
      1
    ),
    cts.tripleRangeQuery(
      [],
      [[lux('itemAny')]],
      fn.insertBefore(
        cts.values(
          cts.iriReference(),
          '',
          ['eager', 'concurrent'],
          cts.fieldWordQuery(
            ['referencePrimaryName'],
            'colors',
            [
              'case-insensitive',
              'diacritic-insensitive',
              'punctuation-insensitive',
              'whitespace-insensitive',
              'stemmed',
              'wildcarded',
            ],
            1
          )
        ),
        0,
        sem.iri('/does/not/exist')
      ),
      '=',
      [],
      1
    ),
  ]),
]);
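Separately from the generated query above, the idea of reversing the sort once the requested page is more than 50% into the estimate can be sketched in plain JavaScript. This is a minimal illustration, not the backend's implementation: `planPagination` is a hypothetical helper, the page length of 20 is an assumption, and the sketch treats the estimate as the exact total. Because the estimate is unfiltered, any results dropped by filtering would shift the reversed offsets, which a real implementation would have to account for.

```javascript
// Decide whether to paginate forward or from the end of the result set.
// Offsets are 1-based. ASSUMPTION: `estimate` is the exact result count.
function planPagination(page, pageLength, estimate) {
  const start = (page - 1) * pageLength + 1;           // forward offset
  const end = Math.min(page * pageLength, estimate);   // last forward position
  const length = Math.max(0, end - start + 1);

  if (start - 1 <= estimate / 2) {
    // Less than halfway in: keep the requested sort order.
    return { reversed: false, start, length, toFilter: end };
  }
  // More than halfway in: reverse the sort and count from the other end.
  // The caller must reverse the returned page to restore the display order.
  const reverseStart = estimate - end + 1;
  return {
    reversed: true,
    start: reverseStart,
    length,
    toFilter: reverseStart + length - 1,
  };
}

// Last page of the "colors" search, assuming a page length of 20.
console.log(planPagination(5457, 20, 109127));
// { reversed: true, start: 1, length: 7, toFilter: 7 }
```

With this plan, the worst case becomes a page in the middle of the result set, where about half of the results still have to be filtered.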

brent-hartwig commented 1 month ago

I received some internal feedback...

You could have the front end reverse the sort and select the first page when they click the last button

It's a valid idea, but I think no. 2 is more comprehensive and doesn't impose on backend endpoint consumers.

16 sec for 100k is 0.16 msec each doc which is pretty quick. Maybe it isn't expected that it is single threaded on the enode but, I believe that is the case for that stage of the query processing.

I confirmed the single-thread belief and submitted https://progressdataplatform.ideas.aha.io/ideas/ML-I-42.

Having said that, I am still much more in favor of option no. 1.

clarkepeterf commented 1 week ago

@brent-hartwig I agree it would be great to look at option no. 1. Many of our searches are resolvable using only indexes, so they should be able to go unfiltered. Would be interesting to see which subset of searches requires filtering and if there is a way to enable it to work unfiltered or just accept the false positives.

brent-hartwig commented 1 week ago

Would be interesting to see which subset of searches requires filtering and if there is a way to enable it to work unfiltered or just accept the false positives

Part of CTS/Optic batch no. 3 👍 . We are becoming familiar with what cannot be resolved by indexes, including punctuation and whitespace. We need a complete list, and then need to decide whether those cases are important enough to LUX to accept the performance penalty imposed by filtering and to keep us from adopting Optic. Whenever we encounter a false positive that isn't explained by the list, we need to question our index configuration and code.
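One way to hunt for false positives is to run the same search filtered and unfiltered and compare the returned document identifiers. The sketch below covers only the comparison step, in plain JavaScript; `compareResults` and the sample URIs are hypothetical, and collecting the two lists from the backend is left out.

```javascript
// Compare the IDs returned by an unfiltered search with those returned by
// the equivalent filtered search.
function compareResults(unfilteredIds, filteredIds) {
  const filtered = new Set(filteredIds);
  const unfiltered = new Set(unfilteredIds);
  return {
    // Matched by indexes alone but rejected by filtering.
    falsePositives: unfilteredIds.filter((id) => !filtered.has(id)),
    // Present only in the filtered results; not expected, so worth a look.
    missingFromUnfiltered: filteredIds.filter((id) => !unfiltered.has(id)),
  };
}

// Hypothetical sample data.
const report = compareResults(
  ["/doc/1.json", "/doc/2.json", "/doc/3.json"],
  ["/doc/1.json", "/doc/3.json"]
);
console.log(report.falsePositives); // [ '/doc/2.json' ]
```

Each false positive found this way can then be checked against the list of known causes, such as punctuation and whitespace sensitivity, before questioning the index configuration.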

project-lux / lux-marklogic

Research: Performance Enhancements for Time out Issues #136