project-lux / lux-marklogic

Identify unused indexes #127

Closed brent-hartwig closed 1 month ago

brent-hartwig commented 1 month ago

In the context of CPU utilization sustaining 85% to 90% while loading data into a cluster that did not have forest-level replication enabled, Rob questioned the degree to which indexes contributed to CPU needs, which got us wondering whether there are some indexes we could stop populating.

indexComparisonChecks.js is an existing script that partially checks for missing and unused indexes by looking for index references and comparing to the deployed index configuration.

As part of the scope of this ticket:

  1. Update the script to account for missing auto complete index references (those specified by the namesIndexReference property).
  2. Since auto complete is not yet being used, add the ability to have the script include or exclude all of auto complete's index references.
  3. Document the script's limitations (references it does not take into account).
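
For readers unfamiliar with the approach, the comparison and the include/exclude switch from items 1 and 2 can be sketched roughly as follows. This is a minimal illustration in plain JavaScript, not the contents of indexComparisonChecks.js: the function names, the `autoComplete` flag on each reference, and the sample data are all assumptions made for the example.

```javascript
// Illustrative only: the names and shapes below are assumptions, not the
// contents of indexComparisonChecks.js.

// Names in `a` that are absent from `b`, de-duplicated and sorted.
function difference(a, b) {
  const exclude = new Set(b);
  return [...new Set(a)].filter((name) => !exclude.has(name)).sort();
}

// references: [{ name, kind: "fields" | "fieldRanges", autoComplete: boolean }]
// deployed:   { fields: string[], fieldRanges: string[] }
function compareIndexes(references, deployed, includeAutoComplete) {
  // Drop auto complete's references as a group when they are excluded.
  const inScope = references.filter(
    (ref) => includeAutoComplete || !ref.autoComplete
  );
  const report = { missing: {}, unused: {} };
  for (const kind of ["fields", "fieldRanges"]) {
    const used = inScope
      .filter((ref) => ref.kind === kind)
      .map((ref) => ref.name);
    // Referenced by code but not deployed.
    report.missing[kind] = difference(used, deployed[kind]);
    // Deployed but not referenced by code.
    report.unused[kind] = difference(deployed[kind], used);
  }
  return report;
}

// Made-up sample data.
const references = [
  { name: "alpha", kind: "fields", autoComplete: false },
  { name: "beta", kind: "fieldRanges", autoComplete: true },
  { name: "gamma", kind: "fieldRanges", autoComplete: true },
];
const deployed = { fields: ["alpha", "beta"], fieldRanges: ["beta"] };

console.log(JSON.stringify(compareIndexes(references, deployed, true), null, 2));
console.log(JSON.stringify(compareIndexes(references, deployed, false), null, 2));
```

With the sample data, excluding auto complete moves the made-up "beta" field range from referenced to unused and removes "gamma" from the missing list, which is the same kind of shift expected between the two modes.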

brent-hartwig commented 1 month ago

This script has been updated to account for all of auto complete's index references.

The script was run against release1.15 in two modes.

Mode: with Auto Complete

{
  "missing":{
    "fields":[

    ],
    "fieldRanges":[
      "eventName",
      "eventPrimaryName"
    ]
  },
  "unused":{
    "fields":[
      "isCollectionItemBoolean",
      "itemName",
      "languageIdentifier",
      "placeSpatial",
      "referenceName",
      "setUsedForId",
      "workName"
    ],
    "fieldRanges":[
      "isCollectionItemBoolean",
      "languageIdentifier",
      "referenceName",
      "referencePrimaryName",
      "setPrimaryName",
      "setUsedForId"
    ]
  }
}

Mode: without Auto Complete

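As may be expected, the list of unused indexes is longer. Further, because the eventName and eventPrimaryName field range indexes are no longer reported as missing, we can infer that event*Name is only referenced by auto complete.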

{
  "missing":{
    "fields":[

    ],
    "fieldRanges":[

    ]
  },
  "unused":{
    "fields":[
      "agentName",
      "conceptName",
      "eventName",
      "isCollectionItemBoolean",
      "itemName",
      "languageIdentifier",
      "placeName",
      "placeSpatial",
      "referenceName",
      "setUsedForId",
      "workName"
    ],
    "fieldRanges":[
      "agentName",
      "agentPrimaryName",
      "conceptName",
      "conceptPrimaryName",
      "isCollectionItemBoolean",
      "languageIdentifier",
      "placeName",
      "placePrimaryName",
      "referenceName",
      "referencePrimaryName",
      "setPrimaryName",
      "setUsedForId"
    ]
  }
}
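
To see which indexes become unused only when auto complete is excluded, the two "unused" lists above can be diffed. The sketch below is plain JavaScript with the lists copied from the two runs; the variable and function names are made up for the example.

```javascript
// "unused" lists copied from the two runs above.
const unusedWithAutoComplete = {
  fields: [
    "isCollectionItemBoolean", "itemName", "languageIdentifier",
    "placeSpatial", "referenceName", "setUsedForId", "workName",
  ],
  fieldRanges: [
    "isCollectionItemBoolean", "languageIdentifier", "referenceName",
    "referencePrimaryName", "setPrimaryName", "setUsedForId",
  ],
};
const unusedWithoutAutoComplete = {
  fields: [
    "agentName", "conceptName", "eventName", "isCollectionItemBoolean",
    "itemName", "languageIdentifier", "placeName", "placeSpatial",
    "referenceName", "setUsedForId", "workName",
  ],
  fieldRanges: [
    "agentName", "agentPrimaryName", "conceptName", "conceptPrimaryName",
    "isCollectionItemBoolean", "languageIdentifier", "placeName",
    "placePrimaryName", "referenceName", "referencePrimaryName",
    "setPrimaryName", "setUsedForId",
  ],
};

// Names present in `second` but not in `first`, in `second`'s order.
function onlyInSecond(first, second) {
  const seen = new Set(first);
  return second.filter((name) => !seen.has(name));
}

// Indexes that are unused only when auto complete's references are excluded.
const unusedOnlyWithoutAutoComplete = {
  fields: onlyInSecond(
    unusedWithAutoComplete.fields,
    unusedWithoutAutoComplete.fields
  ),
  fieldRanges: onlyInSecond(
    unusedWithAutoComplete.fieldRanges,
    unusedWithoutAutoComplete.fieldRanges
  ),
};

console.log(JSON.stringify(unusedOnlyWithoutAutoComplete, null, 2));
```

The eventName and eventPrimaryName field range indexes do not appear in this difference: they are listed under "missing" in the with-auto-complete run and are not listed as unused in either run.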

brent-hartwig commented 1 month ago

Script Limitations

Each field and field range index listed as missing or unused was checked. Results follow.

Note: some uses require just fields, while others require field range indexes. You cannot configure a field range index without also configuring a field; however, you can configure a field without a field range index.

Note: search terms configured to the Indexed Word and Hop with Field search patterns are also configured to fields. These two search pattern implementations only require fields, not field range indexes.

  1. Not used at all:
    • The isCollectionItemBoolean field and field range index. Remnant of the OW Mashup effort.
    • The placeSpatial field. (A field range index is not presently configured for it.)
    • The referenceName field range index. (We may start using the field; see below.)
    • The referencePrimaryName field range index.
    • The setPrimaryName field range index.
    • The setUsedForId field and field range index.
  2. Only referenced by auto complete:
    • The agentName field and field range index.
    • The conceptName field and field range index.
    • The eventName field and field range index.
    • The eventPrimaryName field range index. (The associated field is required by search terms.)
    • The placeName field and field range index.
  3. Gradle.properties:
    • We may start using the referenceName field for keyword search (in the related docs).
  4. The Similar feature's configuration retrieves XPath expressions from the database index configuration. It doesn't use the associated fields, just the configuration.
    • The itemName field.
    • In the same vein, I could see Similar's configuration starting to use the workName field, but it does not presently.
  5. The languageIdentifier field range index is required when generating the language data constants.
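
As a cross-check of the list above, the "unused" output from the with-auto-complete run can be filtered by the references the script cannot see (items 3 through 5). The sketch below is plain JavaScript; the exceptions table is a hypothetical structure built from those items, not something the script contains, and workName is held back only because of the possible future Similar dependency noted in item 4.

```javascript
// "unused" lists copied from the with-auto-complete run.
const unused = {
  fields: [
    "isCollectionItemBoolean", "itemName", "languageIdentifier",
    "placeSpatial", "referenceName", "setUsedForId", "workName",
  ],
  fieldRanges: [
    "isCollectionItemBoolean", "languageIdentifier", "referenceName",
    "referencePrimaryName", "setPrimaryName", "setUsedForId",
  ],
};

// Hypothetical table of references the script does not detect, with the
// reason each index is kept. Built from items 3-5 of the list above.
const exceptions = {
  fields: {
    referenceName: "may be used for keyword search",
    itemName: "Similar reads XPath expressions from its configuration",
    workName: "Similar could start reading its configuration",
    languageIdentifier: "a field range index requires its field",
  },
  fieldRanges: {
    languageIdentifier: "needed to generate the language data constants",
  },
};

// Keep only names that have no entry in the exceptions table.
function withoutExceptions(names, reasons) {
  return names.filter(
    (name) => !Object.prototype.hasOwnProperty.call(reasons, name)
  );
}

const notUsedAtAll = {
  fields: withoutExceptions(unused.fields, exceptions.fields),
  fieldRanges: withoutExceptions(unused.fieldRanges, exceptions.fieldRanges),
};

console.log(JSON.stringify(notUsedAtAll, null, 2));
```

Under these assumptions, what remains lines up with item 1: three fields and five field range indexes that nothing references.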

The top of the script now lists what the script doesn't cover:

/*
 * *Partially* checks for:
 *
 *   1. Fields and field range indexes the code is dependent on but not defined by the database.
 *   2. Fields and field range indexes the code is not dependent on but are defined by the database.
 *
 * Known limitations:
 *
 *   1. Excludes value of the fullTextSearchRelatedFieldName property (for keyword search).
 *   2. Excludes indexes referenced in data constant configuration files, specifically the call to
 *      cts.fieldReference('languageIdentifier') from within languages.mjs.
 *   3. Excludes references in the configuration of Similar search terms.  Technically, Similar does
 *      not use the indexes but instead snags XPath expressions from the index configuration; thus,
 *      unless/until Similar can get the XPaths elsewhere, the associated indexes are required.
 *
 * Individual field and field range index settings are not checked.
 */

brent-hartwig commented 1 month ago

Script improvements were made in PR https://github.com/project-lux/lux-marklogic/pull/128, which was merged into the release1.16 branch. Closing this ticket. If we want to remove some indexes, let's track that in a new ticket.