LLM Indexing Strategy: Generic Data Structure & OpenSearch Index Schema Design

Description

This ticket is about the data structure & OpenSearch index schema design.

Technical Requirements

The data structure should be generic enough to:
- support our evolving overall indexing strategy described in epic #3503
- cover all kinds of logical / physical data items
- store sufficient metadata to recover the original context & retrieve the original item

Proposed Data Structure & Indexing Structure

We will use a single index to store all LLM indexing information.

We will define the following fields:

itemType:
- type: keyword
- possible values:
  - registryRecord: a registry record. It could be a dataset, distribution, organisation etc. We likely only care about datasets & distributions initially. When itemType = 'registryRecord', the recordId & aspectId fields must be present.
  - storageObject: indicates that the index target of this index item is a storage object (file).
- In future, we could add more itemType values to support more use cases, e.g. api: we could index an API's purpose plus its OpenAPI schema so that it's available as a tool for the LLM to choose from.
recordId: optional; the registry record id of the record that we index for. Only available when itemType = registryRecord
- type: keyword
aspectId: optional; the aspect id of the text field that we index on. Only available when itemType = registryRecord
- type: keyword
fieldName: optional; the field name of the field that we index on. Only available when itemType = registryRecord
- type: keyword
fileFormat: optional; Only available when itemType = storageObject
- type: keyword
subObjectId: optional; Only available when itemType = storageObject, when we need to index some non-text item, and when the in-context id of this sub-item is available.
- e.g. some papers might identify the first diagram as fig.1
- It could also be other referenceable non-text content, e.g. a data table.
subObjectType: optional; Only available when itemType = storageObject and when we need to index some referenceable non-text item.
- possible values:
  - diagram
  - chart
  - table
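
The conditional rules above (which fields accompany which itemType) could be checked before an item is written to the index. The following is a minimal sketch of such a check, not a prescribed implementation: the function name validate_index_item and the sample ids are hypothetical, and only the rule stated explicitly in this ticket (recordId & aspectId must be present for registryRecord) is treated as mandatory.

```python
from typing import List

ITEM_TYPES = {"registryRecord", "storageObject"}
SUB_OBJECT_TYPES = {"diagram", "chart", "table"}

# Fields that are only available for one itemType, per the definitions above.
FIELDS_BY_ITEM_TYPE = {
    "registryRecord": {"recordId", "aspectId", "fieldName"},
    "storageObject": {"fileFormat", "subObjectId", "subObjectType"},
}

# Only the explicitly stated "must be present" rule is enforced here.
REQUIRED_BY_ITEM_TYPE = {
    "registryRecord": {"recordId", "aspectId"},
    "storageObject": set(),
}


def validate_index_item(item: dict) -> List[str]:
    """Return a list of rule violations; an empty list means the item passes."""
    item_type = item.get("itemType")
    if item_type not in ITEM_TYPES:
        return [f"unknown itemType: {item_type!r}"]

    errors = []
    # Required fields for this itemType must be present and non-empty.
    for field in sorted(REQUIRED_BY_ITEM_TYPE[item_type]):
        if not item.get(field):
            errors.append(f"{field} is required when itemType = {item_type}")

    # Fields that belong to a different itemType must not appear.
    for other_type, fields in FIELDS_BY_ITEM_TYPE.items():
        if other_type == item_type:
            continue
        for field in sorted(fields):
            if field in item:
                errors.append(
                    f"{field} is only available when itemType = {other_type}"
                )

    sub_type = item.get("subObjectType")
    if sub_type is not None and sub_type not in SUB_OBJECT_TYPES:
        errors.append(f"unknown subObjectType: {sub_type!r}")
    return errors


# Hypothetical sample item: a chunk of a dataset's description field.
sample = {
    "itemType": "registryRecord",
    "recordId": "ds-example-1",
    "aspectId": "dcat-dataset-strings",
    "fieldName": "description",
}
violations = validate_index_item(sample)
```

Returning a list of messages rather than raising on the first problem lets an indexer log every violation for an item in one pass.
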
index_text_chunk:
- type: keyword
- Please note: it's up to the indexing strategy defined in #3503 and relevant indexing strategy tickets to define how to construct the index_text_chunk.
- e.g. for a dataset's description field, it would simply be a text chunk of the original text content
- e.g. for indexing a diagram in a PDF paper, you might want to include:
  - the short description of the diagram, often found underneath the diagram
  - text chunks where the diagram is referenced in the paper.
embedding: stores the embedding of the indexed text chunk (i.e. the content of the index_text_chunk field).
- type: knn_vector
- dimension: 256 or 512 (to be decided)
only_one_index_text_chunk: indicates whether the item is indexed by a single text chunk only (i.e. false when the item is indexed by more than one text chunk).
- type: boolean
index_text_chunk_length:
- type: integer
index_text_chunk_position: the start position of the text chunk within the original full-text content
- type: integer
index_text_chunk_padding: the number of characters that should be cut off at the joining point of each chunk when joining more than one chunk together
- type: integer
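
Taken together, the fields above could translate into an index definition along the following lines. This is a minimal sketch rather than a final schema. The function name build_llm_index_body is hypothetical, the keyword type for fieldName-style identifiers follows the proposal, the keyword type for subObjectId & subObjectType is an assumption (the proposal does not state their types), and the default dimension of 512 is only a placeholder because the dimension is still to be decided. The result is a plain dict in the shape accepted by the OpenSearch create-index API.

```python
import json


def build_llm_index_body(embedding_dimension: int = 512) -> dict:
    """Build a create-index request body for the single LLM index (sketch)."""

    def keyword() -> dict:
        return {"type": "keyword"}

    def integer() -> dict:
        return {"type": "integer"}

    return {
        # k-NN has to be enabled on the index to search knn_vector fields.
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "itemType": keyword(),
                "recordId": keyword(),
                "aspectId": keyword(),
                "fieldName": keyword(),
                "fileFormat": keyword(),
                # Assumption: types for these two fields are not given above.
                "subObjectId": keyword(),
                "subObjectType": keyword(),
                "index_text_chunk": keyword(),
                "embedding": {
                    "type": "knn_vector",
                    # Placeholder: 256 or 512, to be decided.
                    "dimension": embedding_dimension,
                },
                "only_one_index_text_chunk": {"type": "boolean"},
                "index_text_chunk_length": integer(),
                "index_text_chunk_position": integer(),
                "index_text_chunk_padding": integer(),
            }
        },
    }


index_body = build_llm_index_body()
# Serialise to JSON, e.g. to send as the body of a create-index request.
index_body_json = json.dumps(index_body, indent=2)
```

Keeping the dimension as a parameter means the 256 vs 512 decision can be changed in one place once the embedding model is chosen.
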

magda-io / magda

LLM Indexing Strategy: Generic Data structure & Opensearch Index Schema Design #3536
