This ticket is about the data structure & OpenSearch index schema design.
Technical Requirements
The data structure should be generic enough to cover:
our evolving overall indexing strategy described in epic: #3503
all kinds of logical / physical data items
It should also store sufficient metadata to recover the origin context & retrieve the original item
Proposed Data structure & Indexing Structure
We will only use one index for storing all LLM indexing information.
We will define the following fields:
itemType:
type: keyword
possible values:
registryRecord: a registry record. It could be a dataset, distribution, organisation etc. We likely only care about dataset & distribution initially. When itemType = 'registryRecord', the recordId & aspectId fields must be present.
storageObject: indicates that the index target of this index item is a storage object (file).
In future, we could add more itemType values to support more use cases, e.g. api: we could index an API's purpose plus its OpenAPI schema so that it's available as a tool for the LLM to choose from.
recordId: optional; the registry record id of the record that we index for. Only available when itemType = registryRecord
type: keyword
aspectId: optional; the aspect id of the text field that we index on. Only available when itemType = registryRecord
type: keyword
fieldName: optional; the field name of the field that we index on. Only available when itemType = registryRecord
type: keyword
fileFormat: optional; Only available when itemType = storageObject
type: keyword
subObjectId: optional; Only available when itemType = storageObject, we need to index some non-text item, and the in-context id of this sub-item is available.
e.g. some papers might identify the first diagram as fig.1
It could also be other referenceable non-text content, e.g. a data table.
subObjectType: optional; Only available when itemType = storageObject and when we need to index some referenceable non-text item.
possible values:
diagram
chart
table
index_text_chunk:
type: keyword
Please note: it's up to the indexing strategy defined in #3503 and relevant indexing strategy tickets to define how to construct the index_text_chunk.
e.g. for a dataset's description field, it would simply be a text chunk of the original text content
e.g. for indexing a diagram in a PDF paper, you might want to include:
the short description of the diagram, often found underneath the diagram.
text chunks where the diagram is referenced in the paper.
embedding: stores the embedding of the indexed text chunk (i.e. the content of the index_text_chunk field).
type: knn_vector
dimension: 256 or 512; to be decided
only_one_index_text_chunk: indicates whether the item is indexed by a single text chunk only (true) or by more than one text chunk (false).
type: boolean
index_text_chunk_length:
type: integer
index_text_chunk_position: the start position of the text chunk within the original full-text content
type: integer
index_text_chunk_padding: the number of characters that should be cut off at the joining point of each chunk when joining more than one chunk together
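To make the proposal easier to review, the field list above could be written down roughly as follows. This is only a sketch under stated assumptions, not a final schema. The embedding dimension of 512 is a placeholder for the undecided 256 / 512 choice. The keyword type for subObjectId and subObjectType and the integer type for index_text_chunk_padding are guesses, as the ticket does not state them. The names INDEX_BODY and validate_item are hypothetical, and the validator encodes one reading of the "only available when" rules.

```python
# Assumption: placeholder value; the ticket leaves 256 vs 512 undecided.
EMBEDDING_DIMENSION = 512

# Hypothetical index body for the single LLM index described in this ticket.
INDEX_BODY = {
    # knn_vector fields require k-NN to be enabled on the index.
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "itemType": {"type": "keyword"},
            "recordId": {"type": "keyword"},
            "aspectId": {"type": "keyword"},
            "fieldName": {"type": "keyword"},
            "fileFormat": {"type": "keyword"},
            "subObjectId": {"type": "keyword"},    # type assumed, not stated in the ticket
            "subObjectType": {"type": "keyword"},  # type assumed, not stated in the ticket
            "index_text_chunk": {"type": "keyword"},
            "embedding": {"type": "knn_vector", "dimension": EMBEDDING_DIMENSION},
            "only_one_index_text_chunk": {"type": "boolean"},
            "index_text_chunk_length": {"type": "integer"},
            "index_text_chunk_position": {"type": "integer"},
            "index_text_chunk_padding": {"type": "integer"},  # type assumed
        }
    },
}

REGISTRY_ONLY_FIELDS = ("recordId", "aspectId", "fieldName")
STORAGE_ONLY_FIELDS = ("fileFormat", "subObjectId", "subObjectType")
SUB_OBJECT_TYPES = {"diagram", "chart", "table"}


def validate_item(doc):
    """Return a list of rule violations for one index document (empty list = valid)."""
    item_type = doc.get("itemType")
    errors = []
    if item_type == "registryRecord":
        # recordId & aspectId must be present for registry records.
        for field in ("recordId", "aspectId"):
            if not doc.get(field):
                errors.append(f"{field} is required when itemType is registryRecord")
        forbidden = STORAGE_ONLY_FIELDS
    elif item_type == "storageObject":
        forbidden = REGISTRY_ONLY_FIELDS
    else:
        return [f"unknown itemType: {item_type!r}"]
    # Fields tied to the other itemType should not appear on this document.
    for field in forbidden:
        if field in doc:
            errors.append(f"{field} is not available when itemType is {item_type}")
    sub_type = doc.get("subObjectType")
    if sub_type is not None and sub_type not in SUB_OBJECT_TYPES:
        errors.append(f"unknown subObjectType: {sub_type!r}")
    return errors


# Illustrative documents; all ids and values are made up.
dataset_item = {
    "itemType": "registryRecord",
    "recordId": "ds-1",
    "aspectId": "example-aspect",
    "fieldName": "description",
}
diagram_item = {
    "itemType": "storageObject",
    "fileFormat": "pdf",
    "subObjectId": "fig.1",
    "subObjectType": "diagram",
}
print(validate_item(dataset_item), validate_item(diagram_item))
```

With the opensearch-py client, a body like INDEX_BODY would typically be passed to client.indices.create(index=..., body=INDEX_BODY); the index name is deliberately left open here. Running a check like validate_item before indexing is one way to keep the single shared index consistent as new itemType values are added.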