Structuring data docs - Githubissues

cjcenizal commented 1 year ago

Background

I had a chat with @amin2718 and he dropped the 411 on me regarding how to structure data, how the structure affects how the data is indexed, and the resulting retrieval behavior.

CC @pwoznic @ofermend @justinhayes

Problem

We looked for some docs that cover this topic, and the closest I could find is https://docs.vectara.com/docs/api-reference/indexing-apis/indexing. This is problematic because:

This content is buried deep in the information tree.
It's hidden inside a page called "API Definition".
It lacks an example JSON object that describes a realistic shape.

This means this information is not discoverable and also challenging to apply.

Explanation

Note: I've left open questions in a quote block, like this.

When the user uploads a file like a PDF, Vectara naively parses it for ingestion. This means that although it will capture all of the text contained in the document, it might make parsing mistakes and it will be unable to extract any structural information. In other words, the data that ends up in the corpus will lose any meaningful relationships between the bits of data in the original file.

The benefits to munging files into a structured data format are:

The relationships of the bits of data can be preserved.
The special meaning of specific types of data is retained, such as dates and titles.
Users can query the data with filters.

Let's use this PDF as an example:

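www.techtransfer.nih.gov_tech_tab-3843.pdf - https://github.com/vectara/vectara-docs/files/12862357/www.techtransfer.nih.gov_tech_tab-3843.pdf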

Vectara offers a structured data format. I converted the above PDF into this format below. The longer bits of text have been truncated for brevity.

{
  "documentId": "TAB-3843",
  "title": "Engineered Cell-Penetrating Monoclonal Antibody for Universal Influenza Immunotherapy",
  "description": "Home » Tech » Engineered Cell-Penetrating Monoclonal Antibody for Universal Influenza Immunotherapy",
  "metadataJson": "{\"developmentStatus\":\"Pre-Clinical\",\"isAntibodiesProduct\":true,\"date\":\"2023-05-17\",\"patentSeriesCode\":63,\"patentApplicationNumber\":365841}",
  "section": [{
    "title": "body",
    "text": "Influenza remains a burden on public health..."
  }, {
    "title": "Clinical treatment",
    "text": "Clinical Treatment: CPP-mAbs against influenza NP may...",
    "metadataJson": "{\"clinicalTreatment\":\"CPP-mAbs against influenza NP may...\"}"
  }, {
    "text": "Current vaccines remain effective for a short time period..."
  }]
}

This structure is informed by three concepts:

Document
Metadata
Sections
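Because the format nests JSON inside JSON (each metadataJson value is a string that itself holds a JSON object), a quick local check can catch formatting slips such as a trailing comma before anything is uploaded. The sketch below is only a sanity check under the field names used in the example above; `check_document` is a hypothetical helper, not part of any Vectara tooling.

```python
import json


def check_document(raw: str) -> list[str]:
    """Return a list of problems found in a structured document string."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"document is not valid JSON: {err.msg}"]
    if not isinstance(doc, dict):
        return ["document is not a JSON object"]

    problems = []
    if "documentId" not in doc:
        problems.append("missing documentId")

    def check_metadata(node: dict, where: str) -> None:
        # metadataJson must itself be a JSON object encoded as a string.
        meta = node.get("metadataJson")
        if meta is None:
            return
        try:
            parsed = json.loads(meta)
        except (TypeError, json.JSONDecodeError):
            problems.append(f"{where}: metadataJson does not parse")
            return
        if not isinstance(parsed, dict):
            problems.append(f"{where}: metadataJson is not an object")

    def walk(node: dict, where: str) -> None:
        # Check this node, then recurse into any child sections.
        check_metadata(node, where)
        for i, child in enumerate(node.get("section", [])):
            walk(child, f"{where}.section[{i}]")

    walk(doc, "document")
    return problems


good = '{"documentId": "TAB-3843", "metadataJson": "{\\"date\\": \\"2023-05-17\\"}", "section": [{"text": "Influenza remains a burden on public health..."}]}'
trailing_comma = '{"documentId": "TAB-3843", "section": [{"text": "...",}]}'

print(check_document(good))
print(check_document(trailing_comma))
```

The first call reports no problems; the second reports that the document is not valid JSON, with the exact parser message depending on the Python version.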

Document

In this example, the document is identified by:

documentId
title
description

This is the high-level information that is encoded into vectors when Vectara ingests the document. This information can be retrieved using semantic, keyword-based, and hybrid search.

Question: Can documentId, title, and description all be used in filters? Can they be used to semantically match queries? What's the point of description -- why not just smush everything into title? Is it perhaps because splitting it up into title and description is more useful for a human who's reviewing the document?

Metadata

The document has metadata attached to it via the metadataJson property. This property expects to be assigned a stringified JSON object. This object consists of arbitrary key-value pairs, accepting text, numeric, and boolean values.
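Writing the escaped string by hand is error-prone, so one option is to build the metadata as an ordinary object and stringify it in code. This is a minimal sketch using Python's standard `json` module and the values from the example above; it only produces the document text and does not call any Vectara API.

```python
import json

# Metadata is written as an ordinary dict first...
metadata = {
    "developmentStatus": "Pre-Clinical",
    "isAntibodiesProduct": True,
    "date": "2023-05-17",
    "patentSeriesCode": 63,
    "patentApplicationNumber": 365841,
}

document = {
    "documentId": "TAB-3843",
    "title": "Engineered Cell-Penetrating Monoclonal Antibody for Universal Influenza Immunotherapy",
    # ...and stringified once, so the value of metadataJson is a string.
    "metadataJson": json.dumps(metadata),
    "section": [
        {"title": "body", "text": "Influenza remains a burden on public health..."},
    ],
}

# Serializing the whole document escapes the inner quotes automatically.
payload = json.dumps(document, indent=2)
print(payload)

# Round trip: the inner string decodes back to the original dict.
assert json.loads(json.loads(payload)["metadataJson"]) == metadata
```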

Question: Does this object accept text, numeric, and boolean values? Does it accept any other types of data?

In this example, we've picked various properties out of the original PDF that will be useful for:

Filtering
Cross-referencing a document with other data sources
Comparing and grouping results

Question: I made up "comparing and grouping results". Is that accurate? How would one go about doing that? Are there other use cases for metadata?

We defined these particular metadata properties on the document level instead of the section level (see below) because we intend to use them to retrieve the entire document and not just a part of it.

Question: "we intend to use them to retrieve the entire document and not just a part of it" -- is that accurate?

developmentStatus: This tells us the status of the patent, e.g. pre-clinical.
isAntibodiesProduct: This tells us whether the patent applies to antibodies-related products, which is the domain we care about in this contrived example.
date: The date this document was created.
patentSeriesCode: The patent series code number.
patentApplicationNumber: The patent application number.
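To make the filtering and grouping use cases concrete, here is a local sketch that applies them to parsed metadata in plain Python. It only illustrates the idea: it is not Vectara's filter syntax or query API, and the second document (`TAB-0000`) and the helper `metadata_of` are made up for the illustration.

```python
import json
from collections import defaultdict

# Two documents in the structured format; TAB-0000 is invented for contrast.
documents = [
    {
        "documentId": "TAB-3843",
        "metadataJson": json.dumps(
            {"developmentStatus": "Pre-Clinical", "isAntibodiesProduct": True}
        ),
    },
    {
        "documentId": "TAB-0000",
        "metadataJson": json.dumps(
            {"developmentStatus": "Pre-Clinical", "isAntibodiesProduct": False}
        ),
    },
]


def metadata_of(doc: dict) -> dict:
    """Decode the stringified metadata attached to a document."""
    return json.loads(doc.get("metadataJson", "{}"))


# Filtering: keep only documents about antibodies-related products.
antibody_ids = [
    d["documentId"] for d in documents if metadata_of(d).get("isAntibodiesProduct")
]

# Grouping: bucket document IDs by development status.
by_status = defaultdict(list)
for d in documents:
    by_status[metadata_of(d).get("developmentStatus")].append(d["documentId"])

print(antibody_ids)
print(dict(by_status))
```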

Metadata can also be attached to sections, covered next.

Sections

Sections are an organizational unit for grouping related bodies of text. A section is defined by:

text: The body of text.
title: An optional name for identifying the body of text. Similar to a heading in a document.
metadataJson: An optional stringified JSON object, which can be configured as flexibly as the root-level document metadata.
section: An optional array of child sections, using the same key as the document root in the example above. Those sections can, in turn, have their own child sections.

When Vectara ingests a document, it will split the text in these sections into chunks and encode them in vectors. This enables queries to retrieve them based on semantic similarity.
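To show what nesting looks like in practice, the sketch below walks a document's sections depth first and lists each body of text with the titles above it. It assumes child sections live under the same `section` key used at the root of the example, and the nesting of the last section is invented to demonstrate recursion. It says nothing about how Vectara actually chunks text; `iter_section_text` is a hypothetical helper.

```python
from typing import Iterator, Tuple


def iter_section_text(
    sections: list, path: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], str]]:
    """Yield (title path, text) for every section, depth first."""
    for section in sections:
        # Untitled sections keep their parent's title path.
        here = path + ((section["title"],) if "title" in section else ())
        if "text" in section:
            yield here, section["text"]
        # Recurse into child sections, if any.
        yield from iter_section_text(section.get("section", []), here)


document = {
    "documentId": "TAB-3843",
    "section": [
        {"title": "body", "text": "Influenza remains a burden on public health..."},
        {
            "title": "Clinical treatment",
            "text": "Clinical Treatment: CPP-mAbs against influenza NP may...",
            "section": [
                {"text": "Current vaccines remain effective for a short time period..."}
            ],
        },
    ],
}

for titles, text in iter_section_text(document["section"]):
    print(" > ".join(titles) or "(untitled)", "|", text)
```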

Question: Why organize content into multiple/nested sections instead of smushing everything into a single section, or even into the description property? Is there a functional or behavioral difference? Or is it simply to help the human who might be manually reviewing a document?

cjcenizal commented 1 year ago

CC @tallatshafaat

amin2718 commented 1 year ago

The other page that's relevant here, and that I was having trouble finding the other day (I actually found it via Google Search :-) ), is this one:

Format JSON and Pbtext Files - https://docs.vectara.com/docs/api-reference/indexing-apis/file-upload/format-for-upload

It includes a concrete example of formatting a Shakespeare play.

On Wed, Oct 11, 2023 at 11:43 AM CJ Cenizal @.***> wrote:


ofermend commented 1 year ago

For someone coming in without a PB (protobuf) background, the title here, "Format JSON and PBText files", may be a bit confusing. People know what JSON is, but many don't know what PBText is.

"Alternatively, you may perform the text extraction yourself, and save the result as a JSON or text serialized Document proto. The benefit of this approach is that you can attach your own metadata to the document, or to individual sections within it."

So a developer unfamiliar with PB may be confused again: what is a "text serialized Document proto"? Is that the same as PBText? Now I have to go and learn what that is.

It would be great to give a bit more context on the two options in our own words, or at least explain them better.

cjcenizal commented 1 year ago

We can also create another section or page about "special metadata".

Special metadata

Vectara Console recognizes a few special metadata fields that have proven useful across many use cases.

date

If you define date in the document's metadata, it will appear in the Console Corpus Search interface. This can be useful for tracking the recency of a document, since older docs can lose relevance in some scenarios.

url

If you define url in the document's metadata, it will appear in the Console Corpus Search interface as a clickable link. This can be useful for enabling users to click through to the document's original resource, e.g. a web page or downloadable artifact.

ts_create

If you define ts_create as a creation date in epoch seconds, it will appear in the Console Corpus Search interface as the document's date of creation.

author

If you define author as either a string or an array of strings, these values will appear in the Console Corpus Search interface as the document's author(s).
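Putting the four special fields together, a document's metadata could be assembled like this. It is a sketch under assumptions: the URL and author names are placeholders, the date string simply reuses the format from the earlier example, and UTC is assumed for the epoch conversion since the description above only says "epoch seconds".

```python
import json
from datetime import datetime, timezone

# ts_create is described above as epoch seconds; UTC is assumed here.
created = datetime(2023, 5, 17, tzinfo=timezone.utc)

special_metadata = {
    "date": "2023-05-17",
    "url": "https://example.com/tech/tab-3843",  # placeholder URL
    "ts_create": int(created.timestamp()),
    "author": ["Jane Doe", "John Roe"],  # placeholder names
}

# Stringify for use as the document's metadataJson value.
metadata_json = json.dumps(special_metadata)
print(metadata_json)
```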

vectara / vectara-docs

Structuring data docs #108
