openai / chatgpt-retrieval-plugin

The ChatGPT Retrieval Plugin lets you easily find personal or work documents by asking questions in natural language.
MIT License
21.01k stars, 3.68k forks

chunking method for "text" and "metadata" (using Pinecone) #283

Open justanotherlad opened 1 year ago

justanotherlad commented 1 year ago

The JSON file that I'm /upsert'ing contains data like the following:

{
        "id": "Series",
        "text": " Series[f,x,Subscript[x, 0],n]  generates a power series expansion for f about the point x=Subscript[x, 0] to order (x-Subscript[x, 0])^n, where n is an explicit integer. Series[f,x->Subscript[x, 0]] generates the leading term of a power series expansion for f about the point x=Subscript[x, 0]. Series[f,x,Subscript[x, 0],Subscript[n, x],y,Subscript[y, 0],Subscript[n, y],…] successively finds series expansions with respect to x, then y, etc.Some related keywords are approximate formulas, approximation of functions, approximations, asymptotic expansions, series expansions, Taylor polynomial, Maclaurin series, Taylor series, power series, Laurent series, Puiseux series, asympt, laurent, mtaylor, order, powcreate, powexp, powlog, powpoly, powser, powsolve, series, taylor, series, Taylor series. ",
        "metadata": {
            "notes": "Series can construct standard Taylor series, as well as certain expansions involving negative powers, fractional powers, and logarithms. Series detects certain essential singularities. On[Series::esss] makes Series generate a message in this case. Series can expand about the point x=∞. Series[f,{x,0,n}] constructs Taylor series for any function f according to the formula f(0)+f^′ (0)x+f^′′ (0)x^2/2+… f^(n) (0)x^n/n!. Series effectively evaluates partial derivatives using D. It assumes that different variables are independent. The result of Series is usually a SeriesData object, which you can manipulate with other functions. Normal[series] truncates a power series and converts it to a normal expression. SeriesCoefficient[series,n] finds the coefficient of the n^th-order term. The following options can be given: Analytic FEPrivate`ImportImage[FrontEnd`FileName[{Documentation, Miscellaneous}, ExampleJumpLink.png]] True whether to treat unrecognized functions as analytic Assumptions FEPrivate`ImportImage[FrontEnd`FileName[{Documentation, Miscellaneous}, ExampleJumpLink.png]] $Assumptions assumptions to make about parameters SeriesTermGoal Automatic number of terms in the approximation"
        }
    }

I'm using Pinecone vectorDB.

However, when I /query something like "Laurent expansion", it returns the following:

"query": "Laurent expansion",
      "results": [
        {
          "id": "Series_1",
          "text": "related keywords are approximate formulas, approximation of functions, approximations, asymptotic expansions, series expansions, Taylor polynomial, Maclaurin series, Taylor series, power series, Laurent series, Puiseux series, asympt, laurent, mtaylor, order, powcreate, powexp, powlog, powpoly, powser, powsolve, series, taylor, series, Taylor series.",
          "metadata": {
            "notes": "Series can construct standard Taylor series, as well as certain expansions involving negative powers, fractional powers, and logarithms. Series detects certain essential singularities. On[Series::esss] makes Series generate a message in this case. Series can expand about the point x=∞. Series[f,{x,0,n}] constructs Taylor series for any function f according to the formula f(0)+f^′ (0)x+f^′′ (0)x^2/2+… f^(n) (0)x^n/n!. Series effectively evaluates partial derivatives using D. It assumes that different variables are independent. The result of Series is usually a SeriesData object, which you can manipulate with other functions. Normal[series] truncates a power series and converts it to a normal expression. SeriesCoefficient[series,n] finds the coefficient of the n^th-order term. The following options can be given: Analytic FEPrivate`ImportImage[FrontEnd`FileName[{Documentation, Miscellaneous}, ExampleJumpLink.png]] True whether to treat unrecognized functions as analytic Assumptions FEPrivate`ImportImage[FrontEnd`FileName[{Documentation, Miscellaneous}, ExampleJumpLink.png]] $Assumptions assumptions to make about parameters SeriesTermGoal Automatic number of terms in the approximation",
            "document_id": "Series"
          },
          "embedding": null,
          "score": 0.779730856
        }

Note: part of the "text" field is missing from the result, although the part that is returned (the last sentence of "text") is the most relevant one. The entire "metadata", on the other hand, is returned without any chunking.

I want to know how the "text" and "metadata" fields are chunked when using Pinecone. If the "metadata" fields are not chunked at all, how is information retrieved from them? Is there a separate string-search mechanism for "metadata"? AFAIK the embeddings are generated only from the "text" field and not from "metadata", yet /query can still return contextual information that is present only in "metadata" and not in "text".
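
To make the question concrete: the /query output above (a chunk id of "Series_1", only part of "text", and the full "metadata" plus a "document_id") would be consistent with a scheme like the sketch below. This is a guess at the behaviour, not the plugin's actual code. The names chunk_document, Chunk and max_words are hypothetical, and splitting by word count is an assumption made only to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chunk:
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_document(doc: dict, max_words: int = 40) -> List[Chunk]:
    """Split doc["text"] into pieces of at most max_words words.

    Hypothetical scheme: only "text" is split; "metadata" is copied
    unchanged onto every chunk, with the parent id added as "document_id".
    """
    words = doc["text"].split()
    chunks = []
    for index, start in enumerate(range(0, len(words), max_words)):
        piece = " ".join(words[start:start + max_words])
        metadata = dict(doc.get("metadata", {}))  # copy, never split
        metadata["document_id"] = doc["id"]
        chunks.append(Chunk(id=f"{doc['id']}_{index}", text=piece, metadata=metadata))
    return chunks


doc = {
    "id": "Series",
    "text": "word " * 100,  # placeholder text, 100 words long
    "metadata": {"notes": "Series can construct standard Taylor series."},
}
chunks = chunk_document(doc)
for chunk in chunks:
    print(chunk.id, len(chunk.text.split()), chunk.metadata["document_id"])
```

Under this assumption, a query that matches the second chunk would return an id ending in "_1", a fragment of "text", and the complete "metadata", which is the shape of the result shown above.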

justanotherlad commented 1 year ago

To provide some more details and proof of what I'm saying,

1) I initially added the "keywords" as one of the "metadata" fields in the file I was /upsert'ing, but when retrieving the most relevant documents it had no idea what the "metadata" fields contained, and as a result it returned a bunch of irrelevant results as the first few:

Screenshot from 2023-05-26 02-52-50

However, as soon as I added the "keywords" to the "text" section instead, it got picked up:

Screenshot from 2023-05-26 02-56-58

This proves that the embeddings are generated only from the "text" field, and that's what is used to retrieve the most relevant documents.
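
The effect can be reproduced with a toy model. The sketch below is not the plugin or Pinecone: embed is a hypothetical stand-in that just counts words, and score is a hypothetical helper that, by assumption, only ever reads the "text" field. It shows why keywords placed in "metadata" cannot influence the ranking, while the same keywords placed in "text" do.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score(query: str, chunk: dict) -> float:
    # Assumption: only chunk["text"] is embedded; chunk["metadata"] is never read.
    return cosine(embed(query), embed(chunk["text"]))


keywords_in_metadata = {
    "text": "generates a power series expansion",
    "metadata": {"keywords": "Laurent series, Puiseux series"},
}
keywords_in_text = {
    "text": "generates a power series expansion. Keywords: Laurent series, Puiseux series",
    "metadata": {},
}

query = "Laurent expansion"
print(score(query, keywords_in_metadata))
print(score(query, keywords_in_text))
```

In this toy setup the second score is higher than the first, because "Laurent" only contributes when it appears in "text". Changing or deleting the "keywords" entry in "metadata" leaves the first score unchanged.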

2) However, when I search/ask for something that's not present in the "text" field at all, but only in the "metadata", the LLM can still give contextual answers from the "metadata":

Screenshot from 2023-05-25 22-04-25 Screenshot from 2023-05-25 22-06-51 Screenshot from 2023-05-25 22-06-57

This makes me wonder: how is the LLM searching through the "metadata" sections?