pinecone-io / examples

Jupyter Notebooks to help you get hands-on with Pinecone vector databases
MIT License

chunking system for long upserted files #197

Closed: justanotherlad closed this issue 1 year ago

justanotherlad commented 1 year ago

I have /upsert'ed a document (JSON) using chatgpt-retrieval-plugin that looks like the following:

{
        "id": "Series",
        "text": " Series[f,x,Subscript[x, 0],n]  generates a power series expansion for f about the point x=Subscript[x, 0] to order (x-Subscript[x, 0])^n, where n is an explicit integer. Series[f,x->Subscript[x, 0]] generates the leading term of a power series expansion for f about the point x=Subscript[x, 0]. Series[f,x,Subscript[x, 0],Subscript[n, x],y,Subscript[y, 0],Subscript[n, y],…] successively finds series expansions with respect to x, then y, etc.Some related keywords are approximate formulas, approximation of functions, approximations, asymptotic expansions, series expansions, Taylor polynomial, Maclaurin series, Taylor series, power series, Laurent series, Puiseux series, asympt, laurent, mtaylor, order, powcreate, powexp, powlog, powpoly, powser, powsolve, series, taylor, series, Taylor series. ",
        "metadata": {
            "notes": "Series can construct standard Taylor series, as well as certain expansions involving negative powers, fractional powers, and logarithms. Series detects certain essential singularities. On[Series::esss] makes Series generate a message in this case. Series can expand about the point x=∞. Series[f,{x,0,n}] constructs Taylor series for any function f according to the formula f(0)+f^′ (0)x+f^′′ (0)x^2/2+… f^(n) (0)x^n/n!. Series effectively evaluates partial derivatives using D. It assumes that different variables are independent. The result of Series is usually a SeriesData object, which you can manipulate with other functions. Normal[series] truncates a power series and converts it to a normal expression. SeriesCoefficient[series,n] finds the coefficient of the n^th-order term. The following options can be given: Analytic FEPrivate`ImportImage[FrontEnd`FileName[{Documentation, Miscellaneous}, ExampleJumpLink.png]] True whether to treat unrecognized functions as analytic Assumptions FEPrivate`ImportImage[FrontEnd`FileName[{Documentation, Miscellaneous}, ExampleJumpLink.png]] $Assumptions assumptions to make about parameters SeriesTermGoal Automatic number of terms in the approximation"
        }
    }

However, when I /query, it returns something like this:

"query": "Laurent expansion",
      "results": [
        {
          "id": "Series_1",
          "text": "related keywords are approximate formulas, approximation of functions, approximations, asymptotic expansions, series expansions, Taylor polynomial, Maclaurin series, Taylor series, power series, Laurent series, Puiseux series, asympt, laurent, mtaylor, order, powcreate, powexp, powlog, powpoly, powser, powsolve, series, taylor, series, Taylor series.",
          "metadata": {
            "notes": "Series can construct standard Taylor series, as well as certain expansions involving negative powers, fractional powers, and logarithms. Series detects certain essential singularities. On[Series::esss] makes Series generate a message in this case. Series can expand about the point x=∞. Series[f,{x,0,n}] constructs Taylor series for any function f according to the formula f(0)+f^′ (0)x+f^′′ (0)x^2/2+… f^(n) (0)x^n/n!. Series effectively evaluates partial derivatives using D. It assumes that different variables are independent. The result of Series is usually a SeriesData object, which you can manipulate with other functions. Normal[series] truncates a power series and converts it to a normal expression. SeriesCoefficient[series,n] finds the coefficient of the n^th-order term. The following options can be given: Analytic FEPrivate`ImportImage[FrontEnd`FileName[{Documentation, Miscellaneous}, ExampleJumpLink.png]] True whether to treat unrecognized functions as analytic Assumptions FEPrivate`ImportImage[FrontEnd`FileName[{Documentation, Miscellaneous}, ExampleJumpLink.png]] $Assumptions assumptions to make about parameters SeriesTermGoal Automatic number of terms in the approximation",
            "document_id": "Series"
          },
          "embedding": null,
          "score": 0.779730856
        }

Note: the first part of the "text" field is missing here. What sort of chunking mechanism is being used? Is it the RecursiveCharacterTextSplitter explained here?

Also, if so, what sort of context-overlapping method does it use when one part of the answer to the query lies in the first chunk and the second part lies in the next chunk?

DosticJelena commented 1 year ago

Hello! It isn't clear from this part of the response alone which chunking strategy was used in this example. Different chunking strategies can yield similar results, depending on the configured options. You can find more details here: https://www.pinecone.io/learn/chunking-strategies/.
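As a rough illustration of why a returned chunk can lack the start of the original text, here is a minimal sketch of a fixed-size splitter with no overlap. The function name `split_fixed`, the character-based chunk size, and the `<document_id>_<index>` naming are assumptions made for illustration (the naming simply mirrors the `Series_1` id in the query result above); this is not the plugin's actual implementation.

```python
def split_fixed(doc_id: str, text: str, chunk_size: int = 40) -> list[dict]:
    """Split text into consecutive chunks of at most chunk_size characters.

    Hypothetical helper: each chunk gets an id of the form "<doc_id>_<index>".
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    for index, start in enumerate(range(0, len(text), chunk_size)):
        chunks.append({
            "id": f"{doc_id}_{index}",
            "text": text[start:start + chunk_size],
        })
    return chunks


if __name__ == "__main__":
    sample = (
        "Series generates a power series expansion. "
        "Some related keywords are Taylor series and Laurent series."
    )
    for chunk in split_fixed("Series", sample):
        print(chunk["id"], repr(chunk["text"]))
```

With a splitter like this, a query that matches text near the end of the document returns only the later chunk, so the beginning of the original "text" field does not appear in that result.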

On the second part of the question: with overlapping chunks, each chunk repeats a portion of the text from the end of the previous chunk, and its own ending is repeated at the start of the next one. An answer that straddles a chunk boundary is then more likely to appear whole in at least one chunk, although this is not guaranteed if the answer is longer than the overlap. You can adjust the length of the overlap, and increase the chunk size if you expect answers that span a larger amount of text.
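The overlap idea can be sketched as a sliding window over words. This is a minimal sketch under stated assumptions: the function name `split_with_overlap`, the word-based units, and the default sizes are all hypothetical, and real splitters typically count characters or tokens instead.

```python
def split_with_overlap(text: str, chunk_size: int = 8, overlap: int = 3) -> list[str]:
    """Split text into chunks of chunk_size words, where each chunk starts
    with the last `overlap` words of the previous chunk.

    Hypothetical helper for illustration only.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = (
        "Series can construct standard Taylor series, as well as certain "
        "expansions involving negative powers, fractional powers, and logarithms."
    )
    for chunk in split_with_overlap(sample, chunk_size=8, overlap=3):
        print(repr(chunk))
```

A larger `overlap` makes it more likely that a passage crossing a boundary is contained in a single chunk, at the cost of storing and embedding more duplicated text.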