langchain-ai / langchainjs


Unable to upsert into LanceDB with NodeJS #3158

Closed giusedroid closed 8 months ago

giusedroid commented 1 year ago

Hey folks! While working on a sample with NodeJS & LangChain I bumped into this error. I'm importing docs via the standard PDF loader.

Error type 1: the id of the generated fields is different, but the source documents are the same.

original=Field(id=0, name=vector, type=fixed_size_list:float:1536)\nField(id=1, name=text, type=string)\nField(id=2, name=source, type=string)\nField(id=3, name=pdf, type=struct)\nField(id=5, name=loc, type=struct)\n 
    new=Field(id=0, name=vector, type=fixed_size_list:float:1536)\nField(id=1, name=text, type=string)\nField(id=2, name=source, type=string)\nField(id=3, name=pdf, type=struct)\nField(id=30, name=loc, type=struct)\n",

Error type 2: the order of the fields is swapped for some reason.

original=Field(id=0, name=vector, type=fixed_size_list:float:1536)\nField(id=1, name=text, type=string)\nField(id=2, name=source, type=string)\nField(id=3, name=pdf, type=struct)\nField(id=5, name=loc, type=struct)\n 
    new=Field(id=0, name=vector, type=fixed_size_list:float:1536)\nField(id=1, name=text, type=string)\nField(id=2, name=pdf, type=struct)\nField(id=3, name=source, type=string)\nField(id=30, name=loc, type=struct)\n",

At least for the second error type, this is the line I think the errors are coming from:

https://github.com/langchain-ai/langchainjs/blob/main/langchain/src/vectorstores/lancedb.ts#L69

Object.keys(documents[i].metadata).forEach((metaKey) => {

I believe the error may be due to the fact that Object.keys returns keys in the order in which they were added to the object, and the PDF loader may sometimes build the metadata object in a different order. My suggestion would be to guarantee a consistent order, for example:

Object.keys(documents[i].metadata).sort().forEach((metaKey) => {
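
For illustration, here's a minimal sketch (with made-up field names) of how insertion order changes what Object.keys returns, and how sorting makes it deterministic:

// two metadata objects with the same keys inserted in different order
const a = { source: 'a.pdf', pdf: {}, loc: {} }
const b = { source: 'b.pdf', loc: {}, pdf: {} }

console.log(Object.keys(a))        // [ 'source', 'pdf', 'loc' ]
console.log(Object.keys(b))        // [ 'source', 'loc', 'pdf' ]

// sorting yields the same column order for both rows
console.log(Object.keys(a).sort()) // [ 'loc', 'pdf', 'source' ]
console.log(Object.keys(b).sort()) // [ 'loc', 'pdf', 'source' ]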

Here's the code that produced the error:

export const handler = async (event) => {
  // The S3 event contains details about the uploaded object
  const bucketName = event.Records[0].s3.bucket.name
  const objectKey = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '))
  const filePath = `/tmp/${objectKey}`

  await createDirectory() // creates a directory in /tmp/embeddings
​
  await downloadObject(bucketName, objectKey, filePath) // downloads a file from S3

  let loader, docs, output

  try{

    loader = new PDFLoader(filePath, {
      splitPages: false
    })
    docs = await loader.loadAndSplit(splitter)
​
  }catch(error){
    console.error('Error loading documents:', error)
    return returnError(error)
  }

  const dir = `s3://${lanceDbSrc}/embeddings`

  let db, table, createTable = false

  try{
    db = await connect(dir)
  }catch(error){
    console.error('Error connecting to LanceDB:', error)
    return returnError(error)
  }

  try{
    table = await db.openTable(lanceDbTable)
  }catch(error){
    createTable = true
    console.log('Table not found with error', error)
  }

  if(createTable){
    console.log(`${lanceDbTable} table not found. Creating it.`)

    try{
      table = await db.createTable(lanceDbTable, [
        { 
          vector: Array(1536), 
          text: 'sample', 
          source: 'string',
          pdf: {key: "value"},
          loc: {key: "value"}
        },
      ])
    }catch(error){
        console.error(
          `Error connecting to LanceDB table ${lanceDbTable} :`,
          error
        )
        return returnError(error)
    }
  }

  await LanceDB.fromDocuments(docs, embeddings, { table })

  return {
    statusCode: 201,
    body: JSON.stringify({
      message: 'OK'
    })
  }
​
}

I had to remove all metadata to make it work, like so:

  docs = docs.map(doc => ({
    pageContent: doc.pageContent,
    metadata: {}
  }))
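
A less destructive alternative might be to normalize the metadata so that every document carries the same keys in the same shape before insertion. This is only an untested sketch, and it assumes the table schema is created with matching source and loc columns:

  docs = docs.map(doc => ({
    pageContent: doc.pageContent,
    metadata: {
      // keep a fixed set of keys so every row maps onto the same schema
      source: doc.metadata.source ?? '',
      loc: {
        lines: {
          from: doc.metadata.loc?.lines?.from ?? 0,
          to: doc.metadata.loc?.lines?.to ?? 0
        }
      }
    }
  }))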

Can you please have a look?

dosubot[bot] commented 1 year ago

🤖

Hello @giusedroid!

I'm Dosu, an assistant bot here to help you out while we wait for a human maintainer to come along. I can help with resolving bugs, answering questions, and guiding you if you're interested in contributing to our repository.

Just letting you know that I'm currently looking into your issue and will provide a detailed response shortly.

dosubot[bot] commented 1 year ago

🤖

Thank you for providing detailed information about the issue you're encountering. It seems like the problem is related to the way JavaScript handles object keys. As you've correctly pointed out, JavaScript's Object.keys() method does not guarantee a consistent order of keys, which might be causing the discrepancies in field IDs and order you're observing.

Your suggestion to sort the keys before iterating over them could indeed help ensure a consistent order:

Object.keys(documents[i].metadata).sort().forEach((metaKey) => {
  // ...
});

However, this would only ensure a consistent order if the keys are strings and not symbols, as Object.keys() only returns an array of an object's own enumerable property names, iterated in the order that a normal loop would. If the keys are symbols, you might need to use Reflect.ownKeys() instead, which returns all keys, including non-enumerable and symbol keys.
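
For instance, a minimal sketch of the difference:

const obj = { text: 'hi', [Symbol('meta')]: 42 }

console.log(Object.keys(obj))      // [ 'text' ]               (enumerable string keys only)
console.log(Reflect.ownKeys(obj))  // [ 'text', Symbol(meta) ] (string and symbol keys)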

As for the issue with the id of generated fields being different, it's not immediately clear why this is happening based on the provided code and error messages. The id should be automatically generated and incremented by LanceDB when you insert a new record, so it's strange that you're seeing different id values for the same source documents.

One possible explanation could be that there's some kind of race condition or asynchronous operation that's causing the documents to be inserted more than once, resulting in different id values. However, without more information about how you're using the fromDocuments method and what your documents array looks like, it's hard to say for sure.

If you could provide more details about your usage of the fromDocuments method and the structure of your documents array, it would be easier to diagnose the problem. For example, are you calling fromDocuments multiple times in quick succession, or are you doing anything else that might cause the documents to be inserted more than once? Also, are all the documents in your documents array unique, or could there be duplicates?

I hope this helps, and I look forward to your response.


airtonix commented 12 months ago

Getting similar error:

services/vectordb/types.ts

import { Document } from "langchain/document";

export interface IVectorDb<TStore> {
    documents: Document<Record<string, any>>[]

    GetStore(): Promise<TStore>

    ImportDocuments(docs: Document<Record<string, any>>[]): Promise<void>

    MakeRetriever(): any
}

services/vectordb/lancedb.ts

import { Table, WriteMode, connect } from 'vectordb';
import { LanceDB } from 'langchain/vectorstores/lancedb';
import { OllamaEmbeddings } from 'langchain/embeddings/ollama';
import { environment } from '../config';
import { IVectorDb } from './types';
import type { Document } from 'langchain/document';

export const LanceStore = await InitLanceStore();

async function InitLanceStore() {
  try {
    console.log('initializing LanceDB...')

    const db = await connect(environment.APP_VECTORDB_FILE);
    let table: Table<number[]>;

    const store: IVectorDb<LanceDB> = {
      documents: [
        {
          pageContent: '',
          metadata: {},
        },
      ],

      async GetStore() {
        console.log('opening table %s...', environment.APP_VECTORDB_TABLE);
        table = await db.openTable(environment.APP_VECTORDB_TABLE);

        const store = LanceDB.fromDocuments(
          this.documents,
          new OllamaEmbeddings(),
          { table },
        );
        return store;
      },

      async ImportDocuments(documents) {
        //embed the PDF documents
        console.log('creating table %s...', environment.APP_VECTORDB_TABLE);
        table = await db.createTable(
          environment.APP_VECTORDB_TABLE,
          [
            {
              vector: Array(1536), 
              text: 'sample', 
              source: 'string',
              pdf: {key: "value"},
              loc: {key: "value"}
            },
          ],
          { writeMode: WriteMode.Overwrite }
        );
        const embeddings = new OllamaEmbeddings();

        await LanceDB.fromDocuments(documents, embeddings, {
          table,
          textKey: 'text',
        });
        console.log('ingested %s documents', documents.length);

      },

      async MakeRetriever() {
        const vectorStore = await this.GetStore();
        // Use a callback to get intermediate sources from the middle of the chain
        let resolveWithDocuments: (value: Document[]) => void;
        const documentPromise = new Promise<Document[]>((resolve) => {
          resolveWithDocuments = resolve;
        });
        const retriever = vectorStore.asRetriever({
          callbacks: [
            {
              handleRetrieverEnd(documents) {
                resolveWithDocuments(documents);
              },
            },
          ],
        });
        return {retriever, documentPromise};
      },
    };

    return store;
  } catch (error) {
    console.log('error', error);
    throw new Error('Failed to initialize LanceDB');
  }
}

tasks/import.ts

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { LanceStore } from '@/services/vectordb/lancedb';
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';
import { environment } from '@/services/config';

(async () => {

  /*load raw docs from the all files in the directory */
  const directoryLoader = new DirectoryLoader(environment.APP_DOCUMENT_PATH, {
    '.pdf': (path) => new PDFLoader(path),
  });

  // const loader = new PDFLoader(filePath);
  const raw = await directoryLoader.load();

  /* Split text into chunks */
  const textSplitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
  });

  const documents = (await textSplitter.splitDocuments(raw))
    .map(doc => ({
      pageContent: doc.pageContent,
      metadata: {}
    }));

  console.log('creating vector store...');
  await LanceStore.ImportDocuments(documents);

  console.log('import complete');
})();

services/config/index.ts

import { UnifiedEnv } from 'unified-env'

export const environment = new UnifiedEnv({
    APP_DOCUMENT_PATH: { required: true, defaultValue: './docs'},
    APP_VECTORDB_FILE: { required: true, defaultValue: './db/vectordb'},
    APP_VECTORDB_TABLE: { required: true, defaultValue: 'ollama'}
  })
    .env() // parse `process.env`
    .argv() // parse `process.argv`
    .file({ filePath: './.env' }) // parse an `.env` file (relative path)
    .generate(); // generate the environment object

results in :

gpt4-pdf-chatbot-langchain on git main [x!?] via nodejs v21.2.0 via nix impure (nix-shell-env) took 7s 
x yarn import
initializing LanceDB...
creating vector store...
creating table ollama...
node:internal/process/promises:288
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[Error: LanceDBError: Append with different schema:]

Node.js v18.18.2

Screencast from 2023-11-30 08-48-19.webm

airtonix commented 12 months ago

OK, so the problem for me was this:

when initialising my LanceDB, it wants a first record (to define the schema, I guess):

       table = await db.createTable(
          environment.APP_VECTORDB_TABLE,
          [
            {
              vector: Array(4096),
              pageContent: '',
            },
          ],
          { writeMode: WriteMode.Overwrite },
        );

Problem here was that I'm a complete mouth breather and am doing exploratory learning by breaking things... all I had were tutorials which say to do the above.

So you start off using the LanceDB.fromDocuments static method:

LanceDB.fromDocuments(chunks, embeddings, {
          table,
          textKey: 'pageContent',
        });

The textKey made me think that it would just add the content to that field and magically handle the rest of the data... kinda weird, but OK.

yeah nah... not ok.

I guess if you don't use the text splitter then you'll be fine, but if you do, you now have all this extra metadata that has to end up in the schema. 👍🏻

Which means that if you started off with an initial first row of:

            {
              vector: Array(4096),
              pageContent: '',
            },

and your file loader gives you this kind of record:

Document<{
  pageContent: string,
  metadata: {
    source: string
  }
}>

then putting that into LanceDB.fromDocuments will fail unless your first row looks like:

            {
              vector: Array(4096),
              pageContent: '',
              source: '',
            },

The fun doesn't stop there though, because dogma says you should be splitting your documents, so let's 🪓 with textSplitter.

So now, before going into LanceDB your records look like:

            {
              pageContent: '',
              metadata: {
                source: '',
                loc: {
                  lines: {
                    from: 1,
                    to: 32
                  }
                }
              }
            },

Now when you put this through LanceDB.fromDocuments, it calls LanceDB.addVectors, which is here: ☝🏻

https://github.com/langchain-ai/langchainjs/blob/993ce7a822bd11e125419bff84a9951bd58e66e5/langchain/src/vectorstores/lancedb.ts#L69-L71

which transforms your records to this:

              pageContent: '',
              source: '',
              loc: {
                lines: {
                  from: 1,
                  to: 1,
                },
              },

which fails because your first row doesn't look like:

              vector: Array(4096),
              pageContent: '',
              source: '',
              loc: {
                lines: {
                  from: 1,
                  to: 1,
                },
              },

amazing right?

moral of the story is:

tldr for @giusedroid

your first row is probably missing more information from your loc entry. I'd bust out the JavaScript debugging terminal in VS Code and check out what the complete shape of an item in your docs looks like:

    docs = await loader.loadAndSplit(splitter)
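
A quick way to dump the full nested shape (just a sketch) is:

    // print the complete shape of the first split document without truncating nested objects
    console.dir(docs[0], { depth: null })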

then reflect that in your first row (minus the metadata key)

      table = await db.createTable(lanceDbTable, [
        { 
          vector: Array(1536), 
          text: 'sample', 
          source: 'string',
          pdf: {key: "value"},
          loc: {key: "value"}
        },
      ])

I suspect it needs to be:

      table = await db.createTable(lanceDbTable, [
        {
          vector: Array(1536),
          text: 'sample',        // 1️⃣
          source: 'string',      // 2️⃣
          pdf: {key: "value"},   // 3️⃣
          loc: {
            lines: {
              from: 1,
              to: 2
            }
          }
        },
      ])
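
More generally, one way to keep the first row in sync with your real documents might be to derive it from an actual split document. This is only a sketch (untested; it assumes the default textKey of 'text', 1536-dimension embeddings, and that all of your documents share the same metadata shape):

      // hypothetical helper: build the schema-defining first row from a real document,
      // mirroring how addVectors flattens pageContent plus the spread metadata keys
      function sampleRowFromDoc(doc, dims) {
        return {
          vector: Array(dims),
          text: doc.pageContent,
          ...doc.metadata
        }
      }

      // e.g. table = await db.createTable(lanceDbTable, [sampleRowFromDoc(docs[0], 1536)])
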
dosubot[bot] commented 8 months ago

🤖

Hi, @giusedroid,

I'm helping the langchainjs team manage their backlog and am marking this issue as stale. From what I understand, you encountered errors when upserting into LanceDB with NodeJS, specifically differences in generated field IDs and swapped field order. You suggested that the issue might be related to the order in which keys are returned by Object.keys and proposed guaranteeing an order by sorting the keys. Additionally, you provided the code that produced the error and a workaround of removing all metadata. airtonix and I provided detailed responses and explanations, including potential solutions and insights into similar errors encountered.

Could you please confirm if this issue is still relevant to the latest version of the langchainjs repository? If it is, please let the langchainjs team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!