Closed: giusedroid closed this issue 8 months ago
🤖
Hello @giusedroid!
I'm Dosu, an assistant bot here to help you out while we wait for a human maintainer to come along. I can help with resolving bugs, answering questions, and guiding you if you're interested in contributing to our repository.
Just letting you know that I'm currently looking into your issue and will provide a detailed response shortly.
🤖
Thank you for providing detailed information about the issue you're encountering. It seems the problem is related to the way JavaScript handles object keys. As you've correctly pointed out, JavaScript's Object.keys() method does not guarantee a consistent key order across objects built in different ways, which might be causing the discrepancies in field IDs and field order you're observing.
Your suggestion to sort the keys before iterating over them could indeed help ensure a consistent order:
Object.keys(documents[i].metadata).sort().forEach((metaKey) => {
  // ...
});
However, this only ensures a consistent order if the keys are strings and not symbols: Object.keys() returns an array of an object's own enumerable string-keyed property names, in the order a normal loop would iterate them. If the keys are symbols, you might need to use Reflect.ownKeys() instead, which returns all own keys, including non-enumerable and symbol keys.
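For illustration, here is a minimal sketch (not langchainjs code; toRecord and the 'text' column name are made up for the example) of flattening a document into a flat record with a deterministic column order by sorting the string keys first:
// Hypothetical helper, not part of langchainjs: build a flat record with a
// deterministic column order by sorting the metadata keys before copying them.
function toRecord(
  pageContent: string,
  metadata: Record<string, unknown>,
  vector: number[],
): Record<string, unknown> {
  const record: Record<string, unknown> = { vector, text: pageContent };
  for (const key of Object.keys(metadata).sort()) {
    record[key] = metadata[key]; // same key set as before, now in a stable order
  }
  return record;
}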
As for the issue with the id of generated fields being different, it's not immediately clear why this is happening based on the provided code and error messages. The id should be automatically generated and incremented by LanceDB when you insert a new record, so it's strange that you're seeing different id values for the same source documents.
One possible explanation could be that some kind of race condition or asynchronous operation is causing the documents to be inserted more than once, resulting in different id values. However, without more information about how you're using the fromDocuments method and what your documents array looks like, it's hard to say for sure.
If you could provide more details about your usage of the fromDocuments method and the structure of your documents array, it would be easier to diagnose the problem. For example, are you calling fromDocuments multiple times in quick succession, or doing anything else that might cause the documents to be inserted more than once? Also, are all the documents in your documents array unique, or could there be duplicates?
I hope this helps, and I look forward to your response.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
Getting similar error:
services/vectordb/types.ts
import { Document } from "langchain/document";

export interface IVectorDb<TStore> {
  documents: Document<Record<string, any>>[]
  GetStore(): Promise<TStore>
  ImportDocuments(docs: Document<Record<string, any>>[]): Promise<void>
  MakeRetriever(): any
}
services/vectordb/lancedb.ts
import { Table, WriteMode, connect } from 'vectordb';
import { LanceDB } from 'langchain/vectorstores/lancedb';
import { OllamaEmbeddings } from 'langchain/embeddings/ollama';
import { environment } from '../config';
import { IVectorDb } from './types';
import type { Document } from 'langchain/document';

export const LanceStore = await InitLanceStore();

async function InitLanceStore() {
  try {
    console.log('initializing LanceDB...');
    const db = await connect(environment.APP_VECTORDB_FILE);
    let table: Table<number[]>;
    const store: IVectorDb<LanceDB> = {
      documents: [
        {
          pageContent: '',
          metadata: {},
        },
      ],
      async GetStore() {
        console.log('opening table %s...', environment.APP_VECTORDB_TABLE);
        table = await db.openTable(environment.APP_VECTORDB_TABLE);
        const store = LanceDB.fromDocuments(
          this.documents,
          new OllamaEmbeddings(),
          { table },
        );
        return store;
      },
      async ImportDocuments(documents) {
        // embed the PDF documents
        console.log('creating table %s...', environment.APP_VECTORDB_TABLE);
        table = await db.createTable(
          environment.APP_VECTORDB_TABLE,
          [
            {
              vector: Array(1536),
              text: 'sample',
              source: 'string',
              pdf: {key: "value"},
              loc: {key: "value"}
            },
          ],
          { writeMode: WriteMode.Overwrite }
        );
        const embeddings = new OllamaEmbeddings();
        await LanceDB.fromDocuments(documents, embeddings, {
          table,
          textKey: 'text',
        });
        console.log('ingested %s documents', documents.length);
      },
      async MakeRetriever() {
        const vectorStore = await this.GetStore();
        // Use a callback to get intermediate sources from the middle of the chain
        let resolveWithDocuments: (value: Document[]) => void;
        const documentPromise = new Promise<Document[]>((resolve) => {
          resolveWithDocuments = resolve;
        });
        const retriever = vectorStore.asRetriever({
          callbacks: [
            {
              handleRetrieverEnd(documents) {
                resolveWithDocuments(documents);
              },
            },
          ],
        });
        return { retriever, documentPromise };
      },
    };
    return store;
  } catch (error) {
    console.log('error', error);
    throw new Error('Failed to initialize LanceDB');
  }
}
tasks/import.ts
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { LanceStore } from '@/services/vectordb/lancedb';
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';
import { environment } from '@/services/config';

(async () => {
  /* load raw docs from all files in the directory */
  const directoryLoader = new DirectoryLoader(environment.APP_DOCUMENT_PATH, {
    '.pdf': (path) => new PDFLoader(path),
  });
  // const loader = new PDFLoader(filePath);
  const raw = await directoryLoader.load();

  /* Split text into chunks */
  const textSplitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
  });
  // workaround: strip all metadata so the schema matches the seed row
  const documents = (await textSplitter.splitDocuments(raw))
    .map(doc => ({
      pageContent: doc.pageContent,
      metadata: {}
    }));

  console.log('creating vector store...');
  await LanceStore.ImportDocuments(documents);
  console.log('import complete');
})();
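The map above strips every chunk's metadata down to {} so the table schema can't drift. If you want to keep metadata instead, one alternative (a sketch; normalizeMetadata is a hypothetical helper, and the key set shown is just the one the pdf loader and splitter produce in this thread) is to project every chunk onto the same fixed keys so every row has an identical shape:
// Hypothetical helper: force every chunk onto one fixed metadata shape,
// so LanceDB never sees two rows with different schemas.
function normalizeMetadata(
  docs: { pageContent: string; metadata: Record<string, any> }[],
) {
  return docs.map((doc) => ({
    pageContent: doc.pageContent,
    metadata: {
      source: doc.metadata.source ?? '',
      loc: {
        lines: {
          from: doc.metadata.loc?.lines?.from ?? 0,
          to: doc.metadata.loc?.lines?.to ?? 0,
        },
      },
    },
  }));
}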
services/config/index.ts
import { UnifiedEnv } from 'unified-env'

export const environment = new UnifiedEnv({
  APP_DOCUMENT_PATH: { required: true, defaultValue: './docs' },
  APP_VECTORDB_FILE: { required: true, defaultValue: './db/vectordb' },
  APP_VECTORDB_TABLE: { required: true, defaultValue: 'ollama' }
})
  .env() // parse `process.env`
  .argv() // parse `process.argv`
  .file({ filePath: './.env' }) // parse an `.env` file (relative path)
  .generate(); // generate the environment object
results in:
gpt4-pdf-chatbot-langchain on git main [x!?] via nodejs v21.2.0 via nix impure (nix-shell-env) took 7s
x yarn import
initializing LanceDB...
creating vector store...
creating table ollama...
node:internal/process/promises:288
triggerUncaughtException(err, true /* fromPromise */);
^
[Error: LanceDBError: Append with different schema:]
Node.js v18.18.2
ok so the problem for me was this: when initialising my LanceDB, it wants a first record (to define the schema, I guess):
table = await db.createTable(
  environment.APP_VECTORDB_TABLE,
  [
    {
      vector: Array(4096),
      pageContent: '',
    },
  ],
  { writeMode: WriteMode.Overwrite },
);
Problem here was that I am a complete mouth breather and thus am performing exploratory learning by breaking things... all I've got is tutorials which say to do the above.
So you start off using the LanceDB.fromDocuments static method:
LanceDB.fromDocuments(chunks, embeddings, {
  table,
  textKey: 'pageContent',
});
The textKey made me think it would just put content in that column and magically handle the rest of the data... kinda weird, but ok. Yeah nah... not ok.
I guess if you don't use the text splitter then you'll be fine, but if you do, then you now have all this necessary metadata. 👍🏻
Which means that if you started off with an initial first row of:
{
  vector: Array(4096),
  pageContent: '',
},
and your file loader gives you this kind of record:
Document<{
  pageContent: string,
  metadata: {
    source: string
  }
}>
then putting that into LanceDB.fromDocuments will fail unless your first row looks like:
{
  vector: Array(4096),
  pageContent: '',
  source: '',
},
the fun doesn't stop though, because dogma says you should be splitting your documents, so let's 🪓 with textSplitter.
So now, before going into LanceDB your records look like:
{
  pageContent: '',
  metadata: {
    source: '',
    loc: {
      lines: {
        from: 1,
        to: 32
      }
    }
  }
},
now when you put this through LanceDB.fromDocuments, it calls LanceDB.addVectors, which copies pageContent into your textKey column and flattens every metadata key onto the top-level record.
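Roughly, that flattening looks like this (my paraphrase of the linked lancedb.ts, not the verbatim source; I'm assuming the store's textKey defaults to 'text'):
// Paraphrase (not verbatim) of the flattening inside LanceDB.addVectors:
function flattenForLance(
  documents: { pageContent: string; metadata: Record<string, unknown> }[],
  vectors: number[][],
  textKey = 'text',
) {
  return documents.map((doc, i) => {
    // page content goes under the configured textKey column...
    const record: Record<string, unknown> = {
      vector: vectors[i],
      [textKey]: doc.pageContent,
    };
    // ...and every metadata key becomes a top-level column
    Object.keys(doc.metadata).forEach((metaKey) => {
      record[metaKey] = doc.metadata[metaKey];
    });
    return record;
  });
}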
That transforms your records into this:
{
  vector: [/* the computed embedding */],
  pageContent: '',
  source: '',
  loc: {
    lines: {
      from: 1,
      to: 1,
    },
  },
},
which fails because your first row doesn't look like:
{
  vector: Array(4096),
  pageContent: '',
  source: '',
  loc: {
    lines: {
      from: 1,
      to: 1,
    },
  },
},
amazing right?
moral of the story is: the first row you seed the table with has to match the flattened shape that .fromDocuments will write.
tldr for @giusedroid: your first row is probably missing more information from your loc entry. I'd bust out the JavaScript debugging terminal in VS Code and check out what the complete shape of an item in your docs looks like:
docs = await loader.loadAndSplit(splitter)
then reflect that in your first row (minus the metadata key). You currently have:
table = await db.createTable(lanceDbTable, [
  {
    vector: Array(1536),
    text: 'sample',
    source: 'string',
    pdf: {key: "value"},
    loc: {key: "value"}
  },
])
i suspect it needs to be:
table = await db.createTable(lanceDbTable, [
  {
    vector: Array(1536),
    text: 'sample', // or pageContent instead, depending on your textKey
    source: 'string',
    pdf: {key: "value"},
    loc: {
      lines: {
        from: 1,
        to: 2
      }
    }
  },
])
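If you'd rather not hand-maintain that seed row, here's a sketch of deriving it from the first split document (buildSeedRow is a hypothetical helper, not part of vectordb; I'm assuming a 1536-dim embedding and the default textKey of 'text'):
import type { Document } from 'langchain/document';

// Hypothetical helper: mirror addVectors' flattening so the seed row's
// schema matches what fromDocuments will later append.
function buildSeedRow(doc: Document, dims = 1536, textKey = 'text') {
  const row: Record<string, unknown> = {
    vector: Array(dims),
    [textKey]: doc.pageContent,
  };
  for (const [key, value] of Object.entries(doc.metadata)) {
    row[key] = value; // e.g. source, pdf, loc.lines.from/to
  }
  return row;
}

// usage:
// const [first] = await splitter.splitDocuments(raw);
// const table = await db.createTable(lanceDbTable, [buildSeedRow(first)], { writeMode: WriteMode.Overwrite });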
🤖
Hi, @giusedroid,
I'm helping the langchainjs team manage their backlog and am marking this issue as stale. From what I understand, you encountered errors related to upserting into LanceDB with NodeJS, specifically regarding differences in generated field IDs and swapped field orders. You suggested that the issue might be related to the order in which keys are returned by Object.keys and proposed implementing an order guarantee by sorting the keys. Additionally, you provided code that produced the error and a workaround of removing all metadata. airtonix and I provided detailed responses and explanations, including potential solutions and insights into similar errors.
Could you please confirm if this issue is still relevant to the latest version of the langchainjs repository? If it is, please let the langchainjs team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!
Hey folks! While working on a sample with NodeJS & LangChain I bumped into this error. I'm importing docs via the standard pdf loader.
Error type 1: the id of generated fields is different, but the source documents are the same.
Error type 2: the order of fields is swapped for some reason.
at least for the second error type, this is the line where I think the errors are coming from
https://github.com/langchain-ai/langchainjs/blob/main/langchain/src/vectorstores/lancedb.ts#L69
I believe the error may be due to the fact that Object.keys returns keys in the order in which they appear in the object, and the PDF loader may sometimes build the object differently. My suggestion would be to implement an order guarantee such as:
Object.keys(documents[i].metadata).sort().forEach((metaKey) => {
Here's the code that has produced the error
I had to remove all metadata to make it work, like so
can you please have a look?