langchain-ai / langchainjs

🦜🔗 Build context-aware reasoning applications 🦜🔗
https://js.langchain.com/docs/
MIT License
12.16k stars 2.05k forks source link

Notion vectorStore integration breaks when importing a database with an empty row #2807

Closed natalie-bb closed 10 months ago

natalie-bb commented 10 months ago

file:///Users/natalie/chatbot/node_modules/langchain/dist/embeddings/openai.js:139 const batches = chunkArray(this.stripNewLines ? texts.map((t) => t.replace(/\n/g, " ")) : texts, this.batchSize); ^ TypeError: Cannot read properties of undefined (reading 'replace') at file:///Users/natalie/chatbot/node_modules/langchain/dist/embeddings/openai.js:139:76 at Array.map () at OpenAIEmbeddings.embedDocuments (file:///Users/natalie/chatbot/node_modules/langchain/dist/embeddings/openai.js:139:63) at HNSWLib.addDocuments (file:///Users/natalie/chatbot/node_modules/langchain/dist/vectorstores/hnswlib.js:48:54) at Function.fromDocuments (file:///Users/natalie/chatbot/node_modules/langchain/dist/vectorstores/hnswlib.js:258:24) at file:///Users/natalie/chatbot/tools/vectorStores/restaurants.ts:27:46 at processTicksAndRejections (node:internal/process/task_queues:95:5)

dosubot[bot] commented 10 months ago

🤖

Hello,

Thank you for bringing this issue to our attention. It seems like the error is occurring because the embedDocuments function in openai.js is trying to call the replace method on an undefined value. This is likely happening because there's an empty row in the database you're trying to import, which results in an undefined value in the texts array that's passed to embedDocuments.

Here's the relevant part of the code:

const batches = chunkArray(this.stripNewLines ? texts.map((t) => t.replace(/\n/g, " ")) : texts, this.batchSize);

In this line, texts.map((t) => t.replace(/\n/g, " ")) is trying to replace all newline characters in each string in the texts array. If there's an undefined value in the texts array (which would be the case if there's an empty row in the database), calling replace on it will throw a TypeError.

To fix this issue, we could add a check to ensure that t is defined before calling replace on it. Here's how you could modify the code:

const batches = chunkArray(this.stripNewLines ? texts.map((t) => t ? t.replace(/\n/g, " ") : "") : texts, this.batchSize);

In this modified code, t ? t.replace(/\n/g, " ") : "" will return an empty string if t is undefined, which should prevent the TypeError from being thrown.

Please try this modification and let us know if it resolves the issue. If it doesn't, or if you have any other questions, please don't hesitate to ask.

Best, Dosu

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

jacoblee93 commented 10 months ago

@skarard can you have a look at this one?

skarard commented 10 months ago

Sure.

skarard commented 10 months ago

I was able to reproduce the bug and it opens a pretty solid question.

First the bug:

/**
 * Interface for interacting with a document.
 */
export class Document<
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  Metadata extends Record<string, any> = Record<string, any>
> implements DocumentInput
{
  pageContent: string;

  metadata: Metadata;

  constructor(fields: DocumentInput<Metadata>) {
    this.pageContent = fields.pageContent
      ? fields.pageContent.toString()
      : this.pageContent;
    this.metadata = fields.metadata ?? ({} as Metadata);
  }
}

The constructor for the Document class will set pageContent to undefined if it's falsy (e.g. an empty string).

A Notion page contains properties that the NotionAPILoader store as metadata. Often in a Notion database, a page will not have any contents and the pages properties are where the data is stored.

@natalie-bb a work around to get the script working for now is to replace the undefined pageContent with an empty string.

const workaroundDocs = docs.map(
  (doc) => ({ ...doc, pageContent: doc.pageContent ?? "" } as Document)
);

Your code may want to look something like this...

const loader = new NotionAPILoader({...});

const docs = await loader.load();

const workaroundDocs = docs.map(
  (doc) => ({ ...doc, pageContent: doc.pageContent ?? "" } as Document)
);

const vectorStore = await HNSWLib.fromDocuments(workaroundDocs, embeddings);

@jacoblee93 Should the Document class store an empty string if pageContent is an empty string?

jacoblee93 commented 10 months ago

IMO no - it would never get retrieved

skarard commented 10 months ago

@natalie-bb I'm going to add a little more context which might not be obvious.

The database properties are not used when running the embeddings, this is to say the only searchable data is pageContent, the properties can be used to filter or do a database query (if your vector store supports that).

Something that may be closer to having similar searchable Notion properties is to add the properties as a header to pageContent. This is an example following front matter content management.

  const workaroundDocs = docs.map(
    (doc) =>
      ({
        ...doc,
        pageContent: `---\n${yaml.dump(doc.metadata.properties)}---\n${doc.pageContent ?? ""}`,
      } as Document)
  );

There is a yaml parsing/writing library already in Langchain, this is the import you'd need.

import yaml from "js-yaml";
skarard commented 10 months ago

IMO no - it would never get retrieved

@jacoblee93 Ok, if I understand you correctly, this would mean that HNSWLib will need to support pageContent as undefined?

natalie-bb commented 10 months ago

As a workaround, I simply deleted the empty row.

The error was somewhat obtuse, and it wasn't immediately obvious to me that I needed to do that.

jacoblee93 commented 10 months ago

IMO no - it would never get retrieved

@jacoblee93 Ok, if I understand you correctly, this would mean that HNSWLib will need to support pageContent as undefined?

What I mean is it shouldn't be created at all - pageContent is a required field:

https://github.com/langchain-ai/langchainjs/blob/main/langchain/src/document.ts#L18

skarard commented 10 months ago

IMO no - it would never get retrieved

@jacoblee93 Ok, if I understand you correctly, this would mean that HNSWLib will need to support pageContent as undefined?

What I mean is it shouldn't be created at all - pageContent is a required field:

https://github.com/langchain-ai/langchainjs/blob/main/langchain/src/document.ts#L18

Thank you for your clarification.

The typing for the Document class currently suggests that an empty string is a valid pageContents but sets this to undefined breaking the fields typing.

Therefore the correct behaviour is for the Document class constructor is to throw an error when page contents are string.length == 0. I'll throw in a new PR for this, you're welcome to close it if you hate the idea.

I'll also update the NotionAPILoader to check if pageContents is falsy, so that a page would not be created. I'm also going to throw in a new NotionAPILoaderOption to add the page properties as front matter to page contents (yaml). The default will be the current behaviour, to not prepend anything. I'll link the issue to this PR.