langchain-ai / langchainjs

🦜🔗 Build context-aware reasoning applications 🦜🔗
https://js.langchain.com/docs/
MIT License

Unsupported Unicode escape sequence 400 Bad Request #4340

Closed nosisky closed 5 months ago

nosisky commented 9 months ago

I am using LangChain with Supabase for vector storage. This worked well until a particular PDF was uploaded; since then I keep getting the error below:

Error: Error inserting: unsupported Unicode escape sequence 400 Bad Request

The full error is below:

Error: Error inserting: unsupported Unicode escape sequence 400 Bad Request
    at SupabaseVectorStore.addVectors (webpack-internal:///(rsc)/./node_modules/@langchain/community/dist/vectorstores/supabase.js:92:23)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at Function.fromDocuments (webpack-internal:///(rsc)/./node_modules/@langchain/community/dist/vectorstores/supabase.js:202:9)
    at POST (webpack-internal:///(rsc)/./src/app/api/study-analysis/route.ts:53:24)

My code looks like below:

    // Load the uploaded PDF and split it into overlapping chunks.
    const docs = await new PDFLoader(file).load();
    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 200,
    });

    const splittedDocs = await textSplitter.splitDocuments(docs);

    // Embed the chunks and insert them into the Supabase table.
    const vectorStore = await SupabaseVectorStore.fromDocuments(
      splittedDocs,
      embeddings,
      {
        // Configuration
      }
    );

dosubot[bot] commented 9 months ago

🤖

Hey @nosisky, great to see you back! Hope you're doing well.

The error you're encountering, "Error inserting: unsupported Unicode escape sequence 400 Bad Request", is most likely caused by the content of the PDF you're uploading. The text extracted from it appears to contain Unicode escape sequences that the PostgreSQL database behind Supabase cannot store.

In the addVectors method of the SupabaseVectorStore class, the content of the documents is directly inserted into the Supabase database. If the pageContent of any document contains unsupported Unicode escape sequences, the upsert operation will fail and throw the error you're seeing.

Here's the relevant code from the SupabaseVectorStore class:

const rows = vectors.map((embedding, idx) => ({
  content: documents[idx].pageContent,
  embedding,
  metadata: documents[idx].metadata,
}));

// ...

const res = await this.client.from(this.tableName).upsert(chunk).select();
if (res.error) {
  throw new Error(
    `Error inserting: ${res.error.message} ${res.status} ${res.statusText}`
  );
}

To resolve this issue, sanitize the text extracted from the PDF before inserting it, by removing or replacing any unsupported Unicode escape sequences in each document's pageContent (and, if needed, its metadata). The exact method will depend on the specific content of your PDF and the requirements of your application.
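As a rough illustration of that sanitizing step, here is a minimal sketch. It assumes the offending sequence is the null character (\u0000), which PostgreSQL cannot store in text or jsonb values; your PDF may contain something else. The names DocLike, stripNullChars, sanitizeValue and sanitizeDocs are hypothetical helpers written for this example and are not part of LangChain or Supabase.

```typescript
// Minimal structural stand-in for a LangChain document (assumption:
// only pageContent and metadata matter for the insert).
interface DocLike {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical helper: remove every U+0000 character from a string.
function stripNullChars(text: string): string {
  return text.replace(/\u0000/g, "");
}

// Hypothetical helper: apply stripNullChars to every string nested inside
// a JSON-like value. Non-plain objects (for example Date) are returned as-is.
function sanitizeValue(value: unknown): unknown {
  if (typeof value === "string") return stripNullChars(value);
  if (Array.isArray(value)) return value.map(sanitizeValue);
  if (value !== null && typeof value === "object") {
    const proto = Object.getPrototypeOf(value);
    if (proto !== Object.prototype && proto !== null) return value;
    const source = value as Record<string, unknown>;
    const result: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      result[key] = sanitizeValue(source[key]);
    }
    return result;
  }
  return value;
}

// Hypothetical helper: return sanitized copies of the documents,
// leaving the input array and its documents unmodified.
function sanitizeDocs(docs: DocLike[]): DocLike[] {
  return docs.map((doc) => ({
    ...doc,
    pageContent: stripNullChars(doc.pageContent),
    metadata: sanitizeValue(doc.metadata) as Record<string, unknown>,
  }));
}
```

With something like this in place, the split chunks could be passed through sanitizeDocs before being handed to SupabaseVectorStore.fromDocuments. Metadata is included because PDF loaders often copy document properties into it, and it ends up in the same insert.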

You can find the relevant code in the SupabaseVectorStore file in the langchainjs repository.

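As for which Unicode escape sequences the Supabase database rejects, I wasn't able to find this information within the langchainjs repository. One known cause of this PostgreSQL error is the null character (\u0000), which PostgreSQL cannot store in text or jsonb values, so that is a reasonable first thing to check. Beyond that, it might be worth checking the Supabase and PostgreSQL documentation or reaching out to Supabase support for more details.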
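Before changing anything, it can help to confirm which chunks are actually affected. The sketch below assumes the culprit is the null character (\u0000), a common trigger for this PostgreSQL error; findNullCharChunks is a hypothetical diagnostic helper, not a LangChain or Supabase function.

```typescript
// Hypothetical diagnostic helper: return the indices of chunks whose
// pageContent contains at least one U+0000 character.
function findNullCharChunks(chunks: { pageContent: string }[]): number[] {
  const hits: number[] = [];
  chunks.forEach((chunk, index) => {
    if (chunk.pageContent.includes("\u0000")) {
      hits.push(index);
    }
  });
  return hits;
}
```

If this returns an empty array for the failing PDF, the null-character assumption does not hold and the rejected sequence is something else.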

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.