mayooear / gpt4-pdf-chatbot-langchain

GPT4 & LangChain Chatbot for large PDF docs
https://www.youtube.com/watch?v=ih9PBGVVOO4

Upload embeds to Pinecone but shown empty #373

Closed — taozangyearone closed this issue 8 months ago

taozangyearone commented 12 months ago

Hi, I successfully split my document into chunks and uploaded the embeddings to Pinecone, but when I try to fetch them, they show up empty. I believe I have set up the correct Pinecone index name and API key. My document yields 11 vectors, and I gave it a namespace called "abc". After running ingest-data.ts, my Pinecone dashboard shows the namespace with 11 vectors stored. This looks almost exactly like the same issue as this post: https://community.pinecone.io/t/vectors-are-sent-to-pinecone-but-seem-to-arrive-empty/1113, but no solution was provided there. I'd appreciate any help or insight!


---Code below---

```typescript
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';
import { pinecone } from '@/utils/pinecone-client';
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';

/* Name of the directory to retrieve your files from.
   Make sure to add your PDF files inside the 'docs' folder. */
const filePath = 'docs';

export const run = async () => {
  try {
    /* Load raw docs from all files in the directory */
    const directoryLoader = new DirectoryLoader(filePath, {
      '.pdf': (path) => new PDFLoader(path),
    });

    // const loader = new PDFLoader(filePath);
    const rawDocs = await directoryLoader.load();

    /* Split text into chunks */
    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000, // size?
      chunkOverlap: 200,
    });

    const docs = await textSplitter.splitDocuments(rawDocs);

    console.log('number of chunks:', docs.length); // size is correct

    console.log('creating vector store...'); // not executed until the awaits above are done
    /* Create and store the embeddings in the vector store */
    const embeddings = new OpenAIEmbeddings(); // create an embeddings object
    // const embed1 = await embeddings.embedDocuments([docs[0].pageContent, docs[1].pageContent]);
    // console.log(embed1);
    console.log('---');
    console.log('example chunk:', docs[0]);
    console.log('---');

    const index = pinecone.Index('chatgpt'); // change to your own index name (e.g. chatgpt)
    console.log('check Pinecone index:', PINECONE_INDEX_NAME);
    // console.log(index);

    // Embed the PDF documents and upsert them into Pinecone
    await PineconeStore.fromDocuments(docs, embeddings, {
      pineconeIndex: index,
      namespace: 'abc', // namespace defined here will show up on Pinecone
      // textKey: 'pageContent', // text key??
    });
  } catch (error) {
    console.log('error', error);
    throw new Error('Failed to ingest your data');
  }
};

(async () => {
  await run();
  console.log('ingestion complete');
  // console.log(PINECONE_INDEX_NAME);
  // console.log(PINECONE_NAME_SPACE);
  // console.log(PineconeStore);
})();
```

dosubot[bot] commented 8 months ago

Hi, @taozangyearone! I'm Dosu, and I'm here to help the gpt4-pdf-chatbot-langchain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you reported an issue where you were able to upload your document embeddings to Pinecone, but when you tried to fetch the embeddings, they were empty. There hasn't been any further activity or comments on this issue since then.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the gpt4-pdf-chatbot-langchain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding, and we appreciate your contribution to the gpt4-pdf-chatbot-langchain project!