directoryLoader.load() not working?

mayooear / gpt4-pdf-chatbot-langchain

GPT4 & LangChain Chatbot for large PDF docs

https://www.youtube.com/watch?v=ih9PBGVVOO4

directoryLoader.load() not working? #320

Closed sean-s14 closed 1 year ago

sean-s14 commented 1 year ago

The line below in scripts/ingest-data.ts is returning an empty array.

const docs = await textSplitter.splitDocuments(rawDocs);

I logged rawDocs, and it displayed the source and pdf_numpages metadata correctly. However, the pageContent is just a series of newline strings concatenated together, like so: '\n' + '\n' + '\n' ...
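A quick way to confirm this symptom is to check which loaded documents contain only whitespace, which can happen, for instance, when a PDF has no extractable text layer. The sketch below is self-contained and does not use langchain; the `RawDoc` interface, the `findBlankDocs` helper, and the sample file names are hypothetical stand-ins for whatever `rawDocs` holds in scripts/ingest-data.ts.

```typescript
// Hypothetical minimal shape of a loaded document. The objects described in
// the thread carry `pageContent` plus metadata such as `source`.
interface RawDoc {
  pageContent: string;
  metadata: { source?: string; [key: string]: unknown };
}

// Returns the documents whose text is empty or whitespace only.
function findBlankDocs(docs: RawDoc[]): RawDoc[] {
  return docs.filter((doc) => doc.pageContent.trim().length === 0);
}

// Example with made-up data: one blank document, one with text.
const sampleDocs: RawDoc[] = [
  { pageContent: "\n\n\n", metadata: { source: "textbook.pdf" } },
  { pageContent: "Chapter 1\nIntroduction", metadata: { source: "notes.pdf" } },
];

for (const doc of findBlankDocs(sampleDocs)) {
  console.warn(`No extractable text in ${doc.metadata.source ?? "unknown source"}`);
}
```

If every document from one file is flagged, the problem probably lies in that PDF rather than in the text splitter, since a splitter given only whitespace has nothing to split.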

Running npm run ingest returns the following:

[WARN] Importing from 'langchain/document_loaders' is deprecated. Import from eg. 'langchain/document_loaders/fs/text' 
or 'langchain/document_loaders/web/cheerio' instead.
See https://js.langchain.com/docs/getting-started/install#updating-from-0052 for upgrade instructions.
(node:4904) ExperimentalWarning: The Fetch API is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
split docs []
creating vector store...
ingestion complete
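Note that the log reports "ingestion complete" even though "split docs" is an empty array, so the failure is easy to miss. One possible safeguard, sketched here with a hypothetical `assertNonEmpty` helper that is not part of the repository, is to stop before the vector store step when there is nothing to embed.

```typescript
// Hypothetical guard: throws when there is nothing to embed, so a script
// cannot report success after indexing zero chunks.
function assertNonEmpty<T>(items: T[], label: string): T[] {
  if (items.length === 0) {
    throw new Error(`${label} is empty; aborting before creating the vector store`);
  }
  return items;
}

// Example usage with made-up data standing in for the split documents.
const chunks = assertNonEmpty(["chunk one", "chunk two"], "split docs");
console.log(`ready to embed ${chunks.length} chunks`);
```

In an ingest script, a call like this would presumably sit between the splitting step and the embedding step, so the run fails loudly instead of finishing with an empty index.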

Edit: It turns out this problem occurs with only one textbook. I'm unsure what to do.

bookofbash commented 1 year ago

Here's what I did when I hit the same error. I also had to use a different PDF, but it worked.

https://github.com/mayooear/gpt4-pdf-chatbot-langchain/issues/318#issuecomment-1557525044

EgyptianBrince commented 1 year ago

It's just a formatting error caused by a recent langchain update. Replace line 13 with this:

export const run = async () => {
  try {
    /* load raw docs from the all files in the directory */
    const directoryLoader = new DirectoryLoader(filePath, {
      '.pdf': (path) => new CustomPDFLoader(path, '/pdf'),
    });

It'll work fine afterwards (remember to save the file).

twut commented 1 year ago

Got the same issue here; docs was empty.

[WARN] Importing from 'langchain/document_loaders' is deprecated. Import from eg. 'langchain/document_loaders/fs/text' or 'langchain/document_loaders/web/cheerio' instead.
See https://js.langchain.com/docs/getting-started/install#updating-from-0052 for upgrade instructions.
split docs []

I tried adding the second argument, '/pdf', but it still doesn't work. My langchain version is 0.0.82.

dosubot[bot] commented 1 year ago

Hi, @sean-s14! I'm Dosu, and I'm here to help the gpt4-pdf-chatbot-langchain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you reported an issue where the directoryLoader.load() function in scripts/ingest-data.ts is returning an empty array when using textSplitter.splitDocuments(rawDocs). There have been some comments on the issue, with bookofbash suggesting a workaround using a different PDF and EgyptianBrince providing a code snippet to fix a formatting error. Another user, twut, also reported a similar issue but hasn't found a solution yet.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the gpt4-pdf-chatbot-langchain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution, and we appreciate your understanding as we manage our backlog. Let us know if you have any further questions or concerns!