Closed sean-s14 closed 1 year ago
Here's what I did with the same error. Also had to use a different pdf, but it worked.
Link to comment: https://github.com/mayooear/gpt4-pdf-chatbot-langchain/issues/318#issuecomment-1557525044
It's just a formatting error caused by a new langchain update. Replace line 13 with this:
export const run = async () => {
  try {
    /* load raw docs from all files in the directory */
    const directoryLoader = new DirectoryLoader(filePath, {
      '.pdf': (path) => new CustomPDFLoader(path, '/pdf'),
    });
It'll work fine afterwards (remember to save the file).
Got the same issue here.
docs was empty.
[WARN] Importing from 'langchain/document_loaders' is deprecated. Import from eg. 'langchain/document_loaders/fs/text' or 'langchain/document_loaders/web/cheerio' instead. See https://js.langchain.com/docs/getting-started/install#updating-from-0052 for upgrade instructions.
split docs []
Have tried to add the 2nd argument, '/pdf', but still not working. My langchain version is 0.0.82.
Hi, @sean-s14! I'm Dosu, and I'm here to help the gpt4-pdf-chatbot-langchain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, you reported an issue where the directoryLoader.load() function in scripts/ingest-data.ts is returning an empty array when using textSplitter.splitDocuments(rawDocs). There have been some comments on the issue, with bookofbash suggesting a workaround using a different PDF and EgyptianBrince providing a code snippet to fix a formatting error. Another user, twut, also reported a similar issue but hasn't found a solution yet.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the gpt4-pdf-chatbot-langchain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your contribution, and we appreciate your understanding as we manage our backlog. Let us know if you have any further questions or concerns!
The line below in scripts/ingest-data.ts is returning an empty array. I logged rawDocs and it displayed the source and pdf_numpages metadata correctly; however, the pageContent is just a bunch of newline strings concatenated together, like so: '\n' + '\n' + '\n' ...
Running npm run ingest returns the following:
Edit: Turns out this is a problem I'm having with just one textbook. Unsure what to do.
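Since the symptom described above is a pageContent made up of nothing but newlines, one way to narrow it down is to check each loaded document for whitespace-only text before splitting. The sketch below is only an illustration: the LoadedDoc interface, the partitionByText helper, and the sample data are assumptions made up for this example and are not part of the repository or of LangChain's API.

```typescript
// Assumed minimal shape of a loaded document (pageContent plus metadata),
// modelled on what the issue logs; this is NOT LangChain's own type.
interface LoadedDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical helper: separate documents that contain real text from
// those whose pageContent is only whitespace (e.g. '\n' + '\n' + '\n').
function partitionByText(docs: LoadedDoc[]): {
  withText: LoadedDoc[];
  blank: LoadedDoc[];
} {
  const withText: LoadedDoc[] = [];
  const blank: LoadedDoc[] = [];
  for (const doc of docs) {
    if (doc.pageContent.trim().length > 0) {
      withText.push(doc);
    } else {
      blank.push(doc);
    }
  }
  return { withText, blank };
}

// Made-up sample data standing in for rawDocs.
const sample: LoadedDoc[] = [
  { pageContent: '\n\n\n', metadata: { source: 'docs/textbook.pdf', pdf_numpages: 3 } },
  { pageContent: 'Chapter 1\nSome real text.', metadata: { source: 'docs/notes.pdf', pdf_numpages: 1 } },
];

const { withText, blank } = partitionByText(sample);

// Report every file that produced no extractable text.
for (const doc of blank) {
  console.warn(`No extractable text in ${String(doc.metadata.source)}`);
}
console.log(`${withText.length} document(s) with text, ${blank.length} blank`);
```

Running a check like this on rawDocs before calling the text splitter would show whether the empty result comes from specific files rather than from the loader configuration.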