reorproject / reor

Private & local AI personal knowledge management app.
https://reorproject.org
GNU Affero General Public License v3.0
6.76k stars, 398 forks

Import external context through scraping, PDF loading and other tools. #61

Open ElCuboNegro opened 6 months ago

ElCuboNegro commented 6 months ago

It might be really useful to add things to the context collection directly, either through URLs or by passing files (PDFs) into the vault. Maybe I can help with that.

ElCuboNegro commented 6 months ago

https://github.com/run-llama/llama_parse/blob/main/examples/demo_advanced_astradb.ipynb

ElCuboNegro commented 6 months ago

https://blog.llamaindex.ai/introducing-llamacloud-and-llamaparse-af8cedf9006b

samlhuillier commented 6 months ago

Thank you for opening this, @ElCuboNegro. Yes, I agree that this would definitely be a great thing to add. Do you think you'd want to work on a PR for this?

ElCuboNegro commented 6 months ago

Yes

ElCuboNegro commented 6 months ago

Do you have any documentation about how you are doing the RAG?

I'm asking with the aim of implementing something like this: https://ai.plainenglish.io/unlocking-whole-dataset-reasoning-why-knowledge-graphs-are-the-future-of-ai-systems-fc8726367808

Haze-sh commented 4 months ago

Any updates on this? I would like to help with a PR.

samlhuillier commented 4 months ago

That would be great @Haze-sh! Which part specifically do you want to work on? Loading PDFs or indexing content from the web?

Haze-sh commented 4 months ago

I was thinking about the PDF loading. Can you suggest a starting point?

samlhuillier commented 4 months ago

To load in PDFs, I guess we'd probably want to build out a couple of separate features in stages:

  1. Be able to read PDFs. Currently, there is a list called markdownExtensions which restricts the files Reor reads to those with markdown extensions. The first thing we'd probably do is add the pdf extension to this. This list of extensions is used to generate the FileInfoList, which is essentially a tree representation of the metadata of each file Reor uses as context.
  2. Then, we'd probably want to update our readFile and read-file ipc handler functions so that there are two calls. One is for indexing and will use a library like pdf-parse to read the actual text content of the file so that we can index it in the vector database. The other, in the ipc handler, will probably want to read the PDF file as base64 so that it can be returned to the renderer process and rendered. (Bear in mind that both of these calls will basically be an if statement that checks for the special case of PDF files.)
  3. Now the indexing should work fine. The next step is probably working on rendering PDF files in the editor. This will involve modifying the openFileByPath function in use-file-by-filepath.ts and adding custom logic for the case where the file extension is pdf. The line editor?.commands.setContent(fileContent); should not be run in that case, as it sets the content of our current TipTap editor, which probably won't work with PDF content.
  4. A final extension step which I think would be nice (but is slightly unrelated to this PR) is to handle dragging and dropping any file into Reor.
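
A rough sketch of what steps 1 and 2 could look like, in case it helps as a starting point. This is not Reor's actual code: the contents of markdownExtensions, the supportedExtensions list, and the helper names isSupportedFile, isPdf, readFileForIndexing and readFileForRenderer are all assumptions made up for illustration. The PDF text extraction is passed in as a function so the sketch does not depend on any particular library; something like pdf-parse could be plugged in there.

```typescript
import * as path from "path";

// Assumption: the real markdownExtensions list in Reor may contain other entries.
const markdownExtensions = [".md", ".markdown", ".mdown", ".mkdn", ".mkd"];

// Step 1: allow PDFs alongside markdown files (hypothetical combined list).
const supportedExtensions = [...markdownExtensions, ".pdf"];

function isSupportedFile(filePath: string): boolean {
  return supportedExtensions.includes(path.extname(filePath).toLowerCase());
}

function isPdf(filePath: string): boolean {
  return path.extname(filePath).toLowerCase() === ".pdf";
}

// Hypothetical hook: a function that turns raw PDF bytes into plain text.
// A library like pdf-parse could be wrapped to match this shape.
type PdfTextExtractor = (data: Buffer) => Promise<string>;

// Step 2a: the indexing path. PDFs go through the extractor,
// everything else is decoded as UTF-8 text.
async function readFileForIndexing(
  filePath: string,
  data: Buffer,
  extractPdfText: PdfTextExtractor
): Promise<string> {
  if (isPdf(filePath)) {
    return extractPdfText(data);
  }
  return data.toString("utf-8");
}

// Step 2b: the renderer path. PDFs are returned as base64 so they can
// cross the ipc boundary as a string; other files stay plain text.
function readFileForRenderer(filePath: string, data: Buffer): string {
  return isPdf(filePath) ? data.toString("base64") : data.toString("utf-8");
}
```

Both read functions are just an if statement on the file extension, as described in step 2. Taking the file contents as a Buffer (rather than reading from disk inside the helpers) keeps the special-casing separate from the file I/O, which the existing readFile and ipc handler would still do.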

Let me know if you have any other questions! Very happy to help :)

xuquankun commented 5 days ago

Hello, is there any progress on loading PDFs?