reorproject / reor

Private & local AI personal knowledge management app.
https://reorproject.org
GNU Affero General Public License v3.0
6.89k stars, 421 forks

Support more data formats/integration with other PKMs #276

Open zimengzhou1 opened 3 months ago

zimengzhou1 commented 3 months ago

What are our thoughts on supporting document types other than Markdown, for example PDF or plain text? It would also be nice if users could directly use other note-taking apps like Notion as a source of their data. That would lower the barrier to entry for using Reor.

samlhuillier commented 3 months ago

Absolutely! This would be great to add, particularly support for plain text and PDFs. Would you be keen to add this?

zimengzhou1 commented 3 months ago

Sure, I'll have a crack at it!

zimengzhou1 commented 3 months ago

I did a bit of poking around, and it seems supporting PDFs is significantly harder than I thought, even more so after reading this article.

I tried using the library "pdf-parse" to extract text from the PDFs, but after testing the parser on several research papers, it was clear that the chunks stored in the vector database and shown in "Related Notes" were poorly formatted (especially equations and tables) and not very relevant.

Supporting plain text was pretty trivial: I just added ".txt" as an allowed extension.
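
For anyone following along, an extension allowlist of this kind could look roughly like the sketch below. This is not Reor's actual code: the names `ALLOWED_EXTENSIONS`, `getExtension`, and `isSupportedNote` are made up for illustration, and the only thing taken from the thread is that ".txt" is accepted alongside Markdown.

```typescript
// Hypothetical allowlist. The thread only confirms that ".txt" was added
// next to the existing Markdown support; ".md" is assumed here.
const ALLOWED_EXTENSIONS: ReadonlySet<string> = new Set([".md", ".txt"]);

// Return the lowercased extension of a file path (including the dot),
// or "" when the file name has no extension.
function getExtension(filePath: string): string {
  // Keep only the last path segment, handling both "/" and "\" separators.
  const base = filePath.split(/[\\/]/).pop() || "";
  const dot = base.lastIndexOf(".");
  // A leading dot (e.g. ".gitignore") marks a hidden file, not an extension.
  return dot > 0 ? base.slice(dot).toLowerCase() : "";
}

// True when the file should be picked up as a note.
function isSupportedNote(filePath: string): boolean {
  return ALLOWED_EXTENSIONS.has(getExtension(filePath));
}
```

With this shape, `isSupportedNote("notes/todo.txt")` is accepted while `isSupportedNote("paper.pdf")` is still rejected, so PDF support can be added later as a separate step once text extraction gives usable chunks.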