khoj-ai / khoj

Your AI second brain. Self-hostable. Get answers from the internet or your docs. Use any online or local LLM (e.g. gpt, claude, gemini, llama, qwen, mistral). Build custom agents, personalized automations.
https://khoj.dev
GNU Affero General Public License v3.0
12.84k stars, 657 forks

Can't add pdfs to library. #493

Closed spott closed 1 year ago

spott commented 1 year ago

When I add a folder of pdfs to the library, it isn't indexed, even after forcing a reindex. When restarting the app, it looks for the pdf.jsonl.gzip file and doesn't find it.

(it does reindex GitHub and Obsidian files).

I've spent some time looking for the culprit, but I wasn't able to find it. My best guess is that PDF indexing isn't being triggered in routers.indexer.configure_content. When I go to the search page in the web UI, the "Search Type" drop down is missing the pdf search type, which might be what is causing the problem, though I'm not sure why.

sabaimran commented 1 year ago

Hi @spott, thanks for reporting the issue. That sounds fairly frustrating. When you look at your PDF configuration, do you see anything there?

It won't find pdf.jsonl.gzip if it failed to index. The reason you don't see it in the search type drop down is the same -- it didn't manage to configure any content for that data type.

You're able to search your PDF and Obsidian/markdown data effectively? Do you have any PDFs in your Obsidian vault?

sabaimran commented 1 year ago

It would be helpful if you could give a step by step breakdown of the actions/configurations you did so I can try to reproduce. For example, maybe it was something like this?

  1. Install Khoj: pip install khoj-assistant
  2. Start Khoj by running khoj
  3. Install Obsidian plugin
  4. Link Khoj backend with Obsidian client with vault containing markdown, pdf files
  5. Manually configure PDF directory with users/spott/papers
  6. Force re-indexing via Khoj web UI

Or any equivalent steps/screenshots to reproduce the error. Thanks in advance!

spott commented 1 year ago

I'm able to search my obsidian vault without a problem. My steps are likely something like what you said, with the "add the GitHub connector" somewhere after 4 and before 5.

Unfortunately, I haven't been able to get any pdfs indexed at all.

When I add some logger messages here:

        logger.info("attempting to initialize pdf search")
        logger.info(f"{content_config=}")
        logger.info(f"{files=}")

And press the "reinitialize" button, I get:

           INFO     attempting to initialize pdf search                                                                                                                           indexer.py:266
           INFO     content_config=ContentConfig(org=None, image=None,                                                                                                            indexer.py:267
                    markdown=TextContentConfig(compressed_jsonl=PosixPath('/Users/spott/.khoj/content/markdown/_Users_spott_ObsidianNotes_Personal.jsonl.gz'),                                  
                    embeddings_file=PosixPath('/Users/spott/.khoj/content/markdown/_Users_spott_ObsidianNotes_Personal.pt'), input_files=None,                                                  
                    input_filter=['/Users/spott/ObsidianNotes/Personal/**/*.md'], index_heading_entries=False),                                                                                 
                    pdf=TextContentConfig(compressed_jsonl=PosixPath('/Users/spott/.khoj/content/pdf/pdf.jsonl.gz'),                                                                            
                    embeddings_file=PosixPath('~/.khoj/content/pdf/pdf_embeddings.pt'), input_files=None, input_filter=['~/Zotero/storage/**/*.pdf**/*.pdf'],                                   
                    index_heading_entries=False), plaintext=None, github=GithubContentConfig(compressed_jsonl=PosixPath('/Users/spott/.khoj/content/github/github.jsonl.gz'),                   
                    embeddings_file=PosixPath('/Users/spott/.khoj/content/github/github_embeddings.pt'), pat_token='<redacted>',                                  
                    repos=[GithubRepoConfig(name='dotfiles', owner='spott', branch='main'), GithubRepoConfig(name='iac', owner='spott', branch='main')]), plugins=None,                         
                    notion=None)
           INFO     files={'org': {}, 'markdown': { ... }, }, 'plaintext': {}, 'pdf': {}}

So the 'pdf' part of files isn't filled.

When I add some logging to this function in fs_syncer, I get:

           INFO     getting pdf files config=TextContentConfig(compressed_jsonl=PosixPath('/Users/spott/.khoj/content/pdf/pdf.jsonl.gz'),                                       fs_syncer.py:183
                    embeddings_file=PosixPath('~/.khoj/content/pdf/pdf_embeddings.pt'), input_files=None, input_filter=['~/Zotero/storage/**/*.pdf**/*.pdf'],                                   
                    index_heading_entries=False)                                                                                                                                                
           INFO     getting pdf files pdf_files=None                                                                                                                            fs_syncer.py:184
           INFO     getting pdf files pdf_file_filter=['~/Zotero/storage/**/*.pdf**/*.pdf']                                                                                     fs_syncer.py:185
           INFO     files: all_pdf_files=[]               

The last line is placed here.

So it looks like the "pdf_file_filter" is getting **/*.pdf appended to the end, after I had already typed **/*.pdf myself. When I change my input to ~/Zotero/storage/ (without the **/*.pdf), things appear to index correctly.
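For anyone wondering why the doubled pattern silently matches nothing: here is a minimal sketch, assuming the filter is expanded with Python's standard glob in recursive mode (I haven't confirmed that this is exactly what fs_syncer does). The folder layout and the find_pdfs helper are made up for illustration.

```python
import glob
import os
import tempfile


def find_pdfs(root, pattern):
    """Hypothetical helper: expand a glob pattern under root, recursively."""
    return sorted(glob.glob(os.path.join(root, pattern), recursive=True))


with tempfile.TemporaryDirectory() as root:
    # Mimic a Zotero-style layout: storage/<item-id>/<paper>.pdf
    item_dir = os.path.join(root, "storage", "ABCD1234")
    os.makedirs(item_dir)
    open(os.path.join(item_dir, "paper.pdf"), "w").close()

    # The pattern from the log: "*.pdf**" has to match a *directory* name,
    # and only then is "*.pdf" matched against files inside that directory.
    doubled = find_pdfs(root, "storage/**/*.pdf**/*.pdf")

    # The pattern that was actually intended.
    intended = find_pdfs(root, "storage/**/*.pdf")

    print("doubled pattern matched:", len(doubled), "file(s)")
    print("intended pattern matched:", len(intended), "file(s)")
```

The doubled pattern is still a valid glob, so nothing raises an error; it just asks for PDFs inside a directory whose own name contains ".pdf", which is why all_pdf_files comes back empty.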

sabaimran commented 1 year ago

Ah yeah, somehow that file path '~/Zotero/storage/**/*.pdf**/*.pdf' got a bit wonky. Did you configure that using the web UI? It looks like you added the *.pdf on your own, and then the client side code did the same thing. We should probably check first if the wildcards are already populated before updating the filepath in the config.

sabaimran commented 1 year ago

Really appreciate the helpful debugging and investigation, @spott !

If you're interested in contributing, especially since you've done a lot of the hard work in diagnosing the issue, feel free to let me know! Else, I can do it.

The fix would be in content_type_input.html, to check whether the globFormat (**/*.) is already present in the file path before updating the inputFilter path in this line: inputFilter.push(nodes[i].value + globFormat + suffixes[j]);.

https://github.com/khoj-ai/khoj/blob/f6f7a62d8076580e8794b18cee20ba86dd95a0e6/src/khoj/interface/web/content_type_input.html#L128-L132
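A rough sketch of that check, written in Python rather than the JavaScript the page actually uses, just to show the logic. The function name build_input_filter is hypothetical, and adding a missing trailing slash is an extra assumption that goes beyond the fix described above.

```python
GLOB_FORMAT = "**/*."  # the globFormat value quoted above


def build_input_filter(paths, suffixes, glob_format=GLOB_FORMAT):
    """Hypothetical helper: append the glob only to paths that lack one."""
    input_filter = []
    for path in paths:
        path = path.strip()
        if not path:
            continue  # skip empty input fields
        if glob_format in path:
            # The user already typed a glob; keep it exactly as written.
            input_filter.append(path)
            continue
        # Assumption: normalise to a trailing slash before appending the glob.
        base = path if path.endswith("/") else path + "/"
        for suffix in suffixes:
            input_filter.append(base + glob_format + suffix)
    return input_filter


print(build_input_filter(["~/Zotero/storage/**/*.pdf"], ["pdf"]))
print(build_input_filter(["~/Zotero/storage/"], ["pdf"]))
```

With this check, a path that already carries the glob and a bare directory path both end up as the same single filter, instead of the first one being doubled.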

spott commented 1 year ago

I actually was going to argue that here:

https://github.com/khoj-ai/khoj/blob/f6f7a62d8076580e8794b18cee20ba86dd95a0e6/src/khoj/interface/web/content_type_input.html#L37C143-L37C161 . (I can't figure out how to show this inline... how did you do that?)

You should drop the input_filter.split('/*')[0]. That way it is clear exactly which glob is being used, and if someone adds **/*.pdf like I did, they will see it the next time they visit the page. Doing both probably makes the most sense.
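To illustrate why the split hides the problem, here is a small sketch. It assumes the template expression behaves like Python's str.split (Jinja normally calls the string method directly); displayed_path is a made-up name for what the form field ends up showing.

```python
def displayed_path(input_filter):
    """Hypothetical stand-in for the template's input_filter.split('/*')[0]."""
    return input_filter.split("/*")[0]


intended = "~/Zotero/storage/**/*.pdf"
doubled = "~/Zotero/storage/**/*.pdf**/*.pdf"

# Both stored filters collapse to the same value in the form field,
# so the doubled glob is invisible when you revisit the page.
print(displayed_path(intended))
print(displayed_path(doubled))
```

Everything from the first "/*" onward is cut off, so a correct filter and a broken one look identical in the UI.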

I'll submit a pull request.

agzam commented 1 year ago

I also struggled with adding pdfs and couldn't figure it out; they wouldn't index, even though .org files worked from the get-go. What I eventually ended up doing was setting the input-filter param in khoj.yml for pdfs, similar to how it sets the path for org files. Then I deleted everything in the folder except the khoj.yml file, restarted, and forced it to re-initialize.