Closed spott closed 1 year ago
Hi @spott, thanks for reporting the issue. That sounds fairly frustrating. When you look at your PDF configuration, do you see anything there?
It won't find pdf.jsonl.gzip if it failed to index. The reason you don't see it in the search type drop-down is the same: it didn't manage to configure any content for that data type.
You're able to search your PDF and Obsidian/markdown data effectively? Do you have any PDFs in your Obsidian vault?
It would be helpful if you could give a step by step breakdown of the actions/configurations you did so I can try to reproduce. For example, maybe it was something like this?
pip install khoj-assistant
khoj
users/spott/papers
Or any equivalent steps/screenshots to reproduce the error. Thanks in advance!
I'm able to search my obsidian vault without a problem. My steps are likely something like what you said, with the "add the GitHub connector" somewhere after 4 and before 5.
Unfortunately, I haven't been able to get any PDFs to be indexed at all.
When I add some logger messages here:
logger.info("attempting to initialize pdf search")
logger.info(f"{content_config=}")
logger.info(f"{files=}")
And press the "reinitialize" button, I get:
INFO attempting to initialize pdf search indexer.py:266
INFO content_config=ContentConfig(org=None, image=None, indexer.py:267
markdown=TextContentConfig(compressed_jsonl=PosixPath('/Users/spott/.khoj/content/markdown/_Users_spott_ObsidianNotes_Personal.jsonl.gz'),
embeddings_file=PosixPath('/Users/spott/.khoj/content/markdown/_Users_spott_ObsidianNotes_Personal.pt'), input_files=None,
input_filter=['/Users/spott/ObsidianNotes/Personal/**/*.md'], index_heading_entries=False),
pdf=TextContentConfig(compressed_jsonl=PosixPath('/Users/spott/.khoj/content/pdf/pdf.jsonl.gz'),
embeddings_file=PosixPath('~/.khoj/content/pdf/pdf_embeddings.pt'), input_files=None, input_filter=['~/Zotero/storage/**/*.pdf**/*.pdf'],
index_heading_entries=False), plaintext=None, github=GithubContentConfig(compressed_jsonl=PosixPath('/Users/spott/.khoj/content/github/github.jsonl.gz'),
embeddings_file=PosixPath('/Users/spott/.khoj/content/github/github_embeddings.pt'), pat_token='<redacted>',
repos=[GithubRepoConfig(name='dotfiles', owner='spott', branch='main'), GithubRepoConfig(name='iac', owner='spott', branch='main')]), plugins=None,
notion=None)
INFO files={'org': {}, 'markdown': { ... }, }, 'plaintext': {}, 'pdf': {}}
So the 'pdf' part of files isn't filled.
When I add some logging to this function in fs_syncer, I get:
INFO getting pdf files config=TextContentConfig(compressed_jsonl=PosixPath('/Users/spott/.khoj/content/pdf/pdf.jsonl.gz'), fs_syncer.py:183
embeddings_file=PosixPath('~/.khoj/content/pdf/pdf_embeddings.pt'), input_files=None, input_filter=['~/Zotero/storage/**/*.pdf**/*.pdf'],
index_heading_entries=False)
INFO getting pdf files pdf_files=None fs_syncer.py:184
INFO getting pdf files pdf_file_filter=['~/Zotero/storage/**/*.pdf**/*.pdf'] fs_syncer.py:185
INFO files: all_pdf_files=[]
The last line is placed here.
So it looks like the "pdf_file_filter" is getting **/*.pdf appended to the end, after I had already added a **/*.pdf myself.
... when I change my input to ~/Zotero/storage/ (without the **/*.pdf), it looks like things are indexing now.
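For anyone reading along, here is a minimal sketch of why the doubled pattern comes back empty. It assumes the filter ends up in Python's glob module with recursive=True (an assumption; the thread doesn't show that call), and the directory and file names are made up for the demo.

```python
import glob
import tempfile
from pathlib import Path


def make_demo_tree() -> Path:
    """Create a throwaway tree shaped like storage/<item>/paper.pdf (hypothetical names)."""
    root = Path(tempfile.mkdtemp())
    item_dir = root / "storage" / "ABCD1234"
    item_dir.mkdir(parents=True)
    (item_dir / "paper.pdf").touch()
    return root


def count_matches(root: Path, pattern: str) -> int:
    """Count paths under root matched by a recursive glob pattern."""
    return len(glob.glob(str(root / pattern), recursive=True))


root = make_demo_tree()

# The intended filter: any .pdf at any depth below storage/.
single = count_matches(root, "storage/**/*.pdf")

# The doubled filter: "*.pdf**" would have to match a *directory* whose name
# contains ".pdf", and that directory would have to contain more .pdf files.
doubled = count_matches(root, "storage/**/*.pdf**/*.pdf")

print(f"single pattern matches:  {single}")
print(f"doubled pattern matches: {doubled}")
```

With a layout like this, the doubled pattern can only match PDFs that sit inside a folder whose own name contains ".pdf", which would explain the empty all_pdf_files list above.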
Ah yeah, somehow that file path '~/Zotero/storage/**/*.pdf**/*.pdf' got a bit wonky. Did you configure that using the web UI? It looks like you added the *.pdf on your own, and then the client side code did the same thing. We should probably check first if the wildcards are already populated before updating the filepath in the config.
Really appreciate the helpful debugging and investigation, @spott !
If you're interested in contributing, especially since you've done a lot of the hard work in diagnosing the issue, feel free to let me know! Else, I can do it.
The fix would be in content_type_input.html, to check whether the globFormat (**/*.) is already present in the file path before updating the inputFilter path in this line: inputFilter.push(nodes[i].value + globFormat + suffixes[j]);
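As a rough illustration of the check described above, here is the logic sketched in Python. The real change would go in the client-side JavaScript in content_type_input.html; the function name build_input_filter and the constant GLOB_FORMAT are hypothetical and only mirror the globFormat and suffix concatenation quoted in the line above.

```python
GLOB_FORMAT = "**/*."  # mirrors the globFormat value mentioned above


def build_input_filter(path: str, suffix: str) -> str:
    """Append the glob and suffix only if the path doesn't already carry the glob.

    Hypothetical helper: it mirrors `value + globFormat + suffix` from the
    web UI, with a guard so an already-globbed path is left untouched.
    """
    if GLOB_FORMAT in path:
        # The user already typed a glob such as **/*.pdf; don't add another.
        return path
    return path + GLOB_FORMAT + suffix


print(build_input_filter("~/Zotero/storage/", "pdf"))
print(build_input_filter("~/Zotero/storage/**/*.pdf", "pdf"))
```

Both calls should end up with the same single-glob filter, rather than the second one growing a duplicate **/*.pdf.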
I actually was going to argue that here:
https://github.com/khoj-ai/khoj/blob/f6f7a62d8076580e8794b18cee20ba86dd95a0e6/src/khoj/interface/web/content_type_input.html#L37C143-L37C161 . (I can't figure out how to show this inline... how did you do that?)
You should drop the input_filter.split('/*')[0]. That way it is clear exactly which glob is being used (and if someone adds **/*.pdf like I did, it will be obvious the next time they visit the page). Doing both probably makes the most sense.
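To make the point about the split concrete, here is a small sketch of what that expression does to the stored filter. It assumes the template expression behaves like Python's str.split (an assumption about the template engine); displayed_path is a hypothetical name used only for this demo.

```python
def displayed_path(input_filter: str) -> str:
    """Return what the settings page would show if it keeps only the text before the first '/*'."""
    return input_filter.split("/*")[0]


good_filter = "~/Zotero/storage/**/*.pdf"
doubled_filter = "~/Zotero/storage/**/*.pdf**/*.pdf"

# Both filters collapse to the same displayed value, so the page gives
# no hint that the stored glob has been doubled.
print(displayed_path(good_filter))
print(displayed_path(doubled_filter))
```

Because both values look identical in the form, showing the full stored glob instead would have made the doubled pattern visible right away.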
I'll submit a pull request.
I also struggled with adding PDFs and couldn't figure it out. They wouldn't index, even though .org files worked from the get-go. What I eventually ended up doing was setting the input-filter param in khoj.yml for PDFs, similarly to how it sets the path for org files. Then I deleted everything in the folder except the khoj.yml file, restarted, and forced it to re-initialize.
When I add a folder of pdfs to the library, it isn't indexed, even after forcing a reindex. When restarting the app, it looks for the pdf.jsonl.gzip file and doesn't find it.
(it does reindex GitHub and Obsidian files).
I've spent some time looking around for the culprit, but I wasn't able to figure it out. My best guess is that it isn't being triggered in routers.indexer.configure_content. The "Search Type" appears to be (when I go to the search page on the webpage) lacking the pdf search type, which might be what is causing the problem, though I'm not sure why.