khoj-ai / khoj

Your AI second brain. Self-hostable. Get answers from the web or your docs. Build custom agents, schedule automations, do deep research. Turn any online or local LLM into your personal, autonomous AI (e.g. gpt, claude, gemini, llama, qwen, mistral).
https://khoj.dev
GNU Affero General Public License v3.0

[FIX] Not all files in a folder are being indexed #696

Closed: teun95 closed this issue 7 months ago

teun95 commented 7 months ago

Describe the bug

When adding a folder to the desktop application for indexing, only two files are indexed even though there are hundreds of markdown files in the folder.

To Reproduce

Add a folder with markdown files to Khoj desktop and click save. Then inspect the settings of the Khoj server webpage and click on "update" under files to view the incomplete list of files.
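To compare the list shown on the server settings page with what is actually on disk, the markdown files in the folder can be counted with a short script. This is a minimal sketch and not part of Khoj: the function name `list_markdown_files` is hypothetical, and it assumes the files to be indexed are those with a `.md` extension anywhere under the folder.

```python
from pathlib import Path


def list_markdown_files(folder: str) -> list[Path]:
    """Return every .md file under `folder`, including subfolders.

    Hypothetical helper for checking an index by hand; not a Khoj API.
    """
    root = Path(folder)
    # rglob walks the directory tree recursively; sorting gives a stable order.
    return sorted(p for p in root.rglob("*.md") if p.is_file())
```

Calling `len(list_markdown_files(folder))` on the folder that was added in Khoj desktop gives the number of files to expect in the indexed list.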

Platform

If self-hosted

Additional context

PS C:\Users\beetle\AppData\Local\Programs\Khoj> .\Khoj.exe --trace-warnings
PS C:\Users\beetle\AppData\Local\Programs\Khoj>
ElectronStore {
  _deserialize: [Function (anonymous)],
  _serialize: [Function (anonymous)],
  events: EventEmitter {
    _events: [Object: null prototype] {},
    _eventsCount: 0,
    _maxListeners: undefined,
    [Symbol(kCapture)]: false
  },
  path: 'C:\\Users\\beetle\\AppData\\Roaming\\Khoj\\config.json'
}
Add %2220. Reducing waste of defects (Poka Yoke, Standard%22.md in C:\Users\beetle\Documents\knowLodge\pages for indexing
Add %226. relationship agile, scrum, kaizen, lean to six%22.md in C:\Users\beetle\Documents\knowLodge\pages for indexing
(node:11300) ExperimentalWarning: buffer.File is an experimental feature and might change at any time
    at emitExperimentalWarning (node:internal/util:238:11)
    at new File (node:internal/file:40:5)
    at makeEntry (node:internal/deps/undici/undici:4935:44)
    at _FormData.append (node:internal/deps/undici/undici:4827:23)
    at C:\Users\beetle\AppData\Local\Programs\Khoj\resources\app.asar\main.js:201:55
    at Array.forEach (<anonymous>)
    at pushDataToKhoj (C:\Users\beetle\AppData\Local\Programs\Khoj\resources\app.asar\main.js:201:24)
    at Object.<anonymous> (C:\Users\beetle\AppData\Local\Programs\Khoj\resources\app.asar\main.js:238:1)
    at Module._compile (node:internal/modules/cjs/loader:1271:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1326:10)
Add %2220. Reducing waste of defects (Poka Yoke, Standard%22.md in C:\Users\beetle\Documents\knowLodge\pages for indexing
Add %226. relationship agile, scrum, kaizen, lean to six%22.md in C:\Users\beetle\Documents\knowLodge\pages for indexing
Pushing data to Khoj at:  2024-04-11T14:45:14.260Z
I:\cloud\GoogleDrive\PDF
I:\cloud\GoogleDrive\PDF is a directory.
Add %2220. Reducing waste of defects (Poka Yoke, Standard%22.md in C:\Users\beetle\Documents\knowLodge\pages for indexing
Add %226. relationship agile, scrum, kaizen, lean to six%22.md in C:\Users\beetle\Documents\knowLodge\pages for indexing
server-1    | [14:45:51.051996] INFO     📬 Updating content index via API call  indexer.py:68
server-1    |                            by desktop client
server-1    | [14:45:51.053964] INFO     💎 Setting up search for markdown      indexer.py:210
server-1    |                            notes
server-1    | [14:45:51.055291] DEBUG    Converted 2 markdown       markdown_to_entries.py:127
server-1    |                            entries to dictionaries
server-1    | [14:45:51.056420] DEBUG    Parse entries from Markdown files into helpers.py:157
server-1    |                            dictionaries: 0.001 seconds
server-1    | [14:45:51.057441] DEBUG    Split entries by max token size        helpers.py:157
server-1    |                            supported by model: 0.000 seconds
Hashing Entries: 100%|██████████| 2/2 [00:00<00:00, 64527.75it/s]
server-1    | [14:45:51.058711] DEBUG    Constructed current entry hashes in:   helpers.py:157
server-1    |                            0.000 seconds
server-1    | [14:45:51.059680] DEBUG    Deleting all entries for file  text_to_entries.py:105
server-1    |                            type markdown
server-1    | [14:45:51.118946] DEBUG    Cleared existing dataset for           helpers.py:157
server-1    |                            regeneration in: 0.059 seconds
Identify new entries: 100%|██████████| 2/2 [00:00<00:00, 1040.00it/s]
server-1    | [14:45:51.122545] DEBUG    Identified entries to add to database  helpers.py:157
server-1    |                            in: 0.002 seconds
Batches: 100%|██████████| 1/1 [00:00<00:00, 10.05it/s]
server-1    | [14:45:51.227608] DEBUG    Generated embeddings for entries to    helpers.py:157
server-1    |                            add to database in: 0.104 seconds
Add entries to database: 1it [00:00, 25.34it/s]
server-1    | [14:45:51.268867] DEBUG    Added 2 markdown entries to    text_to_entries.py:158
server-1    |                            database
server-1    | [14:45:51.270265] DEBUG    Added entries to database in: 0.041    helpers.py:157
server-1    |                            seconds
server-1    | [14:45:51.271363] DEBUG    Indexed 0 dates from added     text_to_entries.py:170
server-1    |                            markdown entries
server-1    | [14:45:51.272295] DEBUG    Indexed dates from added entries in:   helpers.py:157
server-1    |                            0.001 seconds
server-1    | [14:45:51.275642] DEBUG    Deleted entries identified by server   helpers.py:157
server-1    |                            from database in: 0.002 seconds
server-1    | [14:45:51.276720] DEBUG    Deleted entries requested by clients   helpers.py:157
server-1    |                            from database in: 0.000 seconds
server-1    | [14:45:51.277717] DEBUG    Identify new or updated entries: 0.219 helpers.py:157
server-1    |                            seconds
server-1    | [14:45:51.278926] INFO     Deleted 2 entries. Created 2 new   text_search.py:218
server-1    |                            entries for user default from
server-1    |                            files ['%2220. Reducing waste of
server-1    |                            defects (Poka Yoke,
server-1    |                            Standard%22.md', '%226.
server-1    |                            relationship agile, scrum, kaizen,
server-1    |                            lean to six%22.md'] ...

sabaimran commented 7 months ago

Hi! We also realised this bug was popping up in the directory indexing flow. @debanjum has a fix on the way which we'll release either today or tomorrow.

debanjum commented 7 months ago

Thanks for raising this issue, @teun95. I've just pushed a new release with the fix for it in https://github.com/khoj-ai/khoj/commit/f040418cf1a795de5ed6929376c90009a45d429f. Please upgrade your Khoj desktop app to the latest version, 1.10.x, and see if you can index all the markdown files in your folder.

Edit: It seems there was an issue with the desktop build that's preventing it from being downloaded. Working on debugging this!

teun95 commented 7 months ago

1.10.2 fixed the issue. Thanks!