khoj-ai / khoj

Your AI second brain. Get answers to your questions, whether they be online or in your own notes. Use online AI models (e.g. GPT-4) or private, local LLMs (e.g. Llama 3). Self-host locally or use our cloud instance. Access from Obsidian, Emacs, Desktop app, Web or WhatsApp.
https://khoj.dev
GNU Affero General Public License v3.0

1000-file limit? #573

Closed sobjornstad closed 8 months ago

sobjornstad commented 9 months ago

I tried to add my knowledge base (composed of text files) to Khoj to try it out, but I get an ECONNRESET error when I try to save the folder and upload the files:

data: { detail: 'Too many files. Maximum number of files is 1000.' }

I have about 4,200 small files in my knowledge base, which doesn't seem like a particularly unreasonable number to me.

I tried to install the self-hosted version assuming this was a limitation of the free plan, but this doesn't appear to be the case, as it doesn't work there either. I can't find reference to this error message in the code. What is this limit about? Is it something imposed by Axios?

sobjornstad commented 9 months ago

Also, I just removed all files from the settings, restarted Khoj, and tried to upload a folder that contains exactly 451 files, and I'm getting the same error about there being more than 1,000 files. :confused: A folder with 150 files in it worked, though.

/home/soren/cabinet/Me/Records/zettelkasten/zk-wiki/output/textual_tiddlers/content-3 is a directory.
Pushing data to Khoj at:  2023-11-28T19:33:17.223Z
AxiosError: Request failed with status code 400
    at settle (/tmp/.mount_Khoj-1Ir6nY1/resources/app.asar/node_modules/axios/dist/node/axios.cjs:1967:12)
    at IncomingMessage.handleStreamEnd (/tmp/.mount_Khoj-1Ir6nY1/resources/app.asar/node_modules/axios/dist/node/axios.cjs:3062:11)
    at IncomingMessage.emit (node:events:525:35)
    at endReadableNT (node:internal/streams/readable:1359:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  code: 'ERR_BAD_REQUEST',
  config: {
    transitional: {
      silentJSONParsing: true,
      forcedJSONParsing: true,
      clarifyTimeoutError: false
    },
    adapter: [ 'xhr', 'http' ],
    transformRequest: [ [Function: transformRequest] ],
    transformResponse: [ [Function: transformResponse] ],
    timeout: 0,
    xsrfCookieName: 'XSRF-TOKEN',
    xsrfHeaderName: 'X-XSRF-TOKEN',
    maxContentLength: -1,
    maxBodyLength: -1,
    env: { FormData: [Function], Blob: [class Blob] },
    validateStatus: [Function: validateStatus],
    headers: Object [AxiosHeaders] {
      Accept: 'application/json, text/plain, */*',
      'Content-Type': 'multipart/form-data; boundary=axios-1.6.2-boundary-y0YjPn_vkwvE7iR_rqXweGNFY',
      Authorization: 'Bearer <REDACTED>',
      'User-Agent': 'axios/1.6.2',
      'Content-Length': '2597569',
      'Accept-Encoding': 'gzip, compress, deflate, br'
    },
    method: 'post',
    url: 'http://127.0.0.1:42110/api/v1/index/update?force=true&client=desktop',
    data: FormData { [Symbol(state)]: [Array] }
  },
  request: <ref *1> ClientRequest {
    _events: [Object: null prototype] {
      abort: [Function (anonymous)],
      aborted: [Function (anonymous)],
      connect: [Function (anonymous)],
      error: [Function (anonymous)],
      socket: [Function (anonymous)],
      timeout: [Function (anonymous)],
      finish: [Function: requestOnFinish]
    },
    _eventsCount: 7,
    _maxListeners: undefined,
    outputData: [],
    outputSize: 0,
    writable: true,
    destroyed: false,
    _last: true,
    chunkedEncoding: false,
    shouldKeepAlive: false,
    maxRequestsOnConnectionReached: false,
    _defaultKeepAlive: true,
    useChunkedEncodingByDefault: true,
    sendDate: false,
    _removedConnection: false,
    _removedContLen: false,
    _removedTE: false,
    strictContentLength: false,
    _contentLength: '2597569',
    _hasBody: true,
    _trailer: '',
    finished: true,
    _headerSent: true,
    _closed: false,
    socket: Socket {
      connecting: false,
      _hadError: false,
      _parent: null,
      _host: null,
      _closeAfterHandlingError: false,
      _readableState: [ReadableState],
      _events: [Object: null prototype],
      _eventsCount: 7,
      _maxListeners: undefined,
      _writableState: [WritableState],
      allowHalfOpen: false,
      _sockname: null,
      _pendingData: null,
      _pendingEncoding: '',
      server: null,
      _server: null,
      parser: null,
      _httpMessage: [Circular *1],
      [Symbol(async_id_symbol)]: 1463,
      [Symbol(kHandle)]: [TCP],
      [Symbol(lastWriteQueueSize)]: 0,
      [Symbol(timeout)]: null,
      [Symbol(kBuffer)]: null,
      [Symbol(kBufferCb)]: null,
      [Symbol(kBufferGen)]: null,
      [Symbol(kCapture)]: false,
      [Symbol(kSetNoDelay)]: true,
      [Symbol(kSetKeepAlive)]: true,
      [Symbol(kSetKeepAliveInitialDelay)]: 60,
      [Symbol(kBytesRead)]: 0,
      [Symbol(kBytesWritten)]: 0
    },
    _header: 'POST /api/v1/index/update?force=true&client=desktop HTTP/1.1\r\n' +
      'Accept: application/json, text/plain, */*\r\n' +
      'Content-Type: multipart/form-data; boundary=axios-1.6.2-boundary-y0YjPn_vkwvE7iR_rqXweGNFY\r\n' +
      'Authorization: Bearer <REDACTED>\r\n' +
      'User-Agent: axios/1.6.2\r\n' +
      'Content-Length: 2597569\r\n' +
      'Accept-Encoding: gzip, compress, deflate, br\r\n' +
      'Host: 127.0.0.1:42110\r\n' +
      'Connection: close\r\n' +
      '\r\n',
    _keepAliveTimeout: 0,
    _onPendingData: [Function: nop],
    agent: Agent {
      _events: [Object: null prototype],
      _eventsCount: 2,
      _maxListeners: undefined,
      defaultPort: 80,
      protocol: 'http:',
      options: [Object: null prototype],
      requests: [Object: null prototype] {},
      sockets: [Object: null prototype],
      freeSockets: [Object: null prototype] {},
      keepAliveMsecs: 1000,
      keepAlive: false,
      maxSockets: Infinity,
      maxFreeSockets: 256,
      scheduling: 'lifo',
      maxTotalSockets: Infinity,
      totalSocketCount: 1,
      [Symbol(kCapture)]: false
    },
    socketPath: undefined,
    method: 'POST',
    maxHeaderSize: undefined,
    insecureHTTPParser: undefined,
    joinDuplicateHeaders: undefined,
    path: '/api/v1/index/update?force=true&client=desktop',
    _ended: true,
    res: IncomingMessage {
      _readableState: [ReadableState],
      _events: [Object: null prototype],
      _eventsCount: 4,
      _maxListeners: undefined,
      socket: [Socket],
      httpVersionMajor: 1,
      httpVersionMinor: 1,
      httpVersion: '1.1',
      complete: true,
      rawHeaders: [Array],
      rawTrailers: [],
      joinDuplicateHeaders: undefined,
      aborted: false,
      upgrade: false,
      url: '',
      method: null,
      statusCode: 400,
      statusMessage: 'Bad Request',
      client: [Socket],
      _consuming: false,
      _dumped: false,
      req: [Circular *1],
      responseUrl: 'http://127.0.0.1:42110/api/v1/index/update?force=true&client=desktop',
      redirects: [],
      [Symbol(kCapture)]: false,
      [Symbol(kHeaders)]: [Object],
      [Symbol(kHeadersCount)]: 10,
      [Symbol(kTrailers)]: null,
      [Symbol(kTrailersCount)]: 0
    },
    aborted: false,
    timeoutCb: null,
    upgradeOrConnect: false,
    parser: null,
    maxHeadersCount: null,
    reusedSocket: false,
    host: '127.0.0.1',
    protocol: 'http:',
    _redirectable: Writable {
      _writableState: [WritableState],
      _events: [Object: null prototype],
      _eventsCount: 6,
      _maxListeners: undefined,
      _options: [Object],
      _ended: true,
      _ending: true,
      _redirectCount: 0,
      _redirects: [],
      _requestBodyLength: 2597569,
      _requestBodyBuffers: [],
      _onNativeResponse: [Function (anonymous)],
      _currentRequest: [Circular *1],
      _currentUrl: 'http://127.0.0.1:42110/api/v1/index/update?force=true&client=desktop',
      [Symbol(kCapture)]: false
    },
    [Symbol(kCapture)]: false,
    [Symbol(kBytesWritten)]: 0,
    [Symbol(kEndCalled)]: true,
    [Symbol(kNeedDrain)]: true,
    [Symbol(corked)]: 0,
    [Symbol(kOutHeaders)]: [Object: null prototype] {
      accept: [Array],
      'content-type': [Array],
      authorization: [Array],
      'user-agent': [Array],
      'content-length': [Array],
      'accept-encoding': [Array],
      host: [Array]
    },
    [Symbol(errored)]: null,
    [Symbol(kUniqueHeaders)]: null
  },
  response: {
    status: 400,
    statusText: 'Bad Request',
    headers: Object [AxiosHeaders] {
      date: 'Tue, 28 Nov 2023 19:33:16 GMT',
      server: 'uvicorn',
      'content-length': '61',
      'content-type': 'application/json',
      connection: 'close'
    },
    config: {
      transitional: [Object],
      adapter: [Array],
      transformRequest: [Array],
      transformResponse: [Array],
      timeout: 0,
      xsrfCookieName: 'XSRF-TOKEN',
      xsrfHeaderName: 'X-XSRF-TOKEN',
      maxContentLength: -1,
      maxBodyLength: -1,
      env: [Object],
      validateStatus: [Function: validateStatus],
      headers: [Object [AxiosHeaders]],
      method: 'post',
      url: 'http://127.0.0.1:42110/api/v1/index/update?force=true&client=desktop',
      data: [FormData]
    },
    request: <ref *1> ClientRequest {
      _events: [Object: null prototype],
      _eventsCount: 7,
      _maxListeners: undefined,
      outputData: [],
      outputSize: 0,
      writable: true,
      destroyed: false,
      _last: true,
      chunkedEncoding: false,
      shouldKeepAlive: false,
      maxRequestsOnConnectionReached: false,
      _defaultKeepAlive: true,
      useChunkedEncodingByDefault: true,
      sendDate: false,
      _removedConnection: false,
      _removedContLen: false,
      _removedTE: false,
      strictContentLength: false,
      _contentLength: '2597569',
      _hasBody: true,
      _trailer: '',
      finished: true,
      _headerSent: true,
      _closed: false,
      socket: [Socket],
      _header: 'POST /api/v1/index/update?force=true&client=desktop HTTP/1.1\r\n' +
        'Accept: application/json, text/plain, */*\r\n' +
        'Content-Type: multipart/form-data; boundary=axios-1.6.2-boundary-y0YjPn_vkwvE7iR_rqXweGNFY\r\n' +
        'Authorization: Bearer <REDACTED>\r\n' +
        'User-Agent: axios/1.6.2\r\n' +
        'Content-Length: 2597569\r\n' +
        'Accept-Encoding: gzip, compress, deflate, br\r\n' +
        'Host: 127.0.0.1:42110\r\n' +
        'Connection: close\r\n' +
        '\r\n',
      _keepAliveTimeout: 0,
      _onPendingData: [Function: nop],
      agent: [Agent],
      socketPath: undefined,
      method: 'POST',
      maxHeaderSize: undefined,
      insecureHTTPParser: undefined,
      joinDuplicateHeaders: undefined,
      path: '/api/v1/index/update?force=true&client=desktop',
      _ended: true,
      res: [IncomingMessage],
      aborted: false,
      timeoutCb: null,
      upgradeOrConnect: false,
      parser: null,
      maxHeadersCount: null,
      reusedSocket: false,
      host: '127.0.0.1',
      protocol: 'http:',
      _redirectable: [Writable],
      [Symbol(kCapture)]: false,
      [Symbol(kBytesWritten)]: 0,
      [Symbol(kEndCalled)]: true,
      [Symbol(kNeedDrain)]: true,
      [Symbol(corked)]: 0,
      [Symbol(kOutHeaders)]: [Object: null prototype],
      [Symbol(errored)]: null,
      [Symbol(kUniqueHeaders)]: null
    },
    data: { detail: 'Too many files. Maximum number of files is 1000.' }
  }
}
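
For readers skimming the dump above, the fields that matter are `response.status` (400), the `server: 'uvicorn'` response header, and `response.data.detail`. Together they suggest the message was sent back by the Khoj server rather than generated by Axios. Below is a minimal sketch of pulling those fields out of a caught error. `summarizeUploadError` is a hypothetical helper, not part of Khoj or Axios, and it assumes only the `response` shape visible in the log.

```javascript
// Hypothetical helper: condense an Axios-style error into one line.
// Assumes only the `response` shape visible in the log above.
function summarizeUploadError(err) {
  const res = err && err.response;
  if (!res) {
    // No HTTP response was received (for example, a connection reset).
    return `No response from server: ${(err && err.code) || 'unknown error'}`;
  }
  // The server in the log puts its explanation in a JSON `detail` field.
  const detail =
    res.data && typeof res.data.detail === 'string'
      ? res.data.detail
      : '(no detail in body)';
  const server = (res.headers && res.headers.server) || 'unknown server';
  return `HTTP ${res.status} from ${server}: ${detail}`;
}
```

Called from a `catch` block around the upload, this would reduce the dump above to a single line carrying the status, the server name, and the "Too many files" message.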

sabaimran commented 8 months ago

Hi @sobjornstad, sorry for the unpleasant error!

It looks like you're using the Khoj desktop app to send the files to the server. Thanks for stress-testing like this and documenting the behavior! This is a limitation of our API framework, FastAPI, which caps the number of files accepted in a single request at 1000. Relevant discussion here: https://github.com/tiangolo/fastapi/discussions/9634.

I'll try to see if there's a more substantial fix we can provide for this, but the temporary workaround would be to try indexing the data incrementally (e.g., select some of the files you want to index, index them, and continue until you're done).

timrogers commented 8 months ago

I'm experiencing this issue too when trying to set up Obsidian with Khoj for the first time 👀

sabaimran commented 8 months ago

@debanjum is working on a client-side workaround to chunk the file upload in groups of 1000. Thanks for y'all's patience!
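
The chunking idea mentioned above can be sketched roughly as follows. This is not the actual Khoj client change. `chunk` and `uploadInBatches` are hypothetical names, `sendBatch` stands in for whatever function posts one multipart request, and the default batch size of 1000 is taken from the error message.

```javascript
// Split an array into consecutive groups of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const groups = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

// Upload files one batch at a time so no single request exceeds the limit.
// `sendBatch` is a caller-supplied async function (an assumption here).
async function uploadInBatches(files, sendBatch, batchSize = 1000) {
  let sent = 0;
  for (const batch of chunk(files, batchSize)) {
    await sendBatch(batch); // sequential, so a failure stops later batches
    sent += batch.length;
  }
  return sent;
}
```

With the roughly 4,200 files from the original report and a batch size of 1000, this would issue five requests. Since a 451-file folder was also rejected in the report above, a smaller `batchSize` may be needed in practice.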