feat: ingestion - include metadata from .knowledge.json on dir level

Ref #118

When ingesting a file or directory (recursively or not), we now check whether a .knowledge.json file is present in the directory. It's structured like this:

{
  "metadata": {
    "foo.pdf": {
      "baz": "bom",
      "foo": "bar"
    },
    "somedir/bar.pdf": {
      "x": "y"
    }
  }
}

This adds the defined key/value pairs as metadata to the corresponding documents in the vector store. .knowledge.json files in nested directories are merged (with override) with the metadata files of their parent directories.
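To illustrate the merge behaviour described above, here is a minimal Python sketch. It is not the code from this PR. The helper names (`load_metadata`, `metadata_for`) are hypothetical, and it assumes that keys are paths relative to the directory containing the .knowledge.json file and that "with override" means entries from nested directories take precedence over those from parent directories.

```python
import json
from pathlib import Path

METADATA_FILE = ".knowledge.json"


def load_metadata(directory: Path) -> dict:
    """Return the "metadata" mapping from directory/.knowledge.json, or {} if absent."""
    path = directory / METADATA_FILE
    if not path.is_file():
        return {}
    with path.open(encoding="utf-8") as fh:
        return json.load(fh).get("metadata", {})


def metadata_for(file_path: Path, root: Path) -> dict:
    """Merge the metadata for file_path, walking from root down to its directory.

    Assumption: deeper .knowledge.json files override values from parent ones.
    """
    # Collect every directory between root and the file's own directory.
    dirs = [root]
    current = root
    for part in file_path.relative_to(root).parent.parts:
        current = current / part
        dirs.append(current)

    merged: dict = {}
    for directory in dirs:
        # Assumption: keys are POSIX-style paths relative to this directory.
        key = file_path.relative_to(directory).as_posix()
        merged.update(load_metadata(directory).get(key, {}))
    return merged
```

With the example file above at the ingestion root, `metadata_for(root / "somedir" / "bar.pdf", root)` would pick up the `"somedir/bar.pdf"` entry, and a `"bar.pdf"` entry in `somedir/.knowledge.json` would override matching keys under the assumed precedence.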

Notes

I went with .knowledge.json instead of .metadata.json because I felt the latter could be too "common" and we'd run into conflicts. By default, hidden files are included in the ingestion process, so .knowledge.json is not explicitly ignored.
It's JSON with an explicit metadata entry so we can add further fields for new features in the future, e.g. directory content descriptions, which can be merged with dataset metadata for routing retrieval.

gptscript-ai / knowledge

feat: ingestion - include metadata from .knowledge.json on dir level #124
