gptscript-ai / knowledge

Knowledge for GPTScript
https://gptscript-ai.github.io/knowledge/
Apache License 2.0

feat: ingestion - include metadata from .knowledge.json on dir level #124

Closed iwilltry42 closed 1 month ago

iwilltry42 commented 2 months ago

Ref #118

When ingesting a file or directory (recursively or not), we now check whether a .knowledge.json file is present in the directory. It's structured like this:

{
  "metadata": {
    "foo.pdf": {
      "baz": "bom",
      "foo": "bar"
    },
    "somedir/bar.pdf": {
      "x": "y"
    }
  }
}

This adds the defined key/value pairs as metadata to the corresponding documents in the vector store. .knowledge.json files in nested directories are merged (with override) with the metadata files of their parent directories.
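
To illustrate the merge behavior described above, here is a minimal Python sketch. It is not the code from this PR. It assumes that keys in a .knowledge.json file are paths relative to the directory that contains the file (as "somedir/bar.pdf" in the example suggests), and that "with override" means values from a deeper directory win over values from a parent. The names `load_dir_metadata` and `metadata_for` are hypothetical.

```python
import json
from pathlib import Path

METADATA_FILENAME = ".knowledge.json"


def load_dir_metadata(directory: Path) -> dict[str, dict]:
    """Return {resolved file path: metadata} declared by one directory's .knowledge.json."""
    meta_file = directory / METADATA_FILENAME
    if not meta_file.is_file():
        return {}
    entries = json.loads(meta_file.read_text(encoding="utf-8")).get("metadata", {})
    # Assumption: keys are paths relative to the directory holding the file.
    return {str((directory / rel).resolve()): dict(kv) for rel, kv in entries.items()}


def metadata_for(file_path: Path, root: Path) -> dict:
    """Merge the metadata for file_path from root down to the file's own directory."""
    file_path, root = file_path.resolve(), root.resolve()
    # Build the chain of directories from root to the file's parent, outermost first.
    dirs = [root]
    for part in file_path.parent.relative_to(root).parts:
        dirs.append(dirs[-1] / part)
    merged: dict = {}
    for d in dirs:
        # Assumption: entries from deeper directories override parent entries per key.
        merged.update(load_dir_metadata(d).get(str(file_path), {}))
    return merged
```

Under these assumptions, a call such as `metadata_for(Path("docs/somedir/bar.pdf"), Path("docs"))` would combine the entry for "somedir/bar.pdf" in docs/.knowledge.json with the entry for "bar.pdf" in docs/somedir/.knowledge.json, if both exist.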

Notes

  1. I went with .knowledge.json instead of .metadata.json because I felt like the latter could be too "common" and we'd run into conflicts. By default, we're including hidden files in the ingestion process, so .knowledge.json is not explicitly being ignored.
  2. It's JSON with an explicit metadata entry so that we can add further fields for new features in the future, e.g. directory content descriptions, which can be merged with dataset metadata for routing retrieval.
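
The second note can be sketched as follows: because the per-file metadata lives under its own "metadata" key, a reader can pick out that entry and carry any other top-level fields along untouched. This is a Python illustration only, not the project's parser. `KnowledgeFile`, `parse_knowledge_file`, and the "description" field in the usage line are hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class KnowledgeFile:
    # Per-file key/value pairs, keyed by file path.
    metadata: dict = field(default_factory=dict)
    # Any other top-level fields, kept as-is for future features.
    extras: dict = field(default_factory=dict)


def parse_knowledge_file(text: str) -> KnowledgeFile:
    """Split a .knowledge.json document into its metadata entry and everything else."""
    raw = json.loads(text)
    metadata = raw.pop("metadata", {})
    return KnowledgeFile(metadata=metadata, extras=raw)
```

For example, `parse_knowledge_file('{"metadata": {"foo.pdf": {"foo": "bar"}}, "description": "Invoices"}')` keeps the per-file metadata separate from the hypothetical "description" field, so adding such a field later would not change how existing metadata is read.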