Add a schema for validating pipeline configuration

When we load a taxonomy YAML file, we validate its contents against a schema provided by instructlab.schema. Here is an example schema:

https://github.com/instructlab/schema/blob/main/src/instructlab/schema/v2/knowledge.json

The code that loads this schema and validates against it lives in instructlab.sdg.utils.taxonomy.

It would be nice to do something similar when we load a pipeline YAML file, since schema validation can help catch subtle mistakes. We could also provide instructions for checking a configuration against the schema manually before trying to run it.

Here is a starting point for what a pipeline config schema could look like (auto-generated, not tested yet):

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "version": {
      "type": "string"
    },
    "blocks": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string"
          },
          "type": {
            "type": "string"
          },
          "config": {
            "type": "object",
            "properties": {
              "config_path": {
                "type": "string"
              },
              "output_cols": {
                "type": "array",
                "items": {
                  "type": "string"
                }
              }
            },
            "required": ["config_path", "output_cols"]
          },
          "gen_kwargs": {
            "type": "object",
            "properties": {
              "temperature": {
                "type": "number"
              },
              "max_tokens": {
                "type": "integer"
              },
              "n": {
                "type": "integer"
              }
            }
          },
          "drop_duplicates": {
            "type": "array",
            "items": {
              "type": "string"
            }
          }
        },
        "required": ["name", "type", "config"]
      }
    }
  },
  "required": ["version", "blocks"]
}
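
As a rough illustration of how a manual check might work, here is a minimal sketch that validates a pipeline YAML string against a trimmed copy of the draft schema above. It assumes the `jsonschema` and PyYAML packages are available. The function name `validate_pipeline_yaml`, the sample block name, and the sample config path are hypothetical and only serve the example; this is not the existing code in instructlab.sdg.utils.taxonomy.

```python
import yaml  # PyYAML
from jsonschema import Draft7Validator

# Trimmed copy of the draft schema from this issue (gen_kwargs and
# drop_duplicates omitted for brevity). The final schema may differ.
PIPELINE_SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "version": {"type": "string"},
        "blocks": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "type": {"type": "string"},
                    "config": {
                        "type": "object",
                        "properties": {
                            "config_path": {"type": "string"},
                            "output_cols": {
                                "type": "array",
                                "items": {"type": "string"},
                            },
                        },
                        "required": ["config_path", "output_cols"],
                    },
                },
                "required": ["name", "type", "config"],
            },
        },
    },
    "required": ["version", "blocks"],
}


def validate_pipeline_yaml(text: str) -> list[str]:
    """Hypothetical helper: return readable schema errors, or [] if valid."""
    config = yaml.safe_load(text)
    validator = Draft7Validator(PIPELINE_SCHEMA)
    errors = sorted(
        validator.iter_errors(config),
        key=lambda e: [str(p) for p in e.absolute_path],
    )
    messages = []
    for err in errors:
        # Build a slash-separated location such as "blocks/0".
        location = "/".join(str(p) for p in err.absolute_path) or "<root>"
        messages.append(f"{location}: {err.message}")
    return messages


# Illustrative inputs; the names and paths are made up for this sketch.
GOOD = """
version: "1.0"
blocks:
  - name: gen_questions
    type: LLMBlock
    config:
      config_path: configs/example.yaml
      output_cols: [question]
"""

# Two subtle mistakes: an unquoted version (parsed as a float) and a
# block with no "type" key.
BAD = """
version: 1.0
blocks:
  - name: gen_questions
    config:
      config_path: configs/example.yaml
      output_cols: [question]
"""

if __name__ == "__main__":
    for label, text in (("GOOD", GOOD), ("BAD", BAD)):
        problems = validate_pipeline_yaml(text)
        print(label, "is valid" if not problems else problems)
```

Collecting every error with `iter_errors` instead of stopping at the first failure lets a user fix a whole config in one pass, which seems to fit the "check it manually before running" use case.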
