instructlab / sdg

Python library for Synthetic Data Generation
https://pypi.org/project/instructlab-sdg/
Apache License 2.0

Add a schema for validating pipeline configuration #131

Closed: russellb closed this issue 4 months ago

russellb commented 4 months ago

When we load a taxonomy YAML file, we validate its contents against a schema contained in instructlab.schema. Here is an example schema:

https://github.com/instructlab/schema/blob/main/src/instructlab/schema/v2/knowledge.json

There's some code for loading this schema and validating with it in instructlab.sdg.utils.taxonomy.

It would be nice if we did something similar when we load a pipeline YAML file. It can help catch subtle mistakes, such as a missing key or a value of the wrong type. We could also provide instructions for how to check a configuration against the schema manually before trying to run it.

Here is a start at what a schema could look like for pipeline configs (auto-generated, not tested yet).

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "version": {
      "type": "string"
    },
    "blocks": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string"
          },
          "type": {
            "type": "string"
          },
          "config": {
            "type": "object",
            "properties": {
              "config_path": {
                "type": "string"
              },
              "output_cols": {
                "type": "array",
                "items": {
                  "type": "string"
                }
              }
            },
            "required": ["config_path", "output_cols"]
          },
          "gen_kwargs": {
            "type": "object",
            "properties": {
              "temperature": {
                "type": "number"
              },
              "max_tokens": {
                "type": "integer"
              },
              "n": {
                "type": "integer"
              }
            }
          },
          "drop_duplicates": {
            "type": "array",
            "items": {
              "type": "string"
            }
          }
        },
        "required": ["name", "type", "config"]
      }
    }
  },
  "required": ["version", "blocks"]
}
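
To illustrate the kind of mistake the draft schema above could catch, here is a minimal, standard-library-only sketch. It is not how instructlab.sdg validates anything: it interprets only the four keywords the draft uses ("type", "properties", "required", "items"), whereas a real implementation would presumably rely on a full JSON Schema validator, as the taxonomy code does. The names `PIPELINE_SCHEMA`, `find_errors`, and `bad_config`, and the block name, block type, and path inside the sample config, are all hypothetical.

```python
# Minimal sketch: handles only "type", "properties", "required" and "items",
# the keywords used by the draft schema in this issue. Not a full validator.

# The draft schema from this issue, written compactly as a Python dict.
STRING = {"type": "string"}
STRING_LIST = {"type": "array", "items": STRING}
PIPELINE_SCHEMA = {
    "type": "object",
    "properties": {
        "version": STRING,
        "blocks": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": STRING,
                    "type": STRING,
                    "config": {
                        "type": "object",
                        "properties": {
                            "config_path": STRING,
                            "output_cols": STRING_LIST,
                        },
                        "required": ["config_path", "output_cols"],
                    },
                    "gen_kwargs": {
                        "type": "object",
                        "properties": {
                            "temperature": {"type": "number"},
                            "max_tokens": {"type": "integer"},
                            "n": {"type": "integer"},
                        },
                    },
                    "drop_duplicates": STRING_LIST,
                },
                "required": ["name", "type", "config"],
            },
        },
    },
    "required": ["version", "blocks"],
}


def _matches_type(value, type_name):
    """Return True if value has the given JSON Schema type (bool is not a number)."""
    if type_name == "object":
        return isinstance(value, dict)
    if type_name == "array":
        return isinstance(value, list)
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "integer":
        # Simplification: only Python ints count; a float such as 2.0 does not.
        return isinstance(value, int) and not isinstance(value, bool)
    if type_name == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    raise ValueError(f"unsupported type keyword: {type_name}")


def find_errors(instance, schema, path="$"):
    """Collect a human-readable message for every violation found."""
    expected = schema.get("type")
    if expected is not None and not _matches_type(instance, expected):
        return [f"{path}: expected {expected}, got {type(instance).__name__}"]
    errors = []
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}: missing required property '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(find_errors(instance[key], subschema, f"{path}.{key}"))
    elif isinstance(instance, list) and "items" in schema:
        for index, item in enumerate(instance):
            errors.extend(find_errors(item, schema["items"], f"{path}[{index}]"))
    return errors


# Hypothetical pipeline config, as it might look after parsing the YAML file.
bad_config = {
    "version": "1.0",
    "blocks": [
        {
            "name": "gen_questions",
            "type": "LLMBlock",
            "config": {"config_path": "configs/questions.yaml"},  # output_cols missing
            "gen_kwargs": {"max_tokens": "2048"},  # string instead of integer
        }
    ],
}

for message in find_errors(bad_config, PIPELINE_SCHEMA):
    print(message)
```

Run as written, this reports the missing `output_cols` key under `$.blocks[0].config` and the string-valued `max_tokens` under `$.blocks[0].gen_kwargs`, which are the sort of subtle mistakes that would otherwise only surface when the pipeline runs.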

russellb commented 4 months ago

I can't say I love how verbose the schema file is...

One thing I have in mind for this is automating the validation of custom pipeline configs in other repos. Those are not as easy to test when code in this repo changes, so I'm hoping this can help catch accidental compatibility breakages: changes that merge into the tree without affecting the in-tree configs, but that might break custom configs elsewhere.