Evaluation of Per-Taxonomy-Leaf Performance

Performance should be evaluated on each taxonomy leaf, since each leaf node in the taxonomy represents one particular skill or body of knowledge (for example, the Complete Common Expression QNA leaf).

Performance metrics should be selected or crafted for this purpose. A basic metric is correctness: tracking, over time, how many questions from a leaf's qna yaml files the model answers correctly (for some definition of correctness).

For metric selection and detailed evaluation rules, existing benchmarks such as MMLU and GLUE are useful references.
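The correctness metric described above could be sketched as follows. This is a minimal illustration, not the project's actual evaluation code: the inline Q/A pairs stand in for data that would really be parsed from a leaf's qna yaml file, and `stub_model` is a hypothetical placeholder for a call to the model under evaluation.

```python
# Minimal sketch of a per-leaf correctness metric.
# Assumptions (not from the source): the Q/A data shape, the exact-match
# scoring rule, and the stub_model function are all illustrative.

def correctness(pairs, model_answer):
    """Fraction of questions answered correctly.

    Uses normalized exact match as the definition of correctness; a real
    evaluation would likely use a more forgiving comparison.
    """
    if not pairs:
        return 0.0
    correct = sum(
        1 for question, expected in pairs
        if model_answer(question).strip().lower() == expected.strip().lower()
    )
    return correct / len(pairs)

# Illustrative Q/A pairs, as they might appear in one leaf's qna yaml file.
leaf_pairs = [
    ("What does 'break a leg' mean?", "good luck"),
    ("What does 'piece of cake' mean?", "something easy"),
]

# Hypothetical model stub: answers the first question correctly,
# the second incorrectly.
def stub_model(question):
    return "good luck" if "break a leg" in question else "very hard"

print(correctness(leaf_pairs, stub_model))  # 0.5
```

Recomputing this score after each training run would give the over-time tracking mentioned above, one score per taxonomy leaf.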