instructlab / taxonomy

Taxonomy tree that will allow you to create models tuned with your data
Apache License 2.0

Proposal: Change `document` to `source` #661

Open anik120 opened 2 months ago

anik120 commented 2 months ago

Capturing a discussion with @shivchander:

I was writing a test case for the lmdk cli to test the knowledge workflow, and I laid out my qna.yaml as follows:

test_knowledge_valid = b"""created_by: test-bot
seed_examples:
- question: What is Operator Framework? 
  answer: 'The Operator Framework is a set of Kubernetes components and developer tools, 
  that aid in Operator development and central management on a multi-tenant cluster.'
- question: What is an Operator? 
  answer: 'The goal of an Operator is to put operational knowledge into software. 
  Previously this knowledge only resided in the minds of administrators, 
  various combinations of shell scripts or automation software like Ansible. 
  It was outside of your Kubernetes cluster and hard to integrate. 
  With Operators, CoreOS changed that. Operators implement and automate 
  common Day-1 (installation, configuration, etc.) and Day-2 (re-configuration, 
  update, backup, failover, restore, etc.) activities in a piece of software running 
  inside your Kubernetes cluster, by integrating natively with Kubernetes concepts and APIs. 
  We call this a Kubernetes-native application. 
  With Operators you can stop treating an application as a collection of primitives like Pods, 
  Deployments, Services or ConfigMaps, but instead as a single object that only exposes the knobs 
  that make sense for the application.'
- question: What is Operator Lifecycle Manager? 
  answer: 'OLM is a component of the Operator Framework, 
  an open source toolkit to manage Kubernetes native applications, 
  called Operators, in an effective, automated, and scalable way. 
  OLM extends Kubernetes to provide a declarative way to install, 
  manage, and upgrade Operators and their dependencies in a cluster.'
task_description: to teach a large language model about the Operator Framework
document:
  repo: https://github.com/anik120/knowledge-doc-test
  commit: bf78d868f544e55d8e1d99f68d9105fc3b8751bd
  patterns:
  - operator-framework*.md
"""

Essentially, the seed_example question/answers I have there are from the overarching project websites https://operatorframework.io/, https://olm.operatorframework.io/ and https://sdk.operatorframework.io/, while the documents I have in https://github.com/anik120/knowledge-doc-test are README.md files from the components' GitHub repositories. In other words, the seed_example question/answers do not actually come from the documents hosted in document.repo.

The way I laid things out, the seed_examples are "product pitch/summary description" and document.repo contains all the docs I want the model to learn about.

Shiv tells me that's the wrong way of thinking about it: the key document should really be source, and seed_examples are examples of questions/answers that the model can answer once it's been trained on the docs hosted in document.repo.

Eureka moment: Even after learning how the taxonomy interacts with the model (only a little while ago, i.e. fresh info still being processed by my brain), I was thinking about the structure of my qna.yaml the wrong way. It's likely that other users will also confuse the taxonomy/model interactions and lay out their qna.yaml files the wrong way, leading to PR submissions that likely won't improve model quality.

Proposed fix: Change document to source
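
To make the proposal concrete, here is a rough sketch of what the rename could look like on an already-parsed qna.yaml mapping. The helper name rename_document_to_source is hypothetical and not part of the cli or schema; the dict simply mirrors the document block from the example above, and nothing here reflects how the actual schema change would be implemented.

```python
from copy import deepcopy


def rename_document_to_source(qna: dict) -> dict:
    """Return a copy of a parsed qna.yaml mapping with `document` renamed to `source`.

    Hypothetical helper for illustration only; it is not part of any real tool.
    """
    if "document" in qna and "source" in qna:
        # Refuse to guess which of the two blocks should win.
        raise ValueError("qna has both 'document' and 'source' keys")
    migrated = deepcopy(qna)  # leave the caller's mapping untouched
    if "document" in migrated:
        migrated["source"] = migrated.pop("document")
    return migrated


# The knowledge reference from the example above, as it would look once parsed.
old_qna = {
    "created_by": "test-bot",
    "task_description": "to teach a large language model about the Operator Framework",
    "document": {
        "repo": "https://github.com/anik120/knowledge-doc-test",
        "commit": "bf78d868f544e55d8e1d99f68d9105fc3b8751bd",
        "patterns": ["operator-framework*.md"],
    },
}

new_qna = rename_document_to_source(old_qna)
```

After the call, new_qna carries the same repo, commit and patterns under source, while old_qna still has its document key, so the rename is purely cosmetic from the data's point of view.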

cc: @xukai92 @abhi1092 @aldopareja

anik120 commented 2 months ago

Capturing the same comment here too: https://github.com/instruct-lab/cli/pull/776#issuecomment-2040157061

PS: this issue is just a proposal, in the hope that a discussion will ensue about the priority of this work. It's totally reasonable to just ignore it if others don't see it as a high-priority issue, or if the change will take a lot of effort to get through before opening and the team does not have cycles to implement it 😀

bjhargrave commented 2 months ago

This issue should probably be in the https://github.com/instruct-lab/schema/ repo.

bjhargrave commented 2 months ago

@anik120 I think we are past the point at which this change could be made. Perhaps we can close this issue?