instructlab / dev-docs

Developer documents for the InstructLab organization
Apache License 2.0

Proposed Taxonomy tree #46

Closed: jjasghar closed this 6 months ago

jjasghar commented 6 months ago

Creating a logical layout of the taxonomy tree needs to be agreed upon. The triage team has landed on the Wikipedia tree, and this PR is to help justify and enforce this decision.

jjasghar commented 6 months ago

> Are there docs on this anywhere I'm missing?

Maybe @mairin may know, but from what I understand it just existed when we started. I don't know of any design docs, hence our reason for trying to find some logic to this.

bjhargrave commented 6 months ago

So is this for knowledge only? Wikipedia can be thought of as a source of knowledge and it has a taxonomy for its knowledge. But I am not sure the Wikipedia taxonomy has much applicability to compositional skills.

jjasghar commented 6 months ago

> So is this for knowledge only? Wikipedia can be thought of as a source of knowledge and it has a taxonomy for its knowledge.

I would assume so: anything under /knowledge. It does raise the question of the /skills tree, but I don't know of anything that reflects a taxonomy tree for skills.

bjhargrave commented 6 months ago

> I would assume so, anything under /knowledge

OK, we should then make it clear in the title of this PR and the text that the proposal applies to the knowledge part of the taxonomy.

jjasghar commented 6 months ago

Updated per BJ's request for clarification.

mairin commented 6 months ago

> Maybe @mairin may know, but from what I understand it just existed when we started. I don't know of any design docs, hence our reason for trying to find some logic to this.

@shivchander Does this proposal seem reasonable to you?

shivchander commented 6 months ago

> So before agreeing to "sure the Wikipedia hierarchy makes sense," I'd love to understand some corresponding structure to keep a limited set of people focused on an achievable outcome. Are there docs on this anywhere I'm missing?

So we started targeting the domains which improve MMLU, and we have this list https://github.com/instructlab/taxonomy/blob/main/knowledge/knowledge_domains.md - this could serve as a starting point to accept contributions

I like the idea of this PR and the one from Ming (https://github.com/instructlab/taxonomy/pull/780), both serve as a good way to organize the knowledge tree.

obuzek commented 6 months ago

I think especially since the initial knowledge contributions are from Wikipedia, this is as good a format as any. I wouldn't be surprised if we need to deviate from this down the line - but to @russellb's point I'm expecting domains that are more highly specific to a use case to end up in either another top-level directory or a different repository.

Examples of documents that might cause us to reconsider this structure: policy documents, documentation, contracts, legal case filings, sales copy. Those wouldn't natively fit in a Wikipedia-style structure.

For now this will work.

jjasghar commented 6 months ago

> Should we also include skills, or is it sufficient to handle knowledge and skills separately?

I think we should figure out compositional_skills/ differently. It will be significantly more subjective, which will require real conversations. Knowledge, on the other hand, is much easier to justify if we have an agreed-upon template/format.

I have no idea (yet) how skills will develop, but we should absolutely start researching "skills trees" in this space.

hickeyma commented 6 months ago

Some topical input from InstructLab slack:

"Question: can a knowledge contribution be from a source other than Wikipedia? Context: the AI Alliance is creating a set of reference implementations/use cases, and one of the suggested reference implementations is a legal chatbot (based on InstructLab) that answers questions about the GDPR (General Data Protection Regulation). We would like to contribute the text of the law as knowledge to InstructLab. Would that be an acceptable contribution?"

and:

"Is it possible to have a new domain if it's not listed in the current ones? https://github.com/instructlab/taxonomy/blob/main/knowledge/knowledge_domains.md"

jjasghar commented 6 months ago

I believe the plan is eventually, but until that is announced only Wikipedia is accepted.

@lhawthorn should we get a standard blurb together about accepting things other than Wikipedia and the expectation of when it can happen?

lhawthorn commented 6 months ago

@jjasghar We should absolutely do so.

@obuzek I note your comment "Examples of documents that might cause us to reconsider this structure: policy documents, documentation, contracts, legal case filings, sales copy. Those wouldn't natively fit in a Wikipedia-style structure."

I have heard from two different people who are interested in teaching InstructLab about legal texts (e.g. GDPR regulation) and about software CVE information (which I, perhaps naively, think of as documentation).

If we went with the Wikipedia structure for taxonomy, how would we accommodate these use cases?

I may not understand the problem space well enough, but appreciate the opportunity to be better educated.

obuzek commented 6 months ago

@lhawthorn This is pure speculation, but I almost wonder if there's not a need for a "foundational knowledge" tree that's different based on the type of data it is. CVE info would live most happily in a CVE-specific organizational taxonomy, and really if one of these is relevant to your use case, you'd want to have the ability to filter by document type.

So maybe that calls for a high-level folder within knowledge labeled wikipedia, so that we have room to expand later.

Also I appreciate that you mentioned CVE info because I was very close to going back and editing my first message to add that exact case 😄. (Also journal articles, news, first person accounts ...)
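To make the "room to expand" idea concrete, here is a minimal sketch, not anything that exists in the repo. The folder names (`wikipedia`, `cve`), the example paths, and both helper functions are assumptions made up for illustration. It only shows how a document-type folder directly under `knowledge` would let someone filter by type.

```python
from pathlib import PurePosixPath

# Hypothetical leaf paths; every folder and file name here is illustrative only.
PATHS = [
    "knowledge/wikipedia/science/astronomy/qna.yaml",
    "knowledge/wikipedia/history/rome/qna.yaml",
    "knowledge/cve/2024/example/qna.yaml",
]


def document_type(path: str) -> str:
    """Return the folder directly under knowledge/, treated here as the document type."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3 or parts[0] != "knowledge":
        raise ValueError(f"not a knowledge leaf path: {path}")
    return parts[1]


def filter_by_type(paths, doc_type):
    """Keep only the paths whose document-type folder matches doc_type."""
    return [p for p in paths if document_type(p) == doc_type]


if __name__ == "__main__":
    for path in filter_by_type(PATHS, "cve"):
        print(path)
```

With this shape, the Wikipedia-style hierarchy stays intact below `knowledge/wikipedia`, and a CVE-specific hierarchy could be organized however suits that data.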

obuzek commented 6 months ago

@bjhargrave I thought the prompt for SDG was still including the folder path. Can you confirm?

lhawthorn commented 6 months ago

TIL I could quote reply in GH Issues. Yay me!

@obuzek I do believe we should plan for domain-specific taxonomies. (In fact, I know there is an open issue suggesting the same somewhere else, but my search skills to find it are currently failing me.)

I absolutely envision a future where people will want domain specific taxonomies; perhaps we would be able to offer smaller footprint models based off these domain specific taxonomies at some point. We should plan for that future.

hickeyma commented 6 months ago

Yesterday it was confirmed by @shivchander that the folder structure is not relevant to the InstructLab SDG/training processes. It is for humans to organize the taxonomy. So using Wikipedia as the knowledge taxonomy organizing principle is as good as any choice.

Thanks @bjhargrave for that feedback. Based on this and the need to come up with a standard to start with, I am ok to approve with a view to extending in the future.
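Taking the statement above at face value (the folder structure is only for humans), extending the tree later could amount to a path rewrite. The sketch below is a hedged illustration under that assumption: `reroot_under` is a hypothetical helper, the `wikipedia` parent folder is the idea floated earlier in this thread rather than an agreed layout, and the example path is made up.

```python
from pathlib import PurePosixPath


def reroot_under(path: str, parent: str = "wikipedia") -> str:
    """Move a knowledge path under knowledge/<parent>/, leaving the rest unchanged.

    Hypothetical helper: assumes nothing else depends on the old location.
    """
    parts = PurePosixPath(path).parts
    if not parts or parts[0] != "knowledge":
        raise ValueError(f"not under knowledge/: {path}")
    if len(parts) > 1 and parts[1] == parent:
        return path  # already under the new parent, so the rewrite is idempotent
    return str(PurePosixPath("knowledge", parent, *parts[1:]))


if __name__ == "__main__":
    print(reroot_under("knowledge/science/astronomy/qna.yaml"))
```

If it turns out that any tooling does read the folder path, a move like this would need more care than a rename.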

hickeyma commented 6 months ago

@russellb Are you ok to move forward with this PR as a start to bedding down standards for the taxonomy tree?

russellb commented 6 months ago

> @russellb Are you ok to move forward with this PR as a start to bedding down standards for the taxonomy tree?

Yes, to be clear my review was not a "-1", just a comment. Don't block on me. (I try to always use "Request Changes" in a review to indicate when I want to block on changes I'm asking for).