instructlab / taxonomy

Taxonomy tree that will allow you to create models tuned with your data
Apache License 2.0

Taxonomy reorg per dewey decimal classifications #1215

Closed mcorbin-ibm closed 1 month ago

mcorbin-ibm commented 3 months ago

Reorganized the taxonomy domains and subdomains to align with the Dewey Decimal Classifications

instruct-lab-bot[bot] commented 3 months ago

Beep, boop 🤖, Hi, I'm @instructlab-bot and I'm going to help you with your pull request. Thanks for your contribution! 🎉

I support the following commands:

Note: Currently, only maintainers belonging to the following teams are allowed to run these commands: taxonomy-triagers, taxonomy-approvers, taxonomy-maintainers, labrador-org-maintainers, instruct-lab-bot-maintainers.

mcorbin-ibm commented 3 months ago

@bjhargrave

If we could, I would prefer to keep the compositional_skills and knowledge folders free of readme files. That is, any change in these trees is a contribution to the taxonomy rather than a doc improvement. The current readme in knowledge is annoying in this way :-)

Do you mean just the main parent folders (knowledge, compositional_skills, and foundational_skills)? OR, do you not want readme files in the domain/subdomain folders?

We do have a docs folder, where we could put most of the info for the repo, to keep the taxonomy tree folders clear?

I do still need to "document" the taxonomy tree -- which I was going to do in readme.txt files for the domains/subdomains??

bjhargrave commented 3 months ago

Do you mean just the main parent folders (knowledge, compositional_skills, and foundational_skills)? OR, do you not want readme files in the domain/subdomain folders?

My request would be to not have readme files anywhere under the taxonomy folders (knowledge, compositional_skills, and foundational_skills).

I do still need to "document" the taxonomy tree -- which I was going to do in readme.txt files for the domains/subdomains??

Agreed, but I would rather see that all together in a single readme file, since that would give the reader a broad view of the taxonomy organization rather than having them walk around the tree and occasionally encounter readme files.
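For anyone who wants to check a local checkout against this request, here is a minimal sketch that lists readme files under the three taxonomy folders. The function name `find_readmes` and the rule "any file whose base name is readme, in any case" are assumptions for illustration, not part of the repo's tooling.

```python
from pathlib import Path

# The three taxonomy folders named in the comment above.
TAXONOMY_FOLDERS = ("knowledge", "compositional_skills", "foundational_skills")


def find_readmes(repo_root):
    """Return sorted repo-relative paths of readme files under the taxonomy folders.

    Hypothetical helper: a file counts as a readme if its name without the
    extension is "readme" in any letter case (README.md, readme.txt, ...).
    """
    root = Path(repo_root)
    hits = []
    for folder in TAXONOMY_FOLDERS:
        base = root / folder
        if not base.is_dir():
            continue  # folder missing in this checkout; nothing to scan
        for path in base.rglob("*"):
            if path.is_file() and path.stem.lower() == "readme":
                hits.append(path.relative_to(root).as_posix())
    return sorted(hits)


if __name__ == "__main__":
    for hit in find_readmes("."):
        print(hit)
```

Run from the repo root, it prints one path per offending file. Readme files outside the three folders (for example in docs) are ignored.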

mcorbin-ibm commented 3 months ago

@jjasghar I have removed the readme files and updated the main repo's readme file for these changes. Some additional changes may still be needed: verifying whether each qna.yaml file is "grounded" or "ungrounded" and moving it into the matching subfolder. And, when we do start merging other knowledge contributions, we need to remember to add the document_type as the final node in the tree. I couldn't quickly find any qna.yaml files to do that with. Please review my latest change here, and I'll let you do the honors of removing DRAFT! :)

jjasghar commented 3 months ago

@bjhargrave can you confirm that https://github.com/instructlab/taxonomy/actions/runs/9751722026/job/26913897281?pr=1215 is the same thing as the /files directory?

makelinux commented 2 months ago

I recognize that unambiguous classification is an extremely complex task. Here are some of my thoughts:

mcorbin-ibm commented 2 months ago

@makelinux

I recognize that unambiguous classification is an extremely complex task.

Indeed! :)

  • Often, a topic belongs to multiple categories. For example, an electric battery can be classified under chemistry from a production standpoint, physics based on its function, and electronics based on its usage.

Yes, you can always classify topics into different categories, and I think much will depend upon the specific knowledge being submitted. If the knowledge is talking about its function, then classifying it under physics might be best, but if the knowledge is talking about where batteries are used, then maybe it belongs in technology/electronics. I'm not sure that there is a way around this; we will just have to make a judgement call as to where a piece of knowledge belongs.

  • In classification, it is essential to determine which categories are top-level and which are subcategories.
  • Looking at the top categories of the Dewey Decimal Classification (DDC), in my opinion, 'science' is merely a form of knowledge and should not be a top-level category.
  • Generally, it seems to me that DDC is more suited to a scientific and academic perspective on printed books and less applicable to all forms of knowledge.

The DDC has been around for nearly 150 years, and is in its 23rd edition. It has 10 main classes, subdivided into 100 divisions, which are subdivided again into 1,000 sections. Please see this summaries doc: https://www.oclc.org/content/dam/oclc/dewey/resources/summaries/deweysummaries.pdf. It is meant to be a standard classification of all forms of knowledge. We will certainly run into some cases where we will have to find a "best fit" for a piece of knowledge, but starting from the top 10 categories and their subcategories seemed to present the best starting point.

As a secondary source to help us with classification of knowledge, we can look to Wikipedia to see how/where they placed things or to get additional ideas: https://en.wikipedia.org/wiki/Wikipedia:Contents.

The InstructLab taxonomy will not be a direct 1:1 mapping of the DDC; rather, the DDC is the starting point for finding a best fit for the topics of knowledge.

  • Consider trying to classify cooking and culinary in DDC. I couldn’t find a suitable category. Can you? Arts and recreation?

For cooking and "culinary arts" I would put them in technology/food_and_drink.
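As a rough illustration of the three-level DDC numbering mentioned above, here is a small sketch that splits a three-digit number into its main class, division, and section. The helper names are hypothetical, the main class labels are written from the DDC summaries as I understand them, and 641 is used on the assumption that it is the section covering food and drink.

```python
# Main class labels, as I understand them from the DDC summaries.
DDC_MAIN_CLASSES = {
    0: "Computer science, information & general works",
    100: "Philosophy & psychology",
    200: "Religion",
    300: "Social sciences",
    400: "Language",
    500: "Science",
    600: "Technology",
    700: "Arts & recreation",
    800: "Literature",
    900: "History & geography",
}


def ddc_levels(number):
    """Split a DDC number (0-999) into (main_class, division, section)."""
    if not 0 <= number <= 999:
        raise ValueError("DDC numbers at this level run from 000 to 999")
    main_class = number // 100 * 100  # hundreds digit, e.g. 641 -> 600
    division = number // 10 * 10      # hundreds and tens, e.g. 641 -> 640
    return main_class, division, number


def format_ddc(number):
    """Render a DDC number with leading zeros, e.g. 5 -> '005'."""
    return f"{number:03d}"


if __name__ == "__main__":
    main_class, division, section = ddc_levels(641)
    print(format_ddc(main_class), DDC_MAIN_CLASSES[main_class])
    print(format_ddc(division), format_ddc(section))
```

Under these assumptions, 641 falls in the 600 main class, which is why cooking would land under technology rather than arts and recreation.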

github-actions[bot] commented 2 months ago

This pull request has been automatically marked as stale because it has not had activity within 15 days. It will be automatically closed if no further activity occurs within 31 days.

bjhargrave commented 1 month ago

I have force pushed changes which include making .gitignore files empty.

Please don't push any merge commits to the PR.

jjasghar commented 1 month ago

I think it's ready to merge!

https://www.youtube.com/watch?v=NHiUQb5xg7A