Open vitorhcl opened 6 months ago
Organizing categories into macro-categories, or "groups", is something I have been thinking about for a long time, but I have not tackled it yet because it is a relatively big change in how the data is organized.
I completely agree with your analysis. It will almost certainly happen.
But since it would require some reorganization of the README file too, and of course some changes to the Python script that generates it, it is a change that has to be carefully planned.
One idea would be to start incrementally, by introducing the groups in the CSV while still using the categories in the README.
In practice, there are a number of decisions to make here :-)
Your suggested groups sound good: thanks for the contribution! However, a partial indication of which categories go into which group is - unfortunately - not sufficient to proceed: every category must be assigned to a group. Unless, again, we decide to proceed incrementally, keeping a group called e.g. "Misc" for all the categories that have not found their place yet.
Although all such "incremental" approaches may sound like half-baked solutions.
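To make the incremental idea a bit more concrete, here is a minimal sketch of how a script could read an optional group column and fall back to "Misc". The column names `name`, `description` and `group` are assumptions for illustration; the real layout of categories.csv may differ.

```python
import csv
import io
from collections import defaultdict

FALLBACK_GROUP = "Misc"  # catch-all for categories with no group yet


def group_categories(csv_text, fallback=FALLBACK_GROUP):
    """Map each group name to the list of category names it contains.

    Rows with an empty or missing 'group' value (hypothetical column)
    end up in the fallback group.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        group = (row.get("group") or "").strip() or fallback
        groups[group].append(row["name"])
    return dict(groups)


# Made-up sample data, only to show the shape of the input.
SAMPLE = """name,description,group
Git,Tools for working with git,Development
Docker,Container helpers,Development
Weather,Weather forecasts,
"""

for group, names in group_categories(SAMPLE).items():
    print(group, "->", ", ".join(names))
```

Because a missing column is treated the same as an empty value, the same function would also work on the CSV before any group is added: everything would simply land in "Misc".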
Overall, I/we should find enough time to work on it.
@toolleeo Yeah, your plan of incremental changes, first adding a column to the CSV and then updating the Python script, is indeed a good one.
As for the script change, I agree that it will not be exactly trivial to implement.
However, I wouldn't say that it would be hard to do, because the main priorities for the script are readability and maintainability.
IMO the performance of the script is not a critical aspect. If the script takes 500ms or even 1s to run, that is totally fine, because it only needs to be run when the data changes.
I definitely agree that the performance of the script is not relevant. Moreover, I do not expect the performance to get noticeably worse.
The main point is to find a suitable assignment for all, or almost all, of the categories.
Definitely, the main problem is the groups. I was going to talk about it, but I was a bit busy. Here we go:
I think it's fine if some categories in the CSV are left with no group for some time. IMO it would look good enough if, in the index, we display the categories that have no group at the top level, outside of any group.
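Just to illustrate what I mean, here is a rough sketch of an index where ungrouped categories sit at the top level next to the groups. The nested markdown list and the `(name, group)` pairs are assumptions of mine, not how the current script works.

```python
def render_index(categories):
    """Render a nested markdown index.

    categories: iterable of (name, group) pairs, where group may be empty.
    Grouped categories are nested under their group; the others are
    listed at the top level, after the groups.
    """
    grouped = {}
    ungrouped = []
    for name, group in categories:
        if group:
            grouped.setdefault(group, []).append(name)
        else:
            ungrouped.append(name)

    lines = []
    for group in sorted(grouped):
        lines.append(f"- {group}")
        lines.extend(f"  - {name}" for name in sorted(grouped[group]))
    lines.extend(f"- {name}" for name in sorted(ungrouped))
    return "\n".join(lines)


# Made-up sample data.
print(render_index([("Git", "Development"), ("Weather", ""), ("Docker", "Development")]))
```

Keeping the ungrouped entries in the same list means the index stays usable while the group assignment is still incomplete.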
Nowadays, IMO we don't need to spend so much time thinking about which groups to use, because we have extremely good LLMs (Large Language Models) such as ChatGPT, which as far as I know is trained on one of the largest token datasets, if not the largest, and even some quite impressive open source ones.
We can just take categories.csv and feed it into an LLM; that is all it needs to help us with this task (or even to improve the existing categories :)).
To create some groups, all we need is good use of the best prompting techniques (if anyone has more knowledge on this, please help me).
From that point, we can just adjust the LLM result a little bit to make the "final" version (sure, it can change later if someone finds an issue or if the categories change).
We can combine different techniques to get the exact groups that we want:
What about doing this? In my opinion, LLMs (at least ChatGPT, which is what I often use in my daily life) are currently very good at generating content for open-ended, non-specific tasks.
Although they can make many mistakes on specific tasks like programming and math, for example, their ability to connect distant concepts is quite impressive.
That's why I think they would do an excellent job on this task if we know how to use them.
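As a starting point, something like the sketch below could build the prompt from categories.csv and check the answer afterwards. It does not call any LLM; the column names `name` and `description`, the prompt wording and the `category -> group` reply format are all assumptions that we would need to tune.

```python
import csv
import io


def build_grouping_prompt(csv_text, max_groups=10):
    """Return (prompt, category_names) for a categories CSV given as text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    names = [row["name"] for row in rows]
    listing = "\n".join(f"- {row['name']}: {row['description']}" for row in rows)
    prompt = (
        f"Here are {len(rows)} categories of tools, one per line.\n"
        f"Assign every category to exactly one of at most {max_groups} groups.\n"
        "Reply only with lines of the form: category -> group\n\n"
        f"{listing}\n"
    )
    return prompt, names


def parse_assignments(reply):
    """Parse 'category -> group' lines from a reply, ignoring anything else."""
    assignments = {}
    for line in reply.splitlines():
        category, sep, group = line.partition("->")
        if sep and category.strip() and group.strip():
            assignments[category.strip()] = group.strip()
    return assignments


def unassigned(names, assignments):
    """List the categories the reply did not place in any group."""
    return [name for name in names if name not in assignments]
```

The `unassigned` check matters because the reply may silently skip some categories, and those would still need a manual decision.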
Wow, honestly I admit that I did not think about using an LLM for the automatic generation / suggestion of the groups :-) I was thinking of a manual analysis of the categories to find a suitable grouping, which I think is still a viable option, since the number of categories is not so large: we are not dealing with hundreds or thousands of items. But your proposal is definitely interesting.
It sounds like topic modeling applied to the dataset formed by our categories. IMHO, however, it would require a good and meaningful description of each category, which should be checked beforehand; otherwise it may be hard for the LLM to understand the "scope" of the task.
A test would probably not require too much effort.
Have you had a chance to try it out?
BTW, one thing that I thought about in the past was the automatic generation of the descriptions, starting from the README or similar, using an LLM. But this is probably a topic for another thread :-)
I made a quick test, I think last month, just feeding it the categories, and its response was not bad at all, so I think it will work well :)
But yeah, I agree that, at the moment, it's not really a huge deal, but an LLM can maybe help us with new ideas and other perspectives.
> BTW, one thing that I thought about in the past was the automatic generation of the descriptions, starting from the README or similar, using an LLM. But this is probably a topic for another thread :-)
Wow, that is a good idea for sure. I think a Python script that fetches the README through the GitHub API, or even one that reads it from the GitHub website itself, would work quite well.
Maybe it could even have an option to pass additional context like documentation or even code.
We definitely have to discuss this :rocket:
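For the README part, a minimal sketch using only the standard library could look like this. It assumes the GitHub REST API "get a repository README" endpoint, which returns JSON with the file content base64-encoded; the function names and the User-Agent string are made up, and authentication and rate limits are left out.

```python
import base64
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_readme_request(owner, repo):
    """Build the request for the README endpoint of a repository."""
    url = f"{API_ROOT}/repos/{owner}/{repo}/readme"
    headers = {
        "Accept": "application/vnd.github+json",
        "User-Agent": "readme-fetch-sketch",  # hypothetical client name
    }
    return urllib.request.Request(url, headers=headers)


def decode_readme_payload(payload):
    """Extract the README text from the decoded JSON response."""
    if payload.get("encoding") != "base64":
        raise ValueError(f"unexpected encoding: {payload.get('encoding')!r}")
    return base64.b64decode(payload["content"]).decode("utf-8")


def fetch_readme(owner, repo, timeout=10):
    """Download and decode a repository README (performs a network call)."""
    request = build_readme_request(owner, repo)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return decode_readme_payload(json.load(response))
```

The text returned by `fetch_readme` could then be passed to the LLM together with any extra context, such as documentation.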
> Have you had a chance to try it out?
I'll try it out in a few days, or maybe hours; let's see what we get.
I don't know if @toolleeo is interested in this, but I'll describe my proposal.
TL;DR
There are too many categories now, so I propose adding a "group" column in categories.csv to group the existing categories, just in the index. This would make the list much easier to manage and use.
Motivation
Current approach
The tools are organized into one level of hierarchy, in which the categories include multiple tools. An app can only belong to one category.
My proposal
To address the above cons, which are a consequence of the project's growth, I propose adding groups that include multiple categories.
Implementation
IMO it seems relatively easy to implement:
Pros and cons
IMO the pros of the new approach and the cons of the current approach outweigh, by far, the pros of the current approach and the cons of the new approach.
Groups suggestions
I'm not exactly sure which groups we could use, but some suggestions are: