Organise the categories into groups

vitorhcl commented 6 months ago

I don't know if @toolleeo has interest in this, but I'll describe my proposal.

TL;DR

There are too many categories now, so I propose adding a "group" column to categories.csv that groups the existing categories, in the index only.

This would make the list much easier to manage and use.

Motivation

Current approach

The tools are organized into one level of hierarchy, in which the categories include multiple tools. An app can only belong to one category.

Pros
- Easy to understand
- Simpler to generate the Markdown
Cons
- When there are too many categories, it becomes quite hard to navigate the list or to search for a specific category
- It's also a bit confusing for newcomers, who can't tell at first glance where to add tools

My proposal

To address the above cons, which are a consequence of the project's growth, I propose adding groups that include multiple categories.

Implementation

IMO it seems relatively easy to implement:

CSV data: Add a "group" column to the categories CSV that contains the group name; each category would belong to exactly one group.
- Note: the apps CSV wouldn't require any modification.
Python rendering: After that, only the index rendering part of the Python script would have to be modified.
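
A minimal sketch of what the index rendering could look like, assuming categories.csv has `label` and `name` columns plus the new `group` column. The column names, the sample rows, the anchor format, and the `render_index` helper are all hypothetical, not taken from the actual script:

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample of categories.csv with the proposed "group" column.
SAMPLE = """label,name,group
audio,Audio,Multimedia
video,Video,Multimedia
todo,Todo managers,Productivity
"""


def render_index(csv_text):
    """Return Markdown index lines with categories nested under their group."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["group"]].append(row)

    lines = []
    # Groups appear in the order they are first seen in the CSV.
    for group, categories in groups.items():
        lines.append(f"- {group}")
        for category in categories:
            lines.append(f"  - [{category['name']}](#{category['label']})")
    return lines


print("\n".join(render_index(SAMPLE)))
```

Only the index changes here; the category sections and the apps CSV would stay as they are.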

Pros and cons

Pros
- A much better organization given how many categories we have at the moment
- Makes it easier for everyone to find and add the desired apps
Cons
- A bit harder to group the categories and maintain that hierarchy, but they don't change that often

IMO the pros of the new approach and the cons of the current approach outweigh, by far, the pros of the current approach and the cons of the new approach.

Groups suggestions

I'm not exactly sure which groups we could use, but some suggestions are:

Multimedia (viewers, video, audio, graphics...)
File handling (deletion, organization, management...)
Productivity (todo managers, time trackers, email...)
System (system tools (the display name could be changed to "Others" while maintaining its internal label), system monitors, process viewers, package managers...)

toolleeo commented 6 months ago

Organizing categories into macro-categories, or "groups", is something that I have been thinking about for a long time, but I have not tackled it yet, as it is a relatively big change in the organization of the data.

I completely agree with your analysis. That's something that will almost certainly happen.

But since it would require some reorganization of the README file too, and of course some changes to the Python script that generates it, it is a change that has to be carefully planned.

One idea would be to start incrementally, by introducing the groups in the CSV and keep using the categories in the README.

In practice, there are a number of decisions to make here :-)

Your suggested groups sound good: thanks for the contribution! However, a partial indication of which categories go into which group is - unfortunately - not sufficient to proceed: every category must be assigned to a group. Unless, again, we decide to proceed incrementally, keeping a group called e.g. "Misc" for all the categories that have not found their place yet.
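
As a rough illustration of the "Misc" idea, the sketch below fills in a fallback group for every category whose group is still empty and reports which ones were affected. The `label`, `name`, and `group` column names, the sample rows, and the `assign_fallback` helper are assumptions made for the example:

```python
import csv
import io

FALLBACK_GROUP = "Misc"  # placeholder group for categories not yet assigned

# Hypothetical sample: "games" has no group yet.
SAMPLE = """label,name,group
audio,Audio,Multimedia
games,Games,
"""


def assign_fallback(csv_text):
    """Return (rows, unassigned_labels), filling empty groups with the fallback."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    unassigned = []
    for row in rows:
        # An empty or missing value means the category has no group yet.
        if not (row.get("group") or "").strip():
            row["group"] = FALLBACK_GROUP
            unassigned.append(row["label"])
    return rows, unassigned


rows, unassigned = assign_fallback(SAMPLE)
print("Still to be assigned:", unassigned)
```

The list of unassigned labels doubles as a to-do list for completing the grouping.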

Although all such "incremental" approaches may sound like a half-baked solution.

Overall, I/we should find enough time to work on it.

vitorhcl commented 6 months ago

@toolleeo Yeah, your plan of incremental changes, adding the column to the CSV first and then updating the Python script, is indeed a good one.

As for the script change, I agree that it will not be exactly trivial to implement.

However, I wouldn't say that it would be hard to do, because what matters most for the script is readability and maintainability.

IMO the performance of the script is not a critical aspect: if the script takes 500 ms or even 1 s to run, that is totally fine, because it only needs to run when the data changes.

toolleeo commented 6 months ago

I definitely agree that the performance of the script is not relevant. Moreover, I do not expect the performance to get that much worse.

The main point is to find a suitable assignment for all, or almost all, the categories.

vitorhcl commented 6 months ago

Definitely, the main problem is the groups. I was going to talk about it, but I was a bit busy. Here we go:

Gradual group assignment

I think it's fine if some categories in the CSV are left with no group for some time. IMO it would look good enough if, in the index, we display the categories without a group outside of any group.
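
A small sketch of that index layout, assuming each category is available as a (name, group) pair where an empty group means "not assigned yet". The `render_partial_index` helper and the sample data are hypothetical:

```python
def render_partial_index(categories):
    """Render index lines; categories with no group are listed at the top level."""
    grouped = {}
    ungrouped = []
    for name, group in categories:
        if group:
            grouped.setdefault(group, []).append(name)
        else:
            ungrouped.append(name)

    lines = []
    for group, names in grouped.items():
        lines.append(f"- {group}")
        lines.extend(f"  - {name}" for name in names)
    # Categories without a group are shown outside of any group, after the groups.
    lines.extend(f"- {name}" for name in ungrouped)
    return lines


sample = [("Audio", "Multimedia"), ("Games", ""), ("Video", "Multimedia")]
print("\n".join(render_partial_index(sample)))
```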

Choosing groups

Nowadays, IMO we don't need to spend so much time thinking about how to group the categories, because we have extremely good LLMs (Large Language Models) such as ChatGPT and even some quite impressive open source ones. OpenAI in particular has one of the largest token datasets, if not the largest one.

Prompting approach

We can take just categories.csv and feed it into an LLM; that is all it needs to help us with this task (or even to improve the existing categories :)).

To create some groups, all we need is good use of the best prompting techniques (if anyone has more knowledge on this, please help me).

From that point, we can adjust the LLM result a little to make the "final" version (sure, it can change later if someone finds an issue or if the categories change).

Techniques

We can combine different techniques to get the exact groups that we want:

Trying different prompts, IMO the most important one in this context
- Helps us to know what works best
- Avoids possible hallucinations by refining the prompt for this purpose
- Makes better use of the probabilistic nature of LLMs, getting different perspectives on the same problem
Being specific and not vague
- It's better to specify how we want the groups rather than leaving it open
  - For example, we can specify a number of groups (like 5, 10, or 15; we have to test what works best) or the grouping criterion, like tasks, similarity, ... or even both a number and a criterion (again, let's test :))
Using the LLM to get better prompts
- For example, it can even generate the initial prompt; we can feed it this comment and other related resources, or just the core information (to be tested)
- It can also give us an idea for the grouping criteria we want based on the different categories
Maybe it can help us to improve both groups and categories, if the LLM used has a large token limit?
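
To make the "being specific" point concrete, here is a sketch that builds a prompt from the categories with an explicit number of groups and grouping criterion. The prompt wording, the `name` and `description` column names, the sample rows, and the `build_grouping_prompt` helper are only assumptions to experiment with; sending the prompt to an LLM is left out on purpose:

```python
import csv
import io

# Hypothetical sample of categories.csv.
SAMPLE = """label,name,description
audio,Audio,Players and audio tools
todo,Todo managers,Task and todo list managers
"""


def build_grouping_prompt(csv_text, n_groups, criterion):
    """Build a prompt asking for a fixed number of groups by a given criterion."""
    rows = csv.DictReader(io.StringIO(csv_text))
    listing = "\n".join(f"- {row['name']}: {row['description']}" for row in rows)
    return (
        "Group the following categories of command-line apps into exactly "
        f"{n_groups} groups, based on {criterion}.\n"
        "Assign every category to exactly one group "
        "and give each group a short name.\n\n"
        f"{listing}"
    )


print(build_grouping_prompt(SAMPLE, n_groups=5, criterion="the task the user wants to do"))
```

Changing `n_groups` and `criterion` is then a cheap way to try different prompts and compare the results.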

Conclusion

What about doing this? In my opinion, LLMs (at least ChatGPT, which is what I often use in my daily life) are currently very good at generating content for more open-ended, less specific tasks.

Although they can make many mistakes on specific tasks like programming and math, for example, their ability to connect distant concepts is quite impressive.

That's why I think they would do an excellent job on this task if we know how to use them.

toolleeo commented 6 months ago

Wow, honestly I admit that I did not think about using an LLM for the automatic generation / suggestion of the groups :-) I was thinking of a manual analysis of the categories to find a suitable grouping, which I think is still a viable option, since the number of categories is not so large: we are not dealing with hundreds or thousands of items. But your proposal is definitely interesting.

It sounds like topic modeling applied to our dataset made up of the categories. IMHO, however, it would require a good and meaningful description of the categories, which should be checked beforehand, otherwise it may be hard for the LLM to understand the "scope" of the task.

Probably a test would not require too much effort.

Do you have a chance to try it out?

BTW, one thing that I thought about in the past was the automatic generation of the descriptions, starting from the README or similar, using an LLM. But this is probably a topic for another thread :-)

vitorhcl commented 6 months ago

> Wow, honestly I admit that I did not think about using an LLM for the automatic generation / suggestion of the groups :-) I was thinking of a manual analysis of the categories to find a suitable grouping, which I think is still a viable option, since the number of categories is not so large: we are not dealing with hundreds or thousands of items. But your proposal is definitely interesting.
>
> It sounds like topic modeling applied to our dataset made up of the categories. IMHO, however, it would require a good and meaningful description of the categories, which should be checked beforehand, otherwise it may be hard for the LLM to understand the "scope" of the task.
>
> Probably a test would not require too much effort.
>
> Do you have a chance to try it out?

I ran a quick test, I think last month, just feeding it the categories, and its response was not bad at all, so I think it will be good :)

But yeah, I agree that, at the moment, it's not really a huge deal, but an LLM can help us with maybe new ideas and other perspectives.

> BTW, one thing that I thought about in the past was the automatic generation of the descriptions, starting from the README or similar, using an LLM. But this is probably a topic for another thread :-)

Wow, that is a good idea for sure. I think a Python script that fetches the README using the GitHub API, or even one that scrapes the GitHub website itself, would work quite well.

Maybe even have an option to pass additional context like documentation or even code.
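
Just to give an idea, a minimal sketch of the README fetching part using only the standard library and the GitHub REST API endpoint for a repository's README. The helper names and the User-Agent value are made up for the example, and authentication, rate limits, and error handling are not covered:

```python
import urllib.request

API_ROOT = "https://api.github.com"


def readme_request(owner, repo):
    """Build a request for a repository's README via the GitHub REST API."""
    return urllib.request.Request(
        f"{API_ROOT}/repos/{owner}/{repo}/readme",
        headers={
            # Ask for the raw file content instead of the JSON envelope.
            "Accept": "application/vnd.github.raw+json",
            "User-Agent": "readme-fetcher-sketch",  # hypothetical client name
        },
    )


def fetch_readme(owner, repo):
    """Download the README text (needs network access)."""
    with urllib.request.urlopen(readme_request(owner, repo), timeout=10) as response:
        return response.read().decode("utf-8")
```

The returned text could then be passed to the LLM together with any additional context.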

We definitely have to discuss this :rocket:

vitorhcl commented 6 months ago

> Do you have a chance to try it out?

I'll try it out in a few days or maybe hours; let's see what we get.

toolleeo / awesome-cli-apps-in-a-csv