protontypes / open-sustainable-technology

A directory and analysis of the open source ecosystem in the areas of climate change, sustainable energy, biodiversity and natural resources.
https://opensustain.tech
Creative Commons Attribution 4.0 International

Improve Clustering of Projects #271

Open Ly0n opened 6 months ago

Ly0n commented 6 months ago

It has been shown that a finer and more detailed structure of the subject areas helps many users of the database to find suitable projects. Therefore, any clustering of projects by topics is welcome as long as the clusters are not too small.

lappemic commented 2 months ago

Hey @Ly0n, I came across the website and its accompanying repo this morning while browsing for open source projects in sustainability. 😄 I struggled with exactly this point in the project classification. I think you have a really valuable directory here, and clustering it appropriately would make it much more accessible. So I let ChatGPT cluster it, and this is what it came up with. I think it is well organized thanks to the additional cluster level. I also let it sort the list alphabetically, which in my opinion helps people find what they are looking for faster.

Energy Technologies

Environmental Management and Conservation

Knowledge Sharing and Data Access

Sustainable Development and Infrastructure

Water, Land, and Air Management

What do you think? I would be happy to implement this if you think it might be helpful to you.

PS. You have probably thought about this already, but a searchable tag system would be really helpful as well.

Ly0n commented 2 months ago

@lappemic Thank you for taking up this topic. It is true that there is a lot of potential in the clustering, the tagging and the presentation of the projects. The clustering that ChatGPT has done is not bad, but some things are strange, such as splitting climate into "Natural hazards and environmental policy" and "Climate and atmosphere". The assignments for hydrosphere and hydrology are also wrong. I see your comment more as an impetus for a discussion on how to sort this better in the future.

In the past, the idea was to cluster the projects based on their READMEs or their one-liners. This would give the LLM more contextual information. Do you think this is easy to implement? It would be great to have someone to collaborate with on this.

lappemic commented 2 months ago

Thanks @Ly0n for the feedback and the opportunity to collaborate on this. Yeah, there are indeed some misclassifications. I thought it was a good starting point for discussion but did not put in much time to properly check every entry, sorry about that.

I like the idea of using the README or the one-liner of each project to give it context for better clustering. I just tried the dumb method and pasted everything at once: obviously the context window is too short 😅 But we could chop the list up and paste it in parts, or do it programmatically via API calls, sending one project at a time and updating e.g. a JSON file. I think it would take ~1 hour via the chat interface and ~3 hours programmatically. As I said, I would like to support you here. What would you suggest?

Ly0n commented 2 months ago

Something programmatic is definitely something we are looking for because we are doing science on the metadata. The data processing should be reproducible so that we can repeat it from time to time. You can find a CSV file of all the READMEs here: https://github.com/protontypes/AwesomeCure/blob/main/csv/projects_with_readme.csv
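A minimal sketch of what such a repeatable pass could look like, one project per call with the result written as JSON. The column names `project_name` and `readme` are assumptions (check the header of the CSV), and `keyword_classify` is only a hypothetical stand-in for whatever model or LLM call ends up being used:

```python
import csv
import io
import json

# Assumed column names; the real CSV header may differ.
NAME_COL = "project_name"
README_COL = "readme"


def keyword_classify(text):
    """Hypothetical stand-in for an LLM call: one topic label per README."""
    lowered = text.lower()
    if "solar" in lowered or "wind" in lowered:
        return "Energy"
    if "river" in lowered or "groundwater" in lowered:
        return "Hydrosphere"
    return "Unclassified"


def label_projects(csv_file, classify=keyword_classify):
    """Classify one project per call and collect the labels in a dict."""
    labels = {}
    for row in csv.DictReader(csv_file):
        labels[row[NAME_COL]] = classify(row[README_COL] or "")
    return labels


def to_json(labels):
    """Serialize with sorted keys so repeated runs give identical output."""
    return json.dumps(labels, indent=2, sort_keys=True)


# Made-up sample rows standing in for the real file.
sample = io.StringIO(
    "project_name,readme\n"
    "pv-sim,Simulates solar panels\n"
    "flow-net,Models river discharge\n"
)
result = label_projects(sample)
print(to_json(result))
```

For the real file, `sample` would be replaced by `open(path, newline="", encoding="utf-8")`. Very long README fields may also require raising the limit with `csv.field_size_limit`.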

Creating a consistent set of labels / tags or just some statistics about topics would be super awesome.
If you have some initial code snippets just let me know and I can jump in so we can hack together! :dancing_men:

lappemic commented 2 months ago

Hey @Ly0n, very sorry for the late answer! 🙈

Unfortunately I do not have any snippets at hand, or a proper idea of how to take this further without a bigger effort at the moment. I will keep this in mind and come back as soon as I have more capacity! Or if you have something concrete to develop or want a short sync, just ping me. I am open to suggestions as well! :)

Ly0n commented 1 month ago

Without deeper NLP experience it is quite difficult to approach this topic programmatically. You could also manually separate different topics like energy systems into more subtopics.

What I can also offer is support in getting started with NLP. Even some very simple statistics about the wording in the projects' READMEs and descriptions could help us a lot.
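A rough sketch of such a wording statistic, counting in how many READMEs each word occurs. The sample texts and the tiny stop-word list are made up for illustration, and the function names are hypothetical:

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real run would need a fuller one.
STOP_WORDS = {"and", "for", "the", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words of 3+ letters."""
    words = re.findall(r"[a-z]{3,}", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def document_frequency(texts):
    """Count in how many texts each word appears at least once."""
    counts = Counter()
    for text in texts:
        counts.update(set(tokenize(text)))
    return counts


# Made-up one-liners standing in for real READMEs.
readmes = [
    "Open source model for wind energy forecasting",
    "Energy system optimization with open data",
    "Toolkit for river and groundwater modelling",
]
df = document_frequency(readmes)
print(df.most_common(5))
```

Words that occur in many projects hint at candidate topics, while words that occur in only a few can point to subtopics worth splitting out.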

Some ideas and code snippets on how to get started can be found here: https://github.com/protontypes/open-sustainable-technology/issues/145

If you want to chat about this in person please contact me at tobias.augspurger@protontypes.eu.

lappemic commented 1 month ago

Just sent you an email.