protontypes / AwesomeCure

Analyze and cure awesome lists by collecting, processing and presenting data from listed Git projects.

Extract geographic coverage via NLP #20

Open Ly0n opened 2 years ago

Ly0n commented 2 years ago

Identifying the geographical coverage of the models and data behind the projects could be very interesting for detecting areas without coverage. It could also help people find projects for a specific geographical area they are interested in.

KKulma commented 2 years ago

Agree, this would be valuable information. You mentioned NLP; do you have an idea where to get this information from? Any reliable/consistent source?

Ly0n commented 2 years ago

I have never worked with NLP but did some investigation in the past. This framework could be useful for many applications to extract more data from the projects: https://github.com/RaRe-Technologies/gensim

In the website repo we also have an issue discussing this problem: https://github.com/protontypes/open-sustainable-technology/issues/110

In my view, a first step to get started with NLP would be to create the missing topic labels for the projects. For this, one could use the READMEs of the projects that already have topics as training data. For about 50% of the projects the topics are missing and could be added to the database in this way.

This would be a clear improvement to the database, would enable much better searches, and would also be very interesting for the analysis.
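
A minimal sketch of what that supervised step could look like, assuming a CSV with `readme` and `topics` columns (the file name and column names are placeholders, not the actual AwesomeCure schema):

```python
# Sketch: train a multi-label classifier on READMEs of projects that already
# have topics, then predict topics for projects where they are missing.
# File name and column names ("readme", "topics") are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.read_csv("projects.csv")
labeled = df[df["topics"].notna()]
unlabeled = df[df["topics"].isna()]

# Turn comma-separated topic strings into a binary label matrix.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labeled["topics"].str.split(","))

# TF-IDF bag-of-words features from the README text.
vectorizer = TfidfVectorizer(max_features=20000, stop_words="english")
X = vectorizer.fit_transform(labeled["readme"].fillna(""))

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y)

# Predict topics for the projects that are missing them.
X_new = vectorizer.transform(unlabeled["readme"].fillna(""))
predicted_topics = mlb.inverse_transform(clf.predict(X_new))
```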

KKulma commented 2 years ago

I think there are several approaches we can consider here. {gensim} uses a pretty simple bag-of-words approach for topic modelling (unsupervised ML) and this method can be effective but very sensitive to corpora content and text-cleaning preprocessing steps, as well as our wild guess of how many topics there may be in the first place. Alternatively, we can see if there's a systematic way we could scrape this information directly from the project's GitHub repo's website and/or (big one!) train a simple supervised algorithm to classify the repo based on the content of README. LOTS OF FUN 💯
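
For reference, a minimal sketch of the unsupervised gensim route (the number of topics is an arbitrary guess, and real preprocessing such as stop-word removal and stemming would matter a lot here):

```python
# Sketch: bag-of-words LDA topic modelling over README texts with gensim.
from gensim import corpora, models
from gensim.utils import simple_preprocess

readmes = ["...project README text...", "...another README..."]  # placeholder input

# Tokenize and lowercase each README.
texts = [simple_preprocess(doc) for doc in readmes]

# Build the dictionary and the bag-of-words corpus.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit LDA; num_topics=10 is just a starting point to experiment with.
lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics(num_words=8):
    print(topic_id, words)
```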

Ly0n commented 2 years ago

> I think there are several approaches we can consider here. {gensim} uses a pretty simple bag-of-words approach for topic modelling (unsupervised ML) and this method can be effective but very sensitive to corpora content and text-cleaning preprocessing steps, as well as our wild guess of how many topics there may be in the first place. Alternatively, we can see if there's a systematic way we could scrape this information directly from the project's GitHub repo's website and/or (big one!) train a simple supervised algorithm to classify the repo based on the content of README. LOTS OF FUN 💯

That should be feasible. I have never worked with such frameworks, just classical CNNs for image processing so far. The simplest information we could extract from the READMEs is the linked DOI URLs. This data could be important for classification and labeling. After randomly selecting a few projects from the list, almost all of them have DOI URLs pointing to related papers. Adding this to the existing data mining script should not be a problem, but it could increase the runtime by increasing the number of API calls needed per project.
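
A minimal sketch of the DOI extraction (the regex follows Crossref's recommended DOI pattern; trailing punctuation and unusual prefixes may need extra handling):

```python
# Sketch: pull DOIs out of a README string with a regular expression.
import re

DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+")

def extract_dois(readme_text):
    """Return the unique DOIs found in a README string."""
    return sorted(set(DOI_PATTERN.findall(readme_text)))

readme = "See our paper: https://doi.org/10.1234/example-doi for details."
print(extract_dois(readme))  # ['10.1234/example-doi']
```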

Ly0n commented 2 years ago

Had some success last night with the DOI extraction. More details in the separate issue https://github.com/protontypes/open-sustainable-technology/issues/172.

The new list is compiling a new CSV file at the moment. It looks like we are getting DOI links for about a quarter of the projects, but we are still missing some.

Let us see if there are open source tools that give us more contextual information based on the DOIs.
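
One candidate worth checking, as a rough sketch: the public Crossref REST API returns metadata such as title, container title, and subject for a DOI. How consistently those fields are filled for our projects is an open question:

```python
# Sketch: look up metadata for a DOI via the public Crossref REST API.
# Fields like "subject" are not populated for every record, so treat the
# output as best-effort context rather than a reliable label source.
import requests

def crossref_metadata(doi):
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    resp.raise_for_status()
    msg = resp.json()["message"]
    return {
        "title": (msg.get("title") or [None])[0],
        "container": (msg.get("container-title") or [None])[0],
        "subjects": msg.get("subject", []),
        "published": msg.get("issued", {}).get("date-parts"),
    }

# Example (DOI is a placeholder):
# print(crossref_metadata("10.1234/example-doi"))
```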