Open RichardLitt opened 7 years ago
It may be possible to bootstrap a learning corpus with this list of topics: https://github.com/github/explore.
A low tech way for projects published to a package manager that supports keywords would be to pull the existing ones from the keywords or tags fields.
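As a rough illustration of the keywords approach, here is a minimal Python sketch that reads a JSON manifest (a `package.json`-style file is assumed) and collects whatever is in its `keywords` or `tags` fields. The function name `manifest_keywords` is hypothetical, and the handling of string-valued fields is an assumption about how some manifests store them.

```python
import json


def manifest_keywords(manifest_text: str) -> list[str]:
    """Return lowercased, de-duplicated keywords from a JSON manifest.

    Assumes the manifest is a JSON object that may carry a "keywords"
    and/or "tags" field (field names taken from the comment above).
    """
    manifest = json.loads(manifest_text)
    found: list[str] = []
    for field in ("keywords", "tags"):
        value = manifest.get(field, [])
        # Assumption: some manifests store keywords as one comma- or
        # space-separated string instead of a list.
        if isinstance(value, str):
            value = value.replace(",", " ").split()
        for item in value:
            word = str(item).strip().lower()
            if word and word not in found:
                found.append(word)
    return found
```

Calling `manifest_keywords(open("package.json").read())` would then give a first list of candidate topics without any text analysis at all.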
I did experiment with pulling interesting words from readmes and descriptions in the Libraries.io codebase using a Ruby library called highscore, but removed it a while back: the results weren't great, and it was pretty slow to be running as part of the critical path inside the Rails app. The main code was here: https://github.com/librariesio/libraries.io/blob/7a15048fe7135052dc3ac9383d13833b5cb1f85b/app/models/readme.rb#L75-L79
A low tech way for projects published to a package manager that supports keywords would be to pull the existing ones from the keywords or tags fields.
Yeah, I already do that for projects which have manifests. I'm trying to think of a better way to extract topics. I think using only the description and background sections, rather than the entire README, should help.
I'm going to make a package now to automatically cross-check with topics from github/explore. That might be a workable solution while GitHub doesn't yet have an API for suggesting topics.
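The cross-check could look something like the following Python sketch. It assumes a local checkout of github/explore in which each topic is a directory under `topics/`. The function names `load_known_topics` and `cross_check`, and the space-to-hyphen normalisation, are illustrative assumptions, not the package's actual design.

```python
from pathlib import Path


def load_known_topics(explore_checkout: str) -> set[str]:
    """Collect topic names from a local clone of github/explore.

    Assumption: every topic lives in its own directory under topics/.
    """
    topics_dir = Path(explore_checkout) / "topics"
    return {p.name.lower() for p in topics_dir.iterdir() if p.is_dir()}


def cross_check(candidates: list[str], known_topics: set[str]) -> list[str]:
    """Keep only the candidates that match a known topic, in input order."""
    seen: set[str] = set()
    matches: list[str] = []
    for candidate in candidates:
        # Topic names are hyphenated, so "machine learning" is
        # compared as "machine-learning".
        slug = candidate.strip().lower().replace(" ", "-")
        if slug in known_topics and slug not in seen:
            seen.add(slug)
            matches.append(slug)
    return matches
```

Exact matching is deliberately strict here; fuzzy matching or alias lists would catch more, at the cost of more false suggestions.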
Thanks for the help! Slowness isn't an issue for me, this will be pretty fast I think.
I've started work on this, here: Katahdin.
This will involve a couple of things. First, parsing the README. Second, finding the Description or Background section. Then, either topic extraction or NER of that information, with the goal of seeing if you can automatically suggest topics for the README.
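The first two steps (parsing the README and finding the Description or Background section) can be sketched in Python as below. This is a simplified take that only recognises Markdown `#`-style headings and skips fenced code blocks; a real Markdown parser would be more robust. The name `extract_sections` and the default section titles are assumptions for illustration.

```python
import re

# Matches ATX headings such as "## Background" or "## Background ##".
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")


def extract_sections(readme_text, wanted=("description", "background")):
    """Return {section title: body text} for the wanted README sections."""
    sections = {}
    current = None      # title of the wanted section being collected
    in_fence = False    # True while inside a ``` fenced code block
    for line in readme_text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            title = match.group(2).strip().lower()
            # Any heading ends the previous section; only wanted
            # titles start a new one.
            current = title if title in wanted else None
            if current:
                sections.setdefault(current, [])
        elif current:
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}
```

The returned text is what would then be handed to topic extraction or NER.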
For now, noun phrases from the description may do the trick for suggestions. A test database of repositories and topics would greatly help here, however.
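As a stand-in for real noun-phrase extraction, here is a crude Python sketch that splits text at punctuation and stopwords and counts the word runs left over. It does no part-of-speech tagging, so it is only an approximation of noun phrases; the stopword list, the `max_words` limit, and the name `candidate_phrases` are all assumptions.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; a real one is longer).
STOPWORDS = frozenset(
    "a an and are as at be by can for from in is it of on or "
    "that the this to will with you your".split()
)


def candidate_phrases(text, max_words=3):
    """Return stopword-delimited phrases, most frequent first."""
    counts = Counter()
    # Split at punctuation so a phrase never spans two clauses.
    for clause in re.split(r"[.,;:!?()\[\]\n]+", text.lower()):
        phrase = []
        # The trailing None flushes the last phrase of the clause.
        for word in re.findall(r"[a-z0-9][a-z0-9+#-]*", clause) + [None]:
            if word is None or word in STOPWORDS:
                if 0 < len(phrase) <= max_words:
                    counts[" ".join(phrase)] += 1
                phrase = []
            else:
                phrase.append(word)
    return [phrase for phrase, _ in counts.most_common()]
```

The resulting phrases could be checked against a list of known topics before being suggested, and a test set of repositories with known topics would show how often the suggestions are right.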