Automatically generate topics and keywords

mntnr / build-a-space

Automatically add community documentation to your repository

https://maintainer.io

MIT License

Automatically generate topics and keywords #47

Open RichardLitt opened 7 years ago

RichardLitt commented 7 years ago

This will involve a few steps: first, parsing the README; second, finding the Description or Background section; then running either topic extraction or named-entity recognition (NER) on that text, with the goal of seeing if you can automatically suggest topics for the README.

For now, noun phrases taken from the description may do the trick for suggestions. This would be greatly aided by a test database of repositories and topics, however.
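The steps described above could be sketched roughly as follows. This is only a minimal illustration, not the project's implementation: `extract_sections` and `candidate_phrases` are hypothetical names, the stopword list is a tiny made-up one, and splitting on stopwords and punctuation is a crude stand-in for real noun-phrase extraction or NER.

```python
import re
from collections import Counter

# Headings named in the issue; anything else is ignored.
SECTION_NAMES = {"description", "background"}

# Tiny illustrative stopword list (an assumption, not a real linguistic resource).
STOPWORDS = {
    "a", "an", "and", "the", "of", "for", "to", "in", "is", "it", "this",
    "that", "with", "on", "your", "you", "are", "as", "by", "or", "be",
}

# Matches Markdown ATX headings such as "## Background".
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")


def extract_sections(readme: str) -> str:
    """Return the text found under Description/Background headings.

    Simplification: any other heading (including a sub-heading) ends the section.
    """
    kept = []
    keep = False
    for line in readme.splitlines():
        match = HEADING.match(line)
        if match:
            keep = match.group(2).strip().lower() in SECTION_NAMES
            continue
        if keep:
            kept.append(line)
    return "\n".join(kept).strip()


def candidate_phrases(text: str, max_words: int = 3) -> list[str]:
    """Return runs of non-stopwords, most frequent first, as rough phrase candidates."""
    counts = Counter()
    # Punctuation ends a phrase, so split the text into chunks first.
    for chunk in re.split(r"[^a-z0-9\s-]+", text.lower()):
        run = []
        # The trailing empty string is a sentinel that flushes the last run.
        for word in chunk.split() + [""]:
            word = word.strip("-")
            if word and word not in STOPWORDS:
                run.append(word)
                continue
            if 0 < len(run) <= max_words:
                counts[" ".join(run)] += 1
            run = []
    return [phrase for phrase, _ in counts.most_common()]
```

For a README whose Background section reads "This is a command line tool for topic extraction.", `candidate_phrases(extract_sections(readme))` yields `command line tool` and `topic extraction` as suggestions.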

RichardLitt commented 7 years ago

It may be possible to bootstrap a learning corpus with this list of topics: https://github.com/github/explore.

andrew commented 7 years ago

A low tech way for projects published to a package manager that supports keywords would be to pull the existing ones from the keywords or tags fields.

I did experiment with pulling interesting words from READMEs and descriptions in the Libraries.io codebase using a Ruby library called highscore, but removed it a while back because the results weren't great and it was pretty slow to be running as part of the critical path inside the Rails app. The main code was here: https://github.com/librariesio/libraries.io/blob/7a15048fe7135052dc3ac9383d13833b5cb1f85b/app/models/readme.rb#L75-L79

RichardLitt commented 7 years ago

> A low tech way for projects published to a package manager that supports keywords would be to pull the existing ones from the keywords or tags fields.

Yeah, I already do that for projects which have manifests. I'm trying to think of a better way to extract. I think not using the entire README - just the description and background sections - should help.
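As a rough illustration of the manifest-based approach mentioned above, here is a minimal sketch that reads keyword-like fields from a JSON manifest such as a `package.json`. The function name `manifest_keywords` is hypothetical, and lowercasing entries and joining words with hyphens is an assumption about the desired topic format, not something this thread specifies.

```python
import json


def manifest_keywords(manifest_text: str) -> list[str]:
    """Collect entries from the "keywords" and "tags" fields of a JSON manifest."""
    manifest = json.loads(manifest_text)
    found = []
    for field in ("keywords", "tags"):
        value = manifest.get(field) or []
        # Some manifests store keywords as one comma-separated string.
        if isinstance(value, str):
            value = value.split(",")
        for item in value:
            # Assumed normalisation: lowercase, words joined with hyphens.
            topic = "-".join(str(item).lower().split())
            if topic and topic not in found:
                found.append(topic)
    return found
```

Duplicates that differ only in case or spacing collapse into a single entry, and the first-seen order is preserved.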

I'm going to make a package now to automatically cross-check with topics from github/explore. It might be a solution while GitHub doesn't yet have an API for suggesting topics.
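A cross-check of this kind might look like the sketch below. It assumes the caller already has a plain list of known topic names (for example, one built from github/explore) and does not read that repository itself. `suggest_topics` is a hypothetical name, and the topic names in the usage note are only illustrative.

```python
import re


def suggest_topics(candidates: list[str], known_topics: list[str]) -> list[str]:
    """Keep only candidates that match a known topic name.

    Each candidate is tried as a whole hyphenated phrase first, then word by word.
    """
    known = set(known_topics)
    suggestions = []
    for candidate in candidates:
        words = re.findall(r"[a-z0-9]+", candidate.lower())
        for option in ["-".join(words), *words]:
            if option in known and option not in suggestions:
                suggestions.append(option)
    return suggestions
```

With known topics `machine-learning`, `nlp`, and `documentation`, the candidates "Machine Learning", "documentation generator", and "foo" reduce to `machine-learning` and `documentation`.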

Thanks for the help! Slowness isn't an issue for me; this will be pretty fast, I think.

RichardLitt commented 7 years ago

I've started work on this, here: Katahdin.