text2vec models for navigating the R package universe

timchurches commented 7 years ago

This relates to the issues outlined in this blog post by Julia Silge, which arose out of discussions at the US rOpenSci 2017 unconference earlier this year.

Would there be any utility in training text2vec models (using the excellent text2vec package for R by Dmitriy Selivanov) on text extracted from R package documentation - maybe just the package description, or possibly the entire text of the documentation?

This would permit tasks such as topic modelling or similarity detection/finding to be undertaken with the trained word-embedding vectors - with potential application to things like automated creation of task views (possibly on-demand), and finding packages which are similar to a given package.
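To make the "find packages similar to a given package" idea concrete, here is a minimal sketch. It swaps in plain TF-IDF bag-of-words vectors and cosine similarity for trained word embeddings, so it is not what text2vec would produce, only the general shape of the similarity step. The package names and descriptions are invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical package descriptions, invented for illustration only.
DESCRIPTIONS = {
    "treepkg": "Fit classification and regression trees to tabular data",
    "forestpkg": "Random forests built from many regression trees",
    "mappkg": "Draw maps of spatial data",
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of word -> weight) per document."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))
    vectors = {}
    for name, words in tokens.items():
        counts = Counter(words)
        # Smoothed inverse document frequency, so every weight stays positive.
        vectors[name] = {
            w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0)
            for w, c in counts.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b[w] for w, weight in a.items() if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(name, vectors):
    """Rank every other package by similarity to `name`, best first."""
    scores = [(other, cosine(vectors[name], vec))
              for other, vec in vectors.items() if other != name]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


vectors = tfidf_vectors(DESCRIPTIONS)
ranked = most_similar("treepkg", vectors)
```

With embeddings, the only change would be how the vectors are built; the ranking step would stay the same.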

One nice thing about R package documentation is that it is well-structured and easily parsable/extractable. With > 10k packages available, there should be enough data to train a useful model.
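As a rough illustration of how parsable this is: a package DESCRIPTION file is a set of "Field: value" records in which indented lines continue the previous field. The sketch below parses that layout by hand; the sample text and the `parse_description` helper are made up for the example. In R itself, base `read.dcf()` already reads this format.

```python
def parse_description(text):
    """Parse 'Field: value' records; indented lines continue the previous field."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] in " \t" and current is not None:
            # Continuation line: append to the field being read.
            fields[current] = (fields[current] + " " + line.strip()).strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            current = key.strip()
            fields[current] = value.strip()
    return fields


# Hypothetical DESCRIPTION contents, invented for illustration only.
SAMPLE = """Package: examplepkg
Title: Tools for Example Data
Description: Reads example data and
    summarises it in tables.
Version: 0.1.0
"""

fields = parse_description(SAMPLE)
```

The `Title` and `Description` fields would be the natural text to feed into a model.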

njtierney commented 7 years ago

Great idea!

@mpadge, @kbenoit and I discussed a similar idea a while back and called it flipper, the idea being papr but for CRAN: people can enter a search term and then find similar packages based on the name and DESCRIPTION. IIRC, Mark and Ken were discussing building a large tree that would be traversed to find similar packages.

Perhaps flipper could be used to outline one approach to help move this idea forward :)

timchurches commented 7 years ago

Ok, thanks, the flipper package shows how to neatly access R package documentation programmatically. Feeding that to text2vec should be straightforward (hah!).

njtierney commented 7 years ago

flipper can flip the documentation to other services, I like it!

mpadge commented 7 years ago

Yep, strong :+1: to great idea! Pretty much all of what text2vec does has already been implemented in flipper, but via quanteda. flipper produces similarity matrices; does full text analyses; and implements a uniquely useful form of collocation analyses in the context of R package descriptions. (It's buried in the text_to_pkgs() function currently here.) The current flipper::flip() function is based on that, and implicitly quantifies similarities between all packages based on combining measures of textual and collocational similarity.

In short: We just gotta combine forces on this and we'll nail it! flipper is actually pretty much right to roll for standard package discovery tasks, and for navigating through the CRAN network. Main outstanding tasks are:

Integrating other package sources (CRAN, bioc, github, ...)
Putting a fancy pants front end on it (shiny swiping was the original concept, hence "flipper")
Setting up a web service to dump user stats and potential user R stuff like installed package lists to feed in as additional info to the similarity matrix.

That's not that much work, and we'd be good to go. Let's do it!

kbenoit commented 7 years ago

Agreed that this sounds like a cool and worthy idea. I've done nothing with flipper directly yet except to contribute dialogue, but if I'm assigned some issues or tasks I'd be happy to contribute. As of this week, I'm past the worst of the first half of term, and things are more under control, so I could follow up.

rgayler commented 7 years ago

I just noticed this on Rbloggers: http://www.bnosac.be/index.php/blog/70-cran-search-based-on-natural-language-processing

mpadge commented 7 years ago

That's a nifty website - thanks for alerting us to that. It's quite a nice model to take inspiration from for whatever the final form of this ends up being. First thought was the usual, "oh well, it has been done after all, I guess that's the end of flipper." But wait ..

I'm convinced of the importance of testing these kinds of things on non-typical phrases, exemplified in the current README, which has "trees are really quite nice". flipper generally gives one of two results for this (there's a bit of randomness in there to avoid local maxima), both of which contain (words derived from) "tree" and "great". Co-occurrence data from http://datatailor.be:9999/app/cran_search for the same phrase gives:

regression	tree	26
classification	regression	20
data	set	18
decision	tree	17
confidence	interval	10
analysis	data	9
machine	learn	9
birth	death	9
data	structure	9
data	be	9

... which really has nothing to do with what I typed in. Conclusion: The need still abides!
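For readers unfamiliar with this kind of table, here is a minimal sketch of one way word-pair counts like these can be produced: count each pair of adjacent words across a set of descriptions. This is an assumption for illustration. The thread does not say whether the site counts adjacent pairs or uses a wider window, and the lemmatisation visible in the table ("learn", "be") is skipped here. The descriptions are invented.

```python
import re
from collections import Counter


def adjacent_pairs(text):
    """Return the ordered pairs of neighbouring words in one text."""
    words = re.findall(r"[a-z]+", text.lower())
    return list(zip(words, words[1:]))


def cooccurrence_counts(descriptions):
    """Count how often each adjacent word pair occurs across all texts."""
    counts = Counter()
    for text in descriptions:
        counts.update(adjacent_pairs(text))
    return counts


# Hypothetical descriptions, invented for illustration only.
DESCRIPTIONS = [
    "Regression tree models for survival data",
    "Fast regression tree and decision tree fitting",
    "Plot a decision tree",
]

counts = cooccurrence_counts(DESCRIPTIONS)
for (first, second), n in counts.most_common(2):
    print(first, second, n)
```

Counts like these describe the corpus as a whole, which may be why a query about "trees" surfaces common pairs rather than anything close to the phrase entered.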

rgayler commented 7 years ago

Some directions that might be interesting:

Query by example - Rather than a few search terms it might be useful to allow a document (say, a description of the task you want to undertake) as the query.
Query modification - For example, when querying with A, you might get many answers that include B, which you know not to be relevant - so you should be able to exclude B from the results (project onto the subspace orthogonal to B if you are using a vector embedding model).
Interface to the CRANsearcher Rstudio add-in.
Allow exploration and personal annotation of a package collection - I frequently install packages that I think will be useful later. Consequently I have no idea why I installed most of my local packages. It would be good to be able to explore the conceptual structure of the packages I have installed and annotate them to record why I thought they would be useful.
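The query modification idea can be sketched in a few lines. Given a query vector q and an unwanted vector b, the modified query is q' = q - ((q . b) / (b . b)) b, which has no component along b. The vectors below are toy values chosen for illustration, and `reject` is a hypothetical helper name, not part of any package mentioned here.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def reject(query, unwanted):
    """Return `query` minus its projection onto `unwanted`."""
    denom = dot(unwanted, unwanted)
    if denom == 0:
        return list(query)  # nothing to remove for a zero vector
    scale = dot(query, unwanted) / denom
    return [q - scale * u for q, u in zip(query, unwanted)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    return dot(a, b) / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy 3-dimensional "embeddings", invented for illustration only.
query = [1.0, 1.0, 0.0]      # stands in for the query A
unwanted = [0.0, 1.0, 0.0]   # stands in for the irrelevant concept B

modified = reject(query, unwanted)
```

After the projection, packages that resembled the query only through B score zero on that component, while similarity along the remaining directions is kept.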

mpadge commented 7 years ago

Those are such awesome ideas that there's really only one question to ask: How willing and able (time-wise) are you to help? Or anyone else? I can keep scribbling away at flipper but it's pretty slow progress because of the little time I can devote to it. I suspect your first direction might already work reasonably well in current flipper::flip() functionality. It only accepts command line input, but that'd be easy to change. There'd be a couple of other necessary technical tweaks, but it should just run and give pretty solid results.

Second direction is notionally implemented in flipper::flip(), which currently just asks

Good package?
1: Yes
2: No
3: Maybe

The No answer does not currently do a right-angle turn in the vector space, but that is the intention and just has to be implemented. (The text interface is just a placeholder waiting for a slicker web-based interface.)

An annotation feature is also a brilliant idea, and connects (in-)directly with the recognised need to connect with an external data storage facility. Storage locations for such annotations would be somewhat tricky, but resolvable.

rgayler commented 7 years ago

Ay, there's the rub. Ideas are cheap; implementations, not so much. I am way over-committed.

(Sent by email in reply to mark padgham's comment of 26 October 2017 at 05:49.)

ropensci / ozunconf17

text2vec models for navigating the R package universe #40