Open timchurches opened 7 years ago
Great idea!
Myself, @mpadge and @kbenoit discussed a similar idea a while back and called it flipper, the idea being papr but for CRAN - people can enter a search term and then find similar packages based on the name and DESCRIPTION. IIRC, Mark and Ken were discussing building a large tree that would be traversed to find similar packages.
Perhaps flipper could be used to outline one approach to help move this idea forward :)
Ok, thanks, the flipper package shows how to neatly access R package documentation programmatically. Feeding that to text2vec should be straightforward (hah!).
flipper can flip the documentation to other services, I like it!
Yep, strong :+1: to great idea! Pretty much all of what text2vec does has already been implemented in flipper, but via quanteda. flipper produces similarity matrices; does full text analyses; and implements a uniquely useful form of collocation analysis in the context of R package descriptions. (It's buried in the text_to_pkgs() function currently here.) The current flipper::flip() function is based on that, and implicitly quantifies similarities between all packages by combining measures of textual and collocational similarity.
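To make "combining measures of textual and collocational similarity" a bit more concrete, here is a minimal Python sketch. It is not flipper's actual implementation: the use of cosine similarity over word counts and over adjacent word pairs, the `weight` parameter, and the toy descriptions are all assumptions made purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a description and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def combined_similarity(desc_a, desc_b, weight=0.5):
    """Blend word-level and adjacent-pair (collocation) cosine similarity.

    `weight` is a hypothetical mixing parameter, not taken from flipper.
    """
    ta, tb = tokenize(desc_a), tokenize(desc_b)
    words = cosine(Counter(ta), Counter(tb))
    pairs = cosine(Counter(zip(ta, ta[1:])), Counter(zip(tb, tb[1:])))
    return weight * words + (1 - weight) * pairs


def similarity_matrix(descriptions):
    """Return (names, matrix) holding all pairwise combined similarities."""
    names = sorted(descriptions)
    matrix = [
        [combined_similarity(descriptions[a], descriptions[b]) for b in names]
        for a in names
    ]
    return names, matrix


# Toy descriptions invented for illustration; not real CRAN DESCRIPTION texts.
toy = {
    "pkgA": "Fits decision tree models for classification",
    "pkgB": "Decision tree models for regression and classification",
    "pkgC": "Tools for parsing dates and times",
}
names, sim = similarity_matrix(toy)
```

With these toy inputs, `sim[0][1]` (pkgA vs pkgB) comes out well above `sim[0][2]` (pkgA vs pkgC), because the first pair shares both words and adjacent word pairs while the second shares only the word "for".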
In short: We just gotta combine forces on this and we'll nail it! flipper is actually pretty much right to roll for standard package discovery tasks, and for navigating through the CRAN network. Main outstanding tasks are:
That's not that much work, and we'd be good to go. Let's do it!
Agreed that this sounds like a cool and worthy idea. I've done nothing with flipper directly yet except to contribute dialogue, but if I'm assigned some issues or tasks I'd be happy to contribute. As of this week, I'm past the worst of the first half of term, and things are more under control, so I could follow up.
I just noticed this on Rbloggers: http://www.bnosac.be/index.php/blog/70-cran-search-based-on-natural-language-processing
That's a nifty website - thanks for alerting us to that. It's quite a nice model to take inspiration from for whatever the final form of this ends up being. First thought was the usual, "oh well, it has been done after all, I guess that's the end of flipper." But wait ..
I'm convinced of the importance of testing these kinds of things on non-typical phrases, exemplified in the current README, which has "trees are really quite nice". flipper generally gives one of two results for this (there's a bit of randomness in there to avoid local maxima), both of which contain (words derived from) "tree" and "great". Co-occurrence data from http://datatailor.be:9999/app/cran_search for the same phrase gives:
| Term 1 | Term 2 | Co-occurrences |
|---|---|---|
| regression | tree | 26 |
| classification | regression | 20 |
| data | set | 18 |
| decision | tree | 17 |
| confidence | interval | 10 |
| analysis | data | 9 |
| machine | learn | 9 |
| birth | death | 9 |
| data | structure | 9 |
| data | be | 9 |
... which really has nothing to do with what I typed in. Conclusion: The need still abides!
Some directions that might be interesting:
Query by example - Rather than a few search terms it might be useful to allow a document (say, a description of the task you want to undertake) as the query.
Query modification - For example, when querying with A, you might get many answers that include B, which you know not to be relevant - so you should be able to exclude B from the results (project onto the subspace orthogonal to B if you are using a vector embedding model).
Interface to the CRANsearcher Rstudio add-in.
Allow exploration and personal annotation of a package collection - I frequently install packages that I think will be useful later. Consequently I have no idea why I installed most of my local packages. It would be good to be able to explore the conceptual structure of the packages I have installed and annotate them to record why I thought they would be useful.
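The query-modification direction (projecting onto the subspace orthogonal to B) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-dimensional vectors, the package names, and the helper names `reject` and `rank` are all made up, and a real system would use embeddings learned from package documentation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def reject(query, unwanted):
    """Remove from `query` its component along `unwanted` (vector rejection)."""
    denom = dot(unwanted, unwanted)
    if denom == 0:
        return list(query)
    scale = dot(query, unwanted) / denom
    return [q - scale * b for q, b in zip(query, unwanted)]


def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector has zero length."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def rank(query, packages):
    """Sort package names by cosine similarity to the query, best first."""
    return sorted(
        packages, key=lambda name: cosine(query, packages[name]), reverse=True
    )


# Hypothetical 3-d embeddings, invented for illustration only.
packages = {
    "wanted_pkg": [1.0, 0.2, 0.0],
    "unwanted_pkg": [0.6, 1.0, 0.0],
    "other_pkg": [0.0, 0.1, 1.0],
}
query_a = [0.5, 0.6, 0.0]    # the original query A
concept_b = [0.0, 1.0, 0.0]  # the concept B to exclude
modified = reject(query_a, concept_b)
```

In this toy setup the original query ranks `unwanted_pkg` first, while the modified query, which has no component left along B, ranks `wanted_pkg` first.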
Those are such awesome ideas that there's really only one question to ask: How willing and able (time-wise) are you to help? Or anyone else? I can keep scribbling away at flipper but it's pretty slow progress because of the little time I can devote to it. I suspect your first direction might already work reasonably well in current flipper::flip() functionality. It only accepts command line input, but that'd be easy to change. There'd be a couple of other necessary technical tweaks, but it should just run and give pretty solid results.
Second direction is notionally implemented in flipper::flip(), which currently just asks
Good package?
1: Yes
2: No
3: Maybe
The No answer does not currently actually do a right-angle turn in the vector space, but that is the intention and just has to be implemented. (The text interface is just a place-holder waiting for a slicker web-based interface.)
An annotation feature is also a brilliant idea, and connects (in-)directly with the recognised need to connect with an external data storage facility. Storage locations for such annotations would be somewhat tricky, but resolvable.
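As one possible starting point for the storage question, annotations could live in a plain local JSON file keyed by package name. The Python sketch below is an assumption-laden illustration rather than a proposal for flipper's design: the file name, the helper names `load_notes` and `annotate`, and the example notes are all hypothetical.

```python
import json
import os
import tempfile


def load_notes(path):
    """Read the annotation file, returning an empty dict if it does not exist."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def annotate(path, package, note):
    """Append a free-text note to a package's list of annotations and save."""
    notes = load_notes(path)
    notes.setdefault(package, []).append(note)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(notes, handle, indent=2, sort_keys=True)
    return notes


# Example run against a temporary file; the file name and notes are made up.
notes_path = os.path.join(tempfile.mkdtemp(), "package_notes.json")
annotate(notes_path, "text2vec", "installed to try word embeddings")
annotate(notes_path, "quanteda", "collocation analysis")
saved = load_notes(notes_path)
```

A flat file like this is easy to inspect and to sync to external storage later, at the cost of having no protection against concurrent writers.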
Ay, there's the rub. Ideas are cheap; implementations, not so much. I am way over-committed.
This relates to the issues outlined in this blog post by Julia Silge, which arose out of discussions at the US rOpenSci 2017 unconference earlier this year.
Would there be any utility in training text2vec models, using the excellent text2vec package for R by Dmitriy Selivanov, with text extracted from R package documentation - maybe just the package description, or possibly the entire text of the documentation?
This would permit tasks such as topic modelling or similarity detection/finding to be undertaken with the trained word-embedding vectors - with potential application to things like automated creation of task views (possibly on-demand), and finding packages which are similar to a given package.
One nice thing about R package documentation is that it is well-structured and easily parsable/extractable. With > 10k packages available, there should be enough data to train a useful model.
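Word-embedding models in the GloVe family are fit on a term co-occurrence matrix, so the first step of such training can be sketched without any special library. The Python below is a minimal illustration, assuming a symmetric window and unweighted counts (real implementations often down-weight more distant pairs); the function names, the window size, and the two toy descriptions are assumptions, not real package documentation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a description and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cooccurrence_counts(documents, window=2):
    """Count unordered term pairs that appear within `window` tokens of each other."""
    counts = Counter()
    for text in documents:
        tokens = tokenize(text)
        for i, left in enumerate(tokens):
            # Look only ahead, so each pair of positions is counted once.
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    counts[tuple(sorted((left, right)))] += 1
    return counts


# Toy descriptions invented for illustration only.
toy_descriptions = [
    "Decision tree models for classification",
    "Regression tree and decision tree methods",
]
counts = cooccurrence_counts(toy_descriptions, window=2)
```

The resulting `counts` maps sorted term pairs such as `("decision", "tree")` to how often they co-occur, which is the kind of input an embedding model would then be trained on.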