ropensci / unconf17

Website for 2017 rOpenSci Unconf
http://unconf17.ropensci.org

Avoiding redundant / overlapping packages #78

Open sfirke opened 7 years ago

sfirke commented 7 years ago

I am not sure if there's a solution, much less one that could happen at the unconf. If it's just a lament, we can close this and move on. But the discussion on #69 has this on my mind and maybe someone else will have an idea.

It's vexing as a user to have packages that do almost entirely the same thing, with no intentional differentiator. E.g., seeing "package X does the same things as Y but with tidyverse-style functions" is great. But for instance, compare the very similar packages needs and pacman. Or - and I think fixing this on GitHub-only packages is hopeless - compare https://github.com/jsng/AirtableR and https://github.com/bergant/airtabler, two R interfaces to the Airtable API that even share the same package name.

One approach could be a series of lit reviews of topics like #69, where someone does a deep dive into the state of packages for a certain task/workflow and writes it up - including noting any voids not covered by current packages, so that someone looking to make a package on that topic can focus there. This could be good for users and potential developers.

Or we could write guidance for R package developers that establishes the norm: before you write a package, do extensive research to see what's already out there (and describe what that research looks like), and if an existing package could be modified to meet your needs, tell the maintainer what you're thinking and see whether they are open to contributions or have advice for you. I'm not sure I've seen the step "make sure you're writing something that intentionally adds to what exists" in any of the package development tutorials I've read.

Either of these could become blog posts.

I know people have feelings, time, and professional cred wrapped up in developing their packages. That suggests it will be easier to head off collisions early on, rather than after packages are being used in the field.

haozhu233 commented 7 years ago

Yes, I feel like this is exactly the same reason why school teachers always ask students to write a lit review section in their term papers. 😂 This topic is definitely worth discussing. It also echoes @seankross's question about package duplication in the #random channel a few days ago.

A few questions/thoughts:

  1. For GitHub projects, is there a way for us to better utilize the "tagging" system?
  2. In the process of writing a review paper, people usually search places like PubMed or Google. Can we do something similar for CRAN DESCRIPTION files? Is it worth the effort? Maybe some helper functions built on tools::CRAN_package_db() in R 3.4.0 (see Mining CRAN Description) that do keyword searches would be helpful (see the sketch after this list).
  3. It might be nice if places like the CRAN Task Views or awesome-R could also list these reviews where relevant.
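
To make item 2 concrete, here is a minimal sketch of such a keyword search over CRAN metadata via tools::CRAN_package_db() (R >= 3.4.0). The helper name search_cran is made up for illustration, and grepping Title + Description is just one possible approach.

```r
## Hypothetical helper: grep a keyword in CRAN Title + Description fields.
search_cran <- function(keyword) {
  db   <- tools::CRAN_package_db()                  # one row per CRAN package
  text <- paste(db$Title, db$Description)
  db[grepl(keyword, text, ignore.case = TRUE), c("Package", "Title")]
}

## e.g. packages whose metadata mentions "principal component"
head(search_cran("principal component"))
```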
richierocks commented 7 years ago

I think that PCA might be a good topic to try to tackle. It's a common technique, and there are dozens of packages related to this task. A quick trawl of CRAN gave me this list.

bpca Biplot of Multivariate Data Based on Principal Components Analysis
cpca Methods to perform Common Principal Component Analysis (CPCA)
fpca Restricted MLE for Functional Principal Components Analysis
FPCA2D Two Dimensional Functional Principal Component Analysis
FusedPCA Community Detection via Fused Principal Component Analysis
Gmedian Geometric Median, k-Median Clustering and Robust Median PCA
gPCA Batch Effect Detection via Guided Principal Components Analysis
hdpca Principal Component Analysis in High-Dimensional Data
icapca Mixed ICA/PCA
irlba Fast Truncated SVD, PCA and Symmetric Eigendecomposition for Large Dense and Sparse Matrices
logisticPCA Binary Dimensionality Reduction
MetaPCA MetaPCA: Meta-analysis in the Dimension Reduction of Genomic data
MFPCA Multivariate Functional Principal Component Analysis for Data Observed on Different Dimensional Domains
mpm Multivariate Projection Methods
nsprcomp Non-Negative and Sparse PCA
onlinePCA Online Principal Component Analysis
pca3d Three Dimensional PCA Plots
pcaBootPlot Create 2D Principal Component Plots with Bootstrapping
pcadapt Fast Principal Component Analysis for Outlier Detection
pcaL1 L1-Norm PCA Methods
pcaPP Robust PCA by Projection Pursuit
pcdpca Dynamic Principal Components for Periodically Correlated Functional Time Series
rospca Robust Sparse PCA using the ROSPCA Algorithm
rpca RobustPCA: Decompose a Matrix into Low-Rank and Sparse Components
semisupKernelPCA Kernel PCA projection, and semi-supervised variant
sGPCA Sparse Generalized Principal Component Analysis
SpatPCA Regularized Principal Component Analysis for Spatial Data
SPCALDA A New Reduced-Rank Linear Discriminant Analysis Method

I think the workflow for a topic cleanup is something like the following.

  1. Get metrics on each package: is it being maintained, does it have tests, vignettes, and examples? This should be mostly automatable (a sketch follows this list).
  2. Do a deeper review of each package by actually using it. The reviewer would aim to get a sense of what the package is for, its quality, and make a list of features. This is very time intensive, but parallelizable across people.
  3. Aggregate all the reviews to get a sense of the scope of existing work. Identify overlaps and gaps.
  4. Determine an optimal ecosystem of packages: that is, which packages need merging, splitting, further development, or killing off.
  5. Contact existing package maintainers to see how enthusiastic they are about participating in the plan.
  6. Repeat steps 4 and 5 until you have a solid plan.
  7. Update the packages.
  8. Promote the new ecosystem!
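
As a rough illustration of step 1 (not part of the workflow itself), something like the following could be a starting point, pulling a couple of automatable signals from CRAN metadata for a handful of the PCA packages above. The two-year "staleness" cutoff is an arbitrary assumption, and checking for tests, vignettes, and examples would additionally require downloading each package's source, which is omitted here.

```r
## Sketch of automated metrics for step 1, using only CRAN metadata.
pkgs <- c("bpca", "irlba", "pcaPP", "nsprcomp")     # illustrative subset
db   <- tools::CRAN_package_db()
db   <- db[db$Package %in% pkgs, ]

data.frame(
  package      = db$Package,
  last_release = as.Date(db$Published),
  # crude "is it maintained?" proxy: no CRAN release in the last two years
  stale        = Sys.Date() - as.Date(db$Published) > 365 * 2
)
```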
maelle commented 7 years ago

I really like the review idea. At some point I wanted to do that for packages for analyzing accelerometer data, because it's such a mess: it's impossible to know which package implements which algorithm or handles a given device, yet there seem to be overlaps. I agree that the task is parallelizable once the reviewers have agreed on some questions for the field (so that some sort of table can be produced at the end).

jimhester commented 7 years ago

One idea is to use a similarity matrix like the one PubMed uses for suggesting similar articles. Basically, it builds a weighted matrix of articles based on the words in the abstract, with each word down-weighted by how often it occurs in the full corpus.

We could do the same thing with package titles and DESCRIPTIONs, or, if you wanted to be crazy, with the symbols/variable names used in the source code (probably too noisy to be useful, though).
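
A base-R sketch of that idea, restricted to packages whose metadata mentions "principal component" so the matrix stays small. Plain smoothed TF-IDF weighting is assumed here (not PubMed's actual scheme), and a real version would use a text-analysis package with proper tokenization and stopword handling.

```r
## TF-IDF matrix over Title + Description text, then cosine similarity.
db   <- tools::CRAN_package_db()
db   <- db[grepl("principal component", paste(db$Title, db$Description),
                 ignore.case = TRUE), ]

docs <- tolower(paste(db$Title, db$Description))
toks <- lapply(strsplit(gsub("[^a-z ]", " ", docs), "\\s+"),
               function(x) x[nzchar(x)])

vocab <- sort(unique(unlist(toks)))
tf    <- t(vapply(toks,
                  function(x) tabulate(factor(x, vocab), length(vocab)),
                  numeric(length(vocab))))
idf   <- log(1 + nrow(tf) / colSums(tf > 0))        # down-weight common words
tfidf <- sweep(tf, 2, idf, `*`)
rownames(tfidf) <- db$Package

## packages most similar (by cosine) to the first package in the subset
query <- db$Package[1]
cos   <- drop(tfidf %*% tfidf[query, ]) /
  (sqrt(rowSums(tfidf^2)) * sqrt(sum(tfidf[query, ]^2)))
head(sort(cos, decreasing = TRUE))
```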

rmflight commented 7 years ago

That's a neat idea @jimhester, although you would probably have to extend it to the contents of the functions' .Rd files, as that is often where the bits of information that set a package or piece of functionality apart live, IME. But it would be very interesting to see how well a semantic analysis could aggregate/differentiate sets of packages.

jimhester commented 7 years ago

Tokenizing the full man/ documentation is a good idea I hadn't thought of, and it would likely work better than trying to do the same with the code.
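
A small sketch of what that could look like for an installed package: render each Rd page to plain text with tools and collect the words, ready to feed into the same weighting scheme. rd_words is a made-up helper name.

```r
## Collect the plain-text man/ documentation of an installed package.
rd_words <- function(pkg) {
  rd  <- tools::Rd_db(pkg)                          # parsed Rd objects
  txt <- vapply(rd, function(x)
    paste(capture.output(tools::Rd2txt(x)), collapse = " "), character(1))
  tolower(unlist(strsplit(paste(txt, collapse = " "), "\\W+")))
}

## e.g. the most common words across the stats package's help pages
head(sort(table(rd_words("stats")), decreasing = TRUE))
```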

rmflight commented 7 years ago

Right, function names and implementations may be completely different, but descriptions of purpose and/or methods should illuminate true (dis)similarity.

noamross commented 7 years ago

I'm not sure it makes sense to try to merge them, but I think a topic ripe for a "lit review" (maybe more like what is being described for tables in #69) is assertion packages. Off the top of my head, these include assertthat, testdat, assertive, ensurer, assertr, validate (plus some associated packages), and checkmate.

ateucher commented 7 years ago

@noamross add to that list datacheckr

stephlocke commented 7 years ago

And Rich Iannone's pointblank

lshep commented 7 years ago

I really like the review idea as well. I am running into the same issue and concern with Bioconductor packages, and we wanted to try to come up with a solution.

noamross commented 7 years ago

The R Journal has categories for both "Reviews and Proposals" and "Benchmarks and Comparisons". A publication there might be one outcome of such a project, though probably not the only one if you want to maximize discoverability.

AliciaSchep commented 7 years ago

Perhaps a meta-project could be creating an overview of how to write a good review of similar packages.

One thought is that rOpenSci or something similar could be a home for this: similar to the onboarding process for packages, there could be onboarding for Git repos that present a review of similar packages. Compared with a traditional journal, the advantage would be that the repo could be updated as things change, and the review would happen in the open.

richierocks commented 7 years ago

@AliciaSchep

How should such a review be organized?

rOpenSci already has guidelines for how to review a package. See the Reviewing Template and Reviewing Guide.

Hopefully that guidance is reusable in this context.

sfirke commented 7 years ago

As the issue opener, I'm summarizing the thread; please chime in with additions or things I missed.

In summary: it seems like there's general interest in the problem of package coverage, specifically redundancy and gaps when multiple packages address the same topic(s).

One approach to solving the problem: a review of packages in a topic area, possibly a mix of automated info-gathering (as proposed by @jimhester and @rmflight) and human effort (along the lines of what @richierocks lists above). (While this thread has taken a turn toward reviews as the most actionable step, if people have other ideas for combating redundancy, they're welcome.)

For conference time: I think we could build out more concretely what a topic review would entail, like what @AliciaSchep proposes, and then take it for a spin and produce example(s) (perhaps coordinating with the group working on #69). We could even use that test drive experience to inform revisions to the topic review process.

mpadge commented 6 years ago

Better late than never: @njtierney, @lucymcgowan, @kbenoit and yours truly are developing flipper. (A result of useR rather than the unconf, but hey.) It's unashamedly and directly inspired by the papr shiny app that gives a Tinder-type interface for bioRxiv manuscripts; flipper will be the same thing for R packages from CRAN, Bioconductor, and hopefully GitHub. It works via exactly the kind of textual similarity matrix @jimhester described (breathtakingly easy with @kbenoit's quanteda package). Not much to see yet, but watch this space. No, not this one, that one.
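
For reference, a minimal sketch of that quanteda approach on CRAN titles and descriptions, assuming a current setup where textstat_simil() lives in the companion quanteda.textstats package; the query package dplyr is just an example docname.

```r
## Cosine similarity between package metadata texts via quanteda.
library(quanteda)
library(quanteda.textstats)

db   <- tools::CRAN_package_db()
db   <- db[!duplicated(db$Package), ]               # docnames must be unique
corp <- corpus(paste(db$Title, db$Description), docnames = db$Package)
d    <- dfm(tokens(corp, remove_punct = TRUE))
d    <- dfm_remove(d, stopwords("en"))

## packages whose metadata looks most like dplyr's
textstat_simil(d, d["dplyr", ], margin = "documents", method = "cosine")
```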

stefaniebutland commented 6 years ago

"Combining the two issues, we set out to to create a guide that could help users navigate package selection, using the case of reproducible tables as a case study."

Repo: https://github.com/ropenscilabs/packagemetrics
Blog post: packagemetrics - Helping you choose a package since runconf17