What do researchers search for when looking for code repositories?

lukecoy commented 8 years ago

From README

This project’s objective is to create an open source web dashboard capable of searching multiple code hosting services for the benefit of the research community

Here are a couple of questions to start the discussion about what would make a Software Discovery Dashboard most useful for researchers:

What are valuable search criteria when attempting to discover code repositories?
What kind of information would researchers find necessary (or just helpful) in search results?

versae commented 8 years ago

It would be great if the Software Discovery Dashboard included options to search for reference implementations of published papers by looking up the authors' names, DOIs, or paper titles.

License and language would also be interesting.

pdurbin commented 8 years ago

@acabunoc asked me to repeat what I said in #sciencelab that https://dataverse.harvard.edu has a fair amount of R code. Stata too.

okdistribute commented 8 years ago

What are valuable search criteria when attempting to discover code repositories?

I've heard that searching for tables by 'data type' is useful, but doesn't really exist. This would require better practice around schema creation and publishing alongside raw data, though.

What kind of information would researchers find necessary (or just helpful) in search results?

file size, estimated download time, # of rows/files, keywords. Inspired by opendatacache.com by @talos

yarikoptic commented 8 years ago

I usually "apt-cache search" first to find software at my fingertips. If it is not there, I google it. In the neuroscience/neuroimaging domain there are also NIF (http://www.neuinfo.org/) and NITRC (http://nitrc.org), which collate/host various related software projects. Google at times leads me there ;)

As for software implementing some publication/method: we have plans (but not enough hands yet) to add centralized reporting to duecredit (https://github.com/duecredit/duecredit/), so that later you would be able to find software implementing a referenced publication.

pdurbin commented 8 years ago

@arfon is thinking about the related area of software citation: https://twitter.com/arfon/status/628504262121816064

mbjones commented 8 years ago

Re-usable packages from CRAN, PyPI, etc. are one thing. The actual scripts researchers write and use in analysis are another. People are now archiving analytical code in R, Matlab, and other languages into various data repositories such as the KNB and FigShare as part of their archived data packages. Here's an example of such a package with R code, which has very minimal metadata about the software.

For this type of code in the KNB (and DataONE) it would be useful to be able to search for software used in analyses based on the type of analysis that was done, who created it, which papers it was used in, etc. Some (idiosyncratic) example queries researchers might want include:

What software was used to produce the results from the paper with identifier {DOI}?
Which derived products (data sets, figures, etc) were created using {analysis type}?
- example analysis types: MCMC, logistic regression, ANOVA
What software was used by researcher {name or ORCID}?
What software can process data from {format} to {format}?
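As a rough illustration of what metadata these queries would need, here is a minimal sketch over an in-memory list of records. Everything in it is an assumption made for the example: the `AnalysisRecord` fields, the function names, and the placeholder values are hypothetical and do not reflect the actual KNB or DataONE metadata model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AnalysisRecord:
    """One archived analysis script with hypothetical metadata fields."""
    software: str
    paper_doi: str
    analysis_types: list[str] = field(default_factory=list)
    researcher: str = ""  # name or ORCID
    derived_products: list[str] = field(default_factory=list)
    converts: tuple[str, str] | None = None  # (input format, output format)


def software_for_paper(records, doi):
    """What software was used to produce the results from paper {DOI}?"""
    return sorted({r.software for r in records if r.paper_doi == doi})


def products_for_analysis(records, analysis_type):
    """Which derived products were created using {analysis type}?"""
    return sorted({p for r in records
                   if analysis_type in r.analysis_types
                   for p in r.derived_products})


def software_by_researcher(records, who):
    """What software was used by researcher {name or ORCID}?"""
    return sorted({r.software for r in records if r.researcher == who})


def converters(records, src, dst):
    """What software can process data from {format} to {format}?"""
    return sorted({r.software for r in records if r.converts == (src, dst)})


# Placeholder records, invented purely to exercise the functions.
RECORDS = [
    AnalysisRecord("fit_model.R", "10.0000/example.1",
                   ["logistic regression"], "A. Researcher", ["figure2.png"]),
    AnalysisRecord("to_csv.py", "10.0000/example.2",
                   [], "B. Researcher", [], ("netcdf", "csv")),
]

print(software_for_paper(RECORDS, "10.0000/example.1"))
print(converters(RECORDS, "netcdf", "csv"))
```

The point of the sketch is that each query only works if the archived package records the corresponding field, which is exactly the metadata that is usually missing today.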

schae234 commented 8 years ago

For us (computational biologists) at least, most of the time it's method driven. We want to answer such and such and heard that method X was a good one, or that method Y overcomes difficulties that method X does not. The starting point is then the literature, and we just hope that the code is available somewhere online.

I imagine a useful dashboard for computational biologists might contain topics broken down by methods and then by implementation. E.g --

GWAS
- Mixed Models
- Plotting
- ...
NGS
- aligners
- RNASeq
- ...
read mapping ...
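One possible way to hold such a breakdown is a nested mapping of topic, then method, then implementations. This is only a sketch: the topic and method names are copied from the outline above, while the `TAXONOMY` structure and the helper names are hypothetical, and the implementation lists are deliberately left empty.

```python
# Topic -> method -> list of implementations.
# Topics and methods mirror the outline above; no implementations are
# filled in because the thread does not name any.
TAXONOMY = {
    "GWAS": {"Mixed Models": [], "Plotting": []},
    "NGS": {"aligners": [], "RNASeq": []},
}


def methods_for(topic):
    """Return the known methods under a topic, or an empty list."""
    return sorted(TAXONOMY.get(topic, {}))


def register(topic, method, implementation):
    """Attach an implementation to a topic/method, creating both if needed."""
    TAXONOMY.setdefault(topic, {}).setdefault(method, []).append(implementation)


print(methods_for("GWAS"))
```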

zmughal commented 8 years ago

You might want to also take a look at this idea from the Scholar Ninja project http://juretriglav.si/discovery-of-scientific-software/ which recommends scientific software while browsing GitHub by extracting software citations from papers.

blahah commented 8 years ago

I have three routes to finding relevant software:

To do a particular kind of analysis, I go in search of the right tool. In this case, I read the literature first. Then I read blogs, forums, BioStar, and search Twitter. And I ask people whose depth of knowledge I respect.
Something comes to my attention passively (via mention on twitter, someone starring a repo on github, it reaches the front page of Hacker News, etc.)
Doing something non-scientific, or not specifically scientific. In this case I actually search for packages or code. Usually on rubygems, npm, sometimes github, or sometimes google by combining keywords about the language with keywords about the functionality I want.

Actually very rarely will I search for scientific code, because unless it is some sort of general utility or plumbing, I care first about whether the underlying method is good, then about whether it is implemented well.

There are many sites which attempt to categorise or provide search of scientific software, but mostly they are much harder to use than google.

schae234 commented 8 years ago

@Blahah, we are mainly driven by method too; you succinctly summarized our approach in your post. Out of curiosity, what is your main 'branch' of research? We are mainly genetics and systems biology. I'm wondering whether workflows differ much between disciplines. Do the physical sciences have organizational approaches the biological sciences don't?

blahah commented 8 years ago

@schae234 computational biology / genomics here, so we overlap considerably I would think.

npch commented 8 years ago

Some initial thoughts:

Does it work with files of format XXX?
Does it implement important-algorithm-in-my-field XXX?
Does it work on platform XXX? (Where XXX is increasingly R, Galaxy, etc.)
What's the license on the code?
When was it last updated? (For some value of "freshness")
Is there an associated paper showing off scientific results produced using the software?
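The questions above could translate fairly directly into search filters. The following is a minimal sketch under that assumption; the `Candidate` fields and the `matches` parameters are hypothetical names chosen for the example, not an existing API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Set


@dataclass
class Candidate:
    """A search result with the (hypothetical) metadata the questions need."""
    name: str
    formats: Set[str]
    algorithms: Set[str]
    platforms: Set[str]
    license: str
    last_updated: date
    paper_doi: Optional[str] = None


def matches(c, today, file_format=None, algorithm=None, platform=None,
            allowed_licenses=None, max_age_days=None, needs_paper=False):
    """Return True if the candidate passes every filter that was supplied."""
    if file_format is not None and file_format not in c.formats:
        return False
    if algorithm is not None and algorithm not in c.algorithms:
        return False
    if platform is not None and platform not in c.platforms:
        return False
    if allowed_licenses is not None and c.license not in allowed_licenses:
        return False
    # "Freshness" is expressed here as a maximum age in days.
    if max_age_days is not None and (today - c.last_updated).days > max_age_days:
        return False
    if needs_paper and c.paper_doi is None:
        return False
    return True


# Placeholder candidate, invented for the example.
tool = Candidate("example-tool", {"csv"}, {"ANOVA"}, {"R"}, "MIT", date(2015, 6, 1))
print(matches(tool, date(2015, 8, 1), file_format="csv", platform="R", max_age_days=90))
```

Filters left as `None` are simply skipped, so a researcher can combine as few or as many criteria as they care about.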

Also, I had more general thoughts about this area in the following two articles:

amb8805 commented 8 years ago

Thanks everyone for the input; it gives us more context and an idea of how to approach the problem. Please watch for new issues as we learn more and need more informed input.

@Blahah what kind of overhead is there with the existing research software search services that makes them hard to use? Could you give an example?

bunnybooboo commented 8 years ago

Comparative dashboard suggesting similar tools. Information surrounding licensing, open/proprietary/free, update activity, github repo, programming language, API, gallery of examples/use cases, data footprint, minimum spec, ratings, frameworks that also incorporate this tool, automation possibilities.

mozillascience / software-discovery-dashboard
