Exploratory Survey Paper on Projects using MyBinder

arnim commented 4 years ago

We had the idea in the April meeting to write a survey paper on the projects using MyBinder, building on the MyBinder.org Events Archive. This could complement the user survey #257 and be a great chance to highlight the value of mybinder.org.

betatim commented 4 years ago

https://discourse.jupyter.org/t/a-datasette-of-mybinder-org-launches/3719 has some example queries for https://binderlytics.herokuapp.com/ which is a SQL DB of every launch event in the archive. It has basic charting and full text search, so might be useful for some quick exploring/getting ideas of what kind of things to look at.

One question I have is: What does a survey paper look like? and What is a survey paper? Do you have a link to a (well written) example so I can get a feeling for what the content looks like?

arnim commented 4 years ago

> One question I have is: What does a survey paper look like? and What is a survey paper? Do you have a link to a (well written) example so I can get a feeling for what the content looks like?

What I would be interested in is learning more about how people use MyBinder. A classic example for Twitter is "What is Twitter, a Social Network or a News Media?"

We/I would do some descriptive statistics (e.g. time series, power laws, number of revisions, ...) based on the archive, but I would also like to look inside the repos themselves (e.g. do they still launch?, language distributions, maybe some topic modeling...). If possible, we could connect or compare this to the results of the Feb survey #256. Some of us did something similar with Wikipedia.

What do you think is interesting? If no one is protesting I would drive the work but I would certainly hope to win some co-authors ;)

betatim commented 4 years ago

Thanks for the links to the papers. (It is quite weird to read a paper from 2010 about "this new service twitter" knowing what we know now in 2020 :D )

Unsorted list of things I'd be interested in finding out:

- what different "types" of repo are there (some types I know exist: documentation/demo, workshop, lecture course)
  - does each type of repo have a "fingerprint" that lets us automate the classification?
- how much do repos change? If they change, what is it that changed (dependencies or just content or both or...)?
- how long do people run a binder that they launch? Does it depend on the type of repo it is?
- what are the "weird things" people do on mybinder.org?
- which institutions/universities/companies use mybinder.org?
- which features of repo2docker do people actually use?
- what is the "oldest" repo we can find that still works (time between two launches of the same revision that both work)
- what does the distribution of the number of launches look like? Does it change over time?
  - a histogram where bin 1 = number of repos that were launched once, bin 2 = number of repos that were launched twice, etc.
- how many repos were made to work with binder/edited to be compatible and how many "just worked"?
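
The launch-count histogram described in the list above could be computed roughly like this. It is only a minimal sketch: the function name and the sample data are hypothetical, and it assumes each launch event has already been reduced to a repo identifier.

```python
from collections import Counter


def launch_count_histogram(launched_repos):
    """Map k -> number of repos that were launched exactly k times.

    `launched_repos` holds one repo identifier per launch event.
    """
    launches_per_repo = Counter(launched_repos)
    # Count how many repos share each launch count, ordered by bin.
    bins = Counter(launches_per_repo.values())
    return dict(sorted(bins.items()))


# Hypothetical sample: six launch events spread over four repos.
sample_launches = ["a/x", "a/x", "b/y", "c/z", "c/z", "d/w"]
print(launch_count_histogram(sample_launches))
```

Running the same function on launches grouped by month or year would be one way to see whether the distribution changes over time.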

arnim commented 4 years ago

I think these are all very interesting things. I'm currently planning to invest about 1 day per week and would report on the progress from time to time in the team meetings.

Would it be a good thing to already open up a latex document? What should we use? At GESIS we use overleaf.com but I'm not sure if this feels too private for what we do. Any suggestions?

sgibson91 commented 4 years ago

I'd maybe recommend working in hackmd to collaborate on the text and then paste into the latex template closer to when we'd like to publish?

arnim commented 4 years ago

Since we seem to have a clear winner (hackmd), I have started https://hackmd.io/@q0BUWrcJSjagQWgAEurVOA/Bk4AoHEKU/edit. I will collect some thoughts in hackmd - feel free to add yours as well ;)

Maybe we switch later (as soon as we know who wants to be in) to some LaTeX env.

arnim commented 4 years ago

@Vildeeide and @minrk's work on project #303 seems to be highly relevant for "How reproducible are repositories?". Maybe we can do something together or at least learn something from them ;)

betatim commented 4 years ago

Aside: Thanks to bitnik's PR the analytics archive (in the latest version) now contains a resolved ref in addition to the unresolved ref. An example: https://binderlytics.herokuapp.com/binder-launches/binder where the latest launch events have master and 1234124124. Might be interesting for finding out how often a repository changes and such.
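
As a rough sketch of how the resolved ref could be used to estimate how often a repository changes, one could count the distinct resolved refs seen per launched spec. The field names `spec` (repo plus unresolved ref) and `ref` (resolved ref) are assumptions about the event records, and the sample events are made up.

```python
from collections import defaultdict


def distinct_refs_per_repo(events):
    """Count the distinct resolved refs observed for each launched spec.

    Assumes each event is a dict with a "spec" key and, for newer
    events only, a "ref" key holding the resolved ref.
    """
    refs = defaultdict(set)
    for event in events:
        resolved = event.get("ref")
        if resolved:  # older events have no resolved ref; skip them
            refs[event["spec"]].add(resolved)
    return {spec: len(seen) for spec, seen in refs.items()}


# Hypothetical events: three launches of one spec at two different commits,
# plus one older event without a resolved ref.
sample_events = [
    {"spec": "org/repo/master", "ref": "aaa111"},
    {"spec": "org/repo/master", "ref": "bbb222"},
    {"spec": "org/repo/master", "ref": "aaa111"},
    {"spec": "other/repo/master"},
]
print(distinct_refs_per_repo(sample_events))
```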

minrk commented 4 years ago

Reposting from gitter for better record: we can use git (or the gh api) to make a decent guess at the resolved ref based on launch time. This is not rigorous, since git allows rewriting history, but should be good enough to serve our purposes if we feel the need to resolve refs in the past to see which commits worked and later stopped working (for example).
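
A minimal sketch of the git-based guess, assuming a local clone is available: `git rev-list -1 --before=<time> <branch>` prints the newest commit reachable from the branch whose commit date is not after the given time. The helper names are hypothetical, and, as noted above, the result is only a guess when history has been rewritten.

```python
import subprocess


def rev_list_command(branch, launch_time):
    """Build the git command that finds the last commit before launch_time."""
    return ["git", "rev-list", "-1", f"--before={launch_time}", branch]


def guess_resolved_ref(repo_dir, branch, launch_time):
    """Return the guessed commit hash, or None if no commit is old enough.

    `repo_dir` must be a local clone; `launch_time` is any date string
    git understands, e.g. an ISO 8601 timestamp.
    """
    result = subprocess.run(
        rev_list_command(branch, launch_time),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() or None
```

For example, `guess_resolved_ref("path/to/clone", "master", "2020-05-01T12:00:00+00:00")` would return the commit that `master` most likely pointed to at that launch time.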

bitnik commented 4 years ago

thanks @minrk ! Actually I thought about using git history, but then I was not sure how to do it for the other repo providers, and also git history can be rewritten, as you already mentioned. So I decided to make that PR to make this info easily available. But now I am not sure whether having an extra column is a good idea in the long term.

minrk commented 4 years ago

I definitely think adding the ref was a good thing to do! I was mainly thinking of if someone wanted to use the new column but also go back further in time before the column started being populated, a reasonable guess could be made most of the time with git data (probably easier with git than the GitHub API, but of course requires a git clone).

bitnik commented 4 years ago

btw here is the repo that we are working in for this project: https://github.com/gesiscss/binder_paper_20

any kind of support is much appreciated :)

minrk commented 4 years ago

Great! We've been testing with https://github.com/minrk/repo2docker-checker to build repos and are going to collect some of our preliminary results in a poster for JupyterCon.

minrk commented 4 years ago

Our preliminary results sampling repos with repo2docker are coming together here. A notable difference from the proposed study based on the Binder Analytics Archive is that we were digging into the "automating existing community best practices" idea, which means sampling repos that don't already use Binder. For that, we used this dataset to specify subsets of repos to sample which have already been tested by other means. Some highlighted conclusions:

- We found very few repos with R or Julia notebooks that triggered installation of R or Julia (actually zero for Julia) that were not created specifically for Binder. This suggests that our R and Julia specifications are not in fact community practices adopted outside existing Binder users.
- repo2docker picking latest Python by default is responsible for a lot of failures because there is not a standard, widely adopted way to specify Python versions. Pipfiles support this, but are not widely used. Lots of folks are being prudent by strictly pinning a lot of packages in requirements.txt, but this is guaranteed to break eventually if Python itself is not also pinned, unless repo2docker changes how it picks the Python version to install. numpy and pandas, in particular, were commonly pinned to versions that work with Python 3.6, but would not build on 3.8.

Leading to the (as yet untested) hypotheses that we could greatly increase successful env creation if we modify how we pick what runtime(s) to install:

- pick the default Python (or other runtime) version based on the last commit date. I think we should do this.
- look in notebook metadata for runtime and runtime version info (this may be going too far)

We haven't tested these for whether they really do increase the success rate, but based on the failures we have seen, I would be surprised if they do not.
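
To make the first hypothesis concrete, here is a minimal sketch of picking a default Python version from the last commit date. This is not how repo2docker currently behaves; the function name, the release table and the fallback are assumptions for illustration. The dates are the CPython x.y.0 release dates.

```python
from datetime import date

# First release date of each CPython minor version, oldest first.
PYTHON_RELEASES = [
    (date(2015, 9, 13), "3.5"),
    (date(2016, 12, 23), "3.6"),
    (date(2018, 6, 27), "3.7"),
    (date(2019, 10, 14), "3.8"),
]


def default_python_for(last_commit, fallback="3.5"):
    """Pick the newest Python that had been released by the last commit date.

    Falls back to `fallback` when the commit predates every listed release.
    """
    chosen = fallback
    for released, version in PYTHON_RELEASES:
        if released <= last_commit:
            chosen = version
    return chosen


# A repo last touched in early 2018 would get 3.6 rather than the latest.
print(default_python_for(date(2018, 1, 15)))
```

Under this rule, a repo that pinned numpy and pandas to versions that work with 3.6 and was last committed to before 3.7 came out would be built on 3.6. Whether that actually raises the success rate is still untested, as noted above.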

sgibson91 commented 4 years ago

> We found very few repos with R or Julia notebooks that triggered installation of R or Julia (actually zero for Julia) that were not created specifically for Binder. This suggests that our R and Julia specifications are not in fact community practices adopted outside existing Binder users.

I'm slightly sad that this didn't come in time to make it into my JuliaCon slides that I published yesterday, but this would be a great thing to discuss during my Birds of a Feather session! :raised_hands:

minrk commented 4 years ago

The big caveat to our results, especially for Julia, is that we used the repo search results of the earlier study, which was last updated in 2018, and the Julia project.toml work is pretty recent. We may get different results if project.toml adoption has picked up in the last 18 months. The reproducibility study actually has the code to "resume" their GitHub data collection, but we haven't done that.

choldgraf commented 4 years ago

I apologize for being "that guy" all the time in these threads, but is there any way that we could use the results here to make a pitch for funding for Binder? Maybe as a part of EOSS? (https://github.com/jupyterhub/team-compass/issues/320)

arnim commented 4 years ago

Having something to showcase the value in numbers & graphs was one of the ideas behind it.

arnim commented 4 years ago

Hi all,

do you think we should have a short meeting to discuss how we can better collaborate and which parts we should prioritize? Would Thursday at 13:00 UTC or 16:00 UTC be good? What would work best for you?

choldgraf commented 4 years ago

Are you talking about for this paper specifically? I am happy to join a call if I'm available, but I probably shouldn't be a blocker on this because I'm expecting a kid any day now 😬

arnim commented 4 years ago

I think the question for me was what things could help for #320 & until when? However, (unfortunately too late) I've seen that @minrk is on a well-earned vacation but it would certainly be great to hear his advice. Maybe we should postpone this until after 8/7 ...

choldgraf commented 4 years ago

well, if we're prioritizing for the CZI proposal then we need to come up with something before August 4th :-)

I think the main things to prioritize are any information that meshes with CZI's mission, things like:

- Biological sciences usage
- Scientific reproducibility / openness
- Teaching and collaboration
- Impact across diverse communities (e.g. outside of North America / Europe, non-fancy institutions, etc.)

arnim commented 2 years ago

https://github.com/atrisovic/dataverse-r-study could be interesting for this

jupyterhub / team-compass

Exploratory Survey Paper on Projects using MyBinder #277