sparcopen / open-research-doathon

Open Research Data do-a-thon in London & Virtual - March 4th & 5th

Advocacy tools to justify opening up private data sets #41

Closed rfinean closed 7 years ago

rfinean commented 7 years ago

What arguments have commercial companies successfully used with their investors to justify "giving away" their data to the community for free?

C21Beancounter commented 7 years ago

Take a look at: http://theodi.org/the-value-of-open-data

npscience commented 7 years ago

The wealth of use cases for open data is shown here: http://frictionlessdata.io/user-stories/ (not an exhaustive list)

npscience commented 7 years ago

How open data is being used in the EU: https://www.europeandataportal.eu/en/highlights/how-open-data-being-re-used-europe

rfinean commented 7 years ago

Like a VC fund spreading its investment amongst hundreds of small companies (with the expectation that most will fail, a few will muddle along, but a couple will succeed spectacularly), 'giving away' our data on an 'attribution-only' basis allows thousands of researchers all over the world to use our data to do hundreds of different things, all findable by us because they cite their use of our data. Some of this work may well closely match our development goals and is likely to achieve results far more quickly than finding, hiring and managing such talent internally.

rfinean commented 7 years ago

After a lot of searching for my specific use case of 'human physiological observations data', I had to narrow it down to 'in critical care' before finding any data at all. Disappointingly, re3data.org didn't have any vitals measurement data sets. Finally I found http://mimic.physionet.org/, which includes data like the data I'd like to publish. I actually came across MIMIC back in 2012 (see also this API) and am surprised that it is still the only repository of hospital observations data that I can find. MIMIC aspired to migrate to the Observational Medical Outcomes Partnership Common Data Model, which is a 2014 standard for this kind of data (SQL schemas all on GitHub).

MIMIC's approach to attribution is to ask researchers to cite, in every paper that uses the database, the key 2016 article that the MIMIC team published in Scientific Data (a Nature Research journal) to announce it. That allows us to see the work of those who used the data in their research.

Daniel-Mietchen commented 7 years ago

In case it's useful for #33 or #51, the Wikidata ID of that paper is Q28871995.

rfinean commented 7 years ago

In case it's useful for your tests, here is the article's IPython source: https://github.com/MIT-LCP/mimic-iii-paper/

Daniel-Mietchen commented 7 years ago

@rfinean Thanks for the pointer - that is actually on our list (line 78), and I'll dive right into it.

rfinean commented 7 years ago

If we look into older articles about MIMIC-II (from 2011), we can see thousands more citations.

Daniel-Mietchen commented 7 years ago

@rfinean @tompollard None of the three notebooks in https://github.com/MIT-LCP/mimic-iii-paper/tree/master/notebooks ran through without error. I only documented the first error for one of them.

rfinean commented 7 years ago

The Open Data Handbook is a good resource for advocating for publishing data in a FAIR way.

tompollard commented 7 years ago

Hi @Daniel-Mietchen, interesting to see this conversation here! Please could you point me to the issue that you had running the MIMIC-III notebook? It certainly was working and I'd like to fix it.

I assume the cause is either (1) updates to packages or (2) testing the notebook on the current version of MIMIC (v1.4) rather than the previous version it was written for.

Daniel-Mietchen commented 7 years ago

@tompollard It's line 78 in this spreadsheet. For background, see https://markwoodbridge.com/2017/03/05/jupyter-reproducible-science.html .

tompollard commented 7 years ago

Thanks @Daniel-Mietchen. The problem seems to be that the test user was trying to run a Python 2 notebook using Python 3.

All gave: "I couldn't find a kernel matching Python 2. Please select a kernel:"

I guess we could provide a virtual environment of some sort. Adding a requirements file to help with package installation would be useful, so I'll try to get around to this.

The user would also need access to a password-protected dataset, so it's difficult to avoid a small amount of setup.
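
For anyone testing notebooks in bulk, here is a minimal sketch of how a Python 2 vs Python 3 mismatch like this could be flagged before trying to run anything. It assumes the notebooks follow the nbformat 4 layout, where `metadata.language_info.version` and `metadata.kernelspec.name` are usually (but not always) present. The function names and the example metadata are hypothetical, not taken from the MIMIC-III repository.

```python
import json
import sys


def notebook_python_major(notebook_json):
    """Return the Python major version a notebook declares, or None if unknown."""
    metadata = notebook_json.get("metadata", {})
    # nbformat 4 notebooks usually record the language version, e.g. "2.7.12".
    version = str(metadata.get("language_info", {}).get("version") or "")
    if version[:1].isdigit():
        return int(version.split(".")[0])
    # Fall back to the kernel name, e.g. "python2" or "python3".
    kernel = str(metadata.get("kernelspec", {}).get("name") or "")
    if kernel.startswith("python") and kernel[6:7].isdigit():
        return int(kernel[6])
    return None


def kernel_mismatch(notebook_json, running_major=sys.version_info[0]):
    """True if the notebook declares a Python major version other than the one in use."""
    declared = notebook_python_major(notebook_json)
    return declared is not None and declared != running_major


# Hypothetical, minimal notebook metadata for illustration only.
example = json.loads('{"metadata": {"kernelspec": {"name": "python2"}}, "cells": []}')
print(notebook_python_major(example))
print(kernel_mismatch(example, running_major=3))
```

In practice the dictionary would come from `json.load` on the `.ipynb` file itself. A check like this only reports what the notebook declares; it says nothing about whether the required packages or the dataset credentials are available.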

Daniel-Mietchen commented 7 years ago

Hi @tompollard, the test user in this case was me, and the exercise here was just to see whether notebooks would run through, documenting the first problem if not. In a second run, we will go over the corpus again and document all issues that pop up on the way and, to the extent possible, the steps needed to get the notebooks to run. You are most welcome to join the effort over in issue 25.