Map out the workflow from data collection to interactive and reproducible data publication

npscience commented 7 years ago

[//]: # "======================= Instructions for posting issues: (1) Write your post below using Markdown (as per https://guides.github.com/features/mastering-markdown/ ) or just plain text. (2) Add the appropriate labels to your post, e.g. 'idea' or 'lightning talk' (or both for idea pitches), 'interests', 'skills' or 'experience'. Use 'help wanted' as needed. (3) Don't worry about these introductory lines - you can leave or delete them, as they won't display anyway (you can check this via Preview). (4) Hit the 'Submit new issue' button. ============================"

I'd like to offer a lightning talk on an idea for how a journal could display open research data in a more engaging and useful manner for researchers, including a showcase of a few known tools already available to bring this idea to light. The aim is to inspire attendees to answer/address the following questions, perhaps collating resources in a wiki:

what tools do we need to enable researchers to prepare their data for open, interactive, and reproducible publication, from day one?
what tools do researchers already use to collect, analyse, store, and document their data? do they export in standard formats for open data?
where are the blockers in the flow of information? what tools are needed to break these barriers?
what features do readers of research want to see on the journal page? how would they like to interact with the figure and underlying data?
what will help researchers to adopt the open and reproducible data workflow?

Daniel-Mietchen commented 7 years ago

Thanks — I've added the lightning talks tag. Since this does go deeper and has potential for becoming the focus of a team at the event, I also added the idea tag.

bmkramer commented 7 years ago

Sounds good! Some resources that might be useful for work on this:

our database of app. 600 scholcomm tools across the research workflow: http://bit.ly/innoscholcomm-list (mostly online tools, additions welcome!)
survey results showing what (combination of) tools researchers actually use for various aspects of their workflow:

-- raw dataset Zenodo or Kaggle -- dashboard -- tool combinations (interactive Google sheet)

More general info on the project these are all results from: 101 Innovations in Scholarly Communication

Can't really contribute today/tomorrow due to other commitments, but will try to check in at some point!

npscience commented 7 years ago

Thanks Bianca - super helpful!

goodwingibbins commented 7 years ago

I'm really interested in this! Been wanting to make some pathways for the openly-available climate data (https://www.esrl.noaa.gov/gmd/ccgg/trends/) to be turned into takeaway arguments/points about climate change, hopefully to remove the "us versus them" mentality of "scientists say this, so you should/shouldn't trust it blindly".

One issue that comes up with things like Jupyter etc is differentiating the way computers need to be spoken to and what the audience needs for transparency.

One possible connection is the ideas here: http://worrydream.com/#!/LearnableProgramming

edsaperia commented 7 years ago

I am currently collecting data that I want to publish effectively, so I guess I'm a user of this project? Is that helpful?

npscience commented 7 years ago

@edsaperia Yes, user insight will be crucial.

Do you know of any tools already that you would use to illustrate your data collection and analysis steps, so that when you come to publish, readers can see what you've done and perhaps give it a go themselves?

Alternatively, would you like to play with any of the tools listed at https://github.com/sparcopen/open-research-doathon/blob/master/reproducible_open_data_resources.md and see what you think of them? Would they work for you? Why (not)?

npscience commented 7 years ago

@goodwingibbins This is totally on point. The trouble with lots of these tools is that they are geared towards the programmer-user. If there's a way to adapt them for people less comfortable with the lingo, that would be very useful.

To start:

what are the difficulties?
how would we translate them?
is it possible to make a friendlier version or add-on for current tools? Or do we need new ones specifically for the traditional tool user (I'm thinking Excel)

npscience commented 7 years ago

@edsaperia's resources:

1) WikiMedia data visualisation framework is vega: http://vega.github.io/ and https://vega.github.io/vega/

open source, maintained by the vega community, supported by wikipedia
data input as json, can create visualisations, and change colours, etc
ideal workflow might be to offer support to translate from dataset to json (if not already), produce graph to show as static (using form interface), and also submit the code for the graph so that readers can modify as they wish. Author doesn't need to see code.
d3.js is more powerful

2) Is there a standard taxonomy for methodology in the life sciences? i.e. for a reproducible auditable document from data --> analysis --> visualisation.

For example, there are academics who produce/research these methodologies, e.g. LSE's department of methodology (for social sciences) http://www.lse.ac.uk/methodology/Home.aspx

gpa-smith commented 7 years ago

One issue that we have identified that distances the final data visualisation from the 'core' data set is the retrospective way data management may kick in at the publication stage, attempting to make sense of essentially unstructured data at the end of the process.

Is this a dominant issue for researchers?
Would better data curation tools and resources form a useful solution? e.g. services or platforms offered earlier in the process by institutions, repositories or journals?
Do technical solutions along the lines of Jupyter notebooks help to mitigate these issues, allowing more interactive links from data to final visualisation? Is widening their appeal and use beyond computational science thus a priority?

bmkramer commented 7 years ago

Thanks @npscience for pointing out Vega, didn't know that one! @Daniel-Mietchen: with the Wikimedia Graph extension, another one to add to the Wikimedia workflow? (referencing a separate discussion yesterday)

Further responding to/building on @npscience comment above:

1) Two other tools for possibly bridging the gap between 'clicking & coding':

Plot.ly - import data as spreadsheet, graphical interface, web-based, export graphs as images or code, free basic version & paid plans, but open source here: (https://github.com/plotly)
The Gamma - work spreadsheet-based, generate code. Open source (MIT license) Disclaimer: I have not yet really looked at/tested/worked with this.

2) Two systems for standard taxonomy of biomed/life science workflows (but mostly aimed towards experimental stage of the workflow):

Autoprotocol - open standard for specifying experimental protocols
ISA Framework -open source, "helps you to provide rich description of the experimental metadata (i.e. sample characteristics, technology and measurement types, sample-to-data relationships) so that the resulting data and discoveries are reproducible and reusable" (quoted from site)

Daniel-Mietchen commented 7 years ago

@bmkramer In terms of workflows, I would not add individual MediaWiki extensions, just mention that there are thousands of them that together cover all aspects of many research cycles. I had briefly mentioned (but not shown) one of them yesterday: https://www.mediawiki.org/wiki/Extension:Jmol .

bmkramer commented 7 years ago

@gpa-smith Some thoughts on this, following your 3 questions:

Agree that this is a general data-management issue, and as such, should indeed be part of the research workflow from the beginning, not as an add-on at publication (as with data management in general!)
Which is why I also think solutions in the form of tools/platforms etc should be journal- and publisher independent (but journals/publishers should accommodate (and ideally require) e.g. code-based viz at submission).
And yes, I think it would be great if executable visualizations / generating code-based viz would be part of workflows beyond computational science. More researchers that learn to code would help in general (looking sternly at myself, too...), but for more widespread adoption, tools/platforms that make this possible for non-coders would be big help in moving this practice forward.

npscience commented 7 years ago

@bmkramer - Thank you, I'll explore these new tools and standards. (Edited March 8 to remove mention of bkramer, incorrect handle)

All - is it worth creating a map of this space? Or an ideal workflow to see where the gaps remain?

gpa-smith commented 7 years ago

@bmkramer - ISA framework is an interesting one; the journal Scientific Data uses ISA-Tab to generate the structured side of metadata for its Data Descriptor articles, which are focused around datasets as opposed to traditional articles that have data submitted as supporting material.

The ability to create machine readable metadata for other article types at earlier stages, or at least to feed into something like the ISA-framework at an end point would be beneficial. We have talked about a similar process for integration between something simple like an excel spreadsheet feeding into a JSON solution like Vega.

The early collaborative working space is a useful area to look at developing, for example https://github.com/jupyter/colaboratory, Google drive integration

npscience commented 7 years ago

Ok. Tomorrow's tasks, for me at least (feel free to add):

find out more about all of the above.
map out the flow, what is already being done, what is needed, what are the opportunities for each of us to take these forward.

fionabradley commented 7 years ago

Does the http://www.nltk.org/ fit in here?

I agree with Bianca that easy tools for the non-coder are essential. Tableau Community is nice but a suite of open source tools is ideal. I just learning Python because it's popular in humanities and social science (along with R) but it will be a long time before being able to do anything useful with it. :)

bmkramer commented 7 years ago

One other aspect of this as a workflow is integration in the writing process. Overleaf and Authorea both (in varying aspects) integrate with Jupyter notebooks, for example, and Authorea works with git-based versioning.

Integration with such a workflow would also allow publishers to stimulate/facilitate reproducible reporting, while not tying that aspect of manuscript preparation/submission to a locked-in, proprietary system*. With preprint services offering similar integrations, focus could be more on publications themselves than on publication venue.

Back to workflows, I also like Kieran Healy's take on the difference between the 'office based' and the 'engineering model' http://plain-text.co

*Elsevier at some point piloted executable papers (again, for computer science only), but then dropped the pilot: https://www.elsevier.com/physical-sciences/computer-science/executable-papers-improving-the-article-format-in-computer-science

npscience commented 7 years ago

Check out the Data Stack at https://blog.liip.ch/archive/2017/02/13/data-stack.html

Tools to consider:

rawgraphs.io
Apache Hadoop
Tableau
https://python-xy.github.io/

npscience commented 7 years ago

Tasks: [] map out the basic workflow for a researcher, from data collection to publication. Include steps for creating figures from data, that are both interactive and reproducible [] populate the workflow with current tools ^^ this requires: [] knowledge of tools used by life scientists (analyze 101innovations data) @npscience doing this [] understand the input/output file types of each tool [] is the tool non-proprietary? at least: can you output data and analysis script in open standards?

bkramer commented 7 years ago

Hi,

You have inadvertently cc'd me (Brian Kramer brian@mitchkram.com) on this thread.

-b

Sunday, March 5, 2017, 06:19 -0500 from Bianca Kramer notifications@github.com:

One other aspect of this as a workflow is integration in the writing process. Overleaf and Authorea both (in varying aspects) integrate with Jupyter notebooks, for example, and Authorea works with git-based versioning. Integration with such a workflow would also allow publishers to stimulate/facilitate reproducible reporting, while not tying that aspect of manuscript preparation/submission to a locked-in, proprietary system. With preprint services offering similar integrations, focus could be more on publications themselves than on publication venue. Back to workflows, I also like Kieran Healy's take on the difference between the 'office based' and the 'engineering model' http://plain-text.co Elsevier at some point piloted executable papers (again, for computer science only), but then dropped the pilot: https://www.elsevier.com/physical-sciences/computer-science/executable-papers-improving-the-article-format-in-computer-science — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub , or mute the thread .

npscience commented 7 years ago

What data analysis tools for life scientists use?

Source data: 101innovations Authors: B. Kramer, J. Rosman What is this? 2015-2016 survey of tools used in scholarly workflow

Aim: Identify the common tools used for data analysis by life scientists, to inform the workflow map (mentioned above).

Methods:

Uploaded cleaned-data-innovations-in-scholarly-communication-survey_def.csv to OpenRefine. Filtered for LIFESCIENCES responders only. Downloaded: 101innovations-data-npscience.csv
Made array into single column, kept ROLE field associated. Removed all other fields (note to self: next time, keep the ID field to serve as a lookup function).
Added in the free-text tools: for those who listed other tools, analysed the 'ANALYZSPECCL' column data --> included ROLE field with tool specified in free text. Cleaned the free text ('Other') into defined tools ('Other-clean') where possible (see 'tool cleaning categories.csv').
EXCLUSIONS: Some of the free text tools specified tools included as predefined tools in the survey. Checked back into raw data to see where entrant had selected tool and specified it in free text too, to exclude these duplications. These were: --- 7718acf7db511b2d56378405e1c41d34,PhD Student,iPython --- 901081a86fa72f183d9c5c1dea44d339,Professor/...,MS Excel --- 0650e0e9e8c109f7137f9efdd706ed2c,Professor/...,MS Excel --- 31f20ad6aff37b44918bd66ba57eb27a,Professor/...,MS Excel --- 7c63d1938b3a772d53f592f510e3b60a,Professor /...,R --- 0e7f8fc2486739f81b6ac760c6c15cdc,PhD student,R statistics (=R) --- 91c960039bdb23cc2c71b1f11f9c009f,Librarian,SPSS --- 31f20ad6aff37b44918bd66ba57eb27a,Professor/...,SPSS --- 96582f92afa4d69ce1e636d51a927c79,Professor/...,SPSS

Results:

output dataset: '101innovations-npscience-lifesci-analysistools.csv' https://github.com/npscience/open-research-doathon/blob/master/101innovations-npscience-lifesci-analysistools.csv
tool value cleaning data: https://github.com/npscience/open-research-doathon/blob/master/tool%20cleaning%20categories.csv
results data (count of tools total and by role using PivotTable): https://github.com/npscience/open-research-doathon/blob/master/101innovations-npscience-lifesci-analysistools-counts.csv

npscience commented 7 years ago

Side comments: Plot.ly is not easy to use. And slow.

HKLondon commented 7 years ago

6 high level workflows for publishing data in more traditional journals: https://docs.google.com/spreadsheets/d/1A_cFRnN6_j5bpUAFqv6Fb14xyJAtjKSg_jf-qCfriiY/edit?usp=sharing

npscience commented 7 years ago

@HKLondon Great! Here's the url for the workflow diagram [WIP] https://www.draw.io/#G0B_a2JekZMrW8Z0xRc3hoR1Baczg

HKLondon commented 7 years ago

Whilst most academic publishers can link to data it seems like very few academic publishers can (easily) publish interactive data sets like the OECD: http://stats.oecd.org/index.aspx?DataSetCode=PDB_LV or integrate interactive figures within the HTML versions of articles (many more examples of interactive PDFs see https://peerj.com/preprints/1594.pdf.

Some examples:

3D visualization (Elsevier)
Animated figures (Interactions)
Interactive graphic (F1000)
Interactive figure (Nature Chemistry)
Publisher produced interactive infographics (BMJ)
Crystallography figures (JOURNAL OF APPLIED CRYSTALLOGRAPHY) Interactive plots (Elsevier)

Might be interesting to survey publishers to find out what the stumbling blocks are... publishers slow to change, few researchers wanting to publish interactive items, complexity of managing these items through submission process, tagging issues in article XML/JATS files, problems with platform integrations, long term archiving issues - including problems with submission of files to PubMed Central, etc...

pherterich commented 7 years ago

Remembered this a bit too late, but there was a RDA working group on publishing workflows, but it might be a bit too generic compared to what you're interested in. http://doi.org/10.5281/zenodo.20308

npscience commented 7 years ago

Outstanding: [x] create interactive visualisation of the 'analysis tools that life scientists use' data Files needed are at https://github.com/npscience/open-research-doathon

...count.csv
master.R --> RShiny app (thanks to @bjw49 :D) Remaining tweaks for the Shiny: [x] order bars by highest count first (native order in the csv); currently alphabetical [] transpose x<-->y so that tool names are on y-axis [] select roles to display as grouped bars (e.g. show all, PhD and professor bars)

--> this is happening in my repo at: https://github.com/npscience/open-research-doathon/issues/2

Notes: really difficult for a novice to start using any of the above tools for visualisation....

rossmounce commented 7 years ago

@npscience your comments about plotly befuddle me!

In my experience plot.ly was great to go from data to interactive, configurable visualisations with rapidity. Especially for data layouts I wasn't familiar with e.g. choropleth maps. Ultimately I didn't find it quite had capability to do all the complicated fiddling necessary for "publication quality" figures - I had to dive back into R and do it 'the hard way'.

But for quick, interactive, exploratory data analysis i still find plotly very easy to use - definitely here to stay in my playbook.

npscience commented 7 years ago

@rossmounce noted, the more opinions the better, so thanks for chiming in. I think there's a huge gap in our literacy here; but I'm on the upward learning curve.

rossmounce commented 7 years ago

@npscience having said all that I haven't tried Tableau so maybe Tableau or other such services are even better than plot.ly, but from a standpoint of a user with experience of spreadsheet software, R and plot.ly (admittedly limited experience of the wide breadth of available options!) I can definitely see plot.ly & web services like it have a niche / use-case. If R is one's base reference (as is the case for many biologists?) almost anything else is going to be "easier" & "quicker" !

sparcopen / open-research-doathon

Map out the workflow from data collection to interactive and reproducible data publication #26