sparcopen / open-research-doathon

Open Research Data do-a-thon in London & Virtual - March 4th & 5th
Other

Map out the workflow from data collection to interactive and reproducible data publication #26

Closed npscience closed 7 years ago

npscience commented 7 years ago


I'd like to offer a lightning talk on an idea for how a journal could display open research data in a more engaging and useful manner for researchers, including a showcase of a few known tools already available to bring this idea to light. The aim is to inspire attendees to answer/address the following questions, perhaps collating resources in a wiki:

Daniel-Mietchen commented 7 years ago

Thanks — I've added the lightning talks tag. Since this does go deeper and has potential for becoming the focus of a team at the event, I also added the idea tag.

bmkramer commented 7 years ago

Sounds good! Some resources that might be useful for work on this:

- raw dataset: Zenodo or Kaggle
- dashboard
- tool combinations (interactive Google sheet)

More general info on the project these are all results from: 101 Innovations in Scholarly Communication

Can't really contribute today/tomorrow due to other commitments, but will try to check in at some point!

npscience commented 7 years ago

Thanks Bianca - super helpful!

goodwingibbins commented 7 years ago

I'm really interested in this! Been wanting to make some pathways for the openly-available climate data (https://www.esrl.noaa.gov/gmd/ccgg/trends/) to be turned into takeaway arguments/points about climate change, hopefully to remove the "us versus them" mentality of "scientists say this, so you should/shouldn't trust it blindly".

One issue that comes up with things like Jupyter etc is differentiating the way computers need to be spoken to and what the audience needs for transparency.

One possible connection is the ideas here: http://worrydream.com/#!/LearnableProgramming

edsaperia commented 7 years ago

I am currently collecting data that I want to publish effectively, so I guess I'm a user of this project? Is that helpful?

npscience commented 7 years ago

@edsaperia Yes, user insight will be crucial.

Do you know of any tools already that you would use to illustrate your data collection and analysis steps, so that when you come to publish, readers can see what you've done and perhaps give it a go themselves?

Alternatively, would you like to play with any of the tools listed at https://github.com/sparcopen/open-research-doathon/blob/master/reproducible_open_data_resources.md and see what you think of them? Would they work for you? Why (not)?

npscience commented 7 years ago

@goodwingibbins This is totally on point. The trouble with lots of these tools is that they are geared towards the programmer-user. If there's a way to adapt them for people less comfortable with the lingo, that would be very useful.

To start:

npscience commented 7 years ago

@edsaperia's resources:

1) Wikimedia's data visualisation framework is Vega: http://vega.github.io/ and https://vega.github.io/vega/

2) Is there a standard taxonomy for methodology in the life sciences? i.e. for a reproducible auditable document from data --> analysis --> visualisation.

For example, there are academics who produce/research these methodologies, e.g. LSE's department of methodology (for social sciences) http://www.lse.ac.uk/methodology/Home.aspx
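On the Vega point above: Vega (and its simpler sibling Vega-Lite) is declarative, so a chart is just a JSON document. As a rough illustration (the data values below are made up, not from the thread), here is a minimal bar-chart spec written as a Python dict so it can be serialised with the stdlib:

```python
import json

# Minimal Vega-Lite bar-chart specification (hypothetical survey counts).
# The resulting JSON could be pasted into the Vega online editor, or
# embedded by a renderer such as the Wikimedia Graph extension.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": [
        {"tool": "R", "users": 120},
        {"tool": "SPSS", "users": 80},
    ]},
    "mark": "bar",
    "encoding": {
        "x": {"field": "tool", "type": "nominal"},
        "y": {"field": "users", "type": "quantitative"},
    },
}

print(json.dumps(spec, indent=2))
```

Because the whole chart is data rather than code, the same spec can be versioned, diffed, and republished alongside the dataset it describes.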

gpa-smith commented 7 years ago

One issue we have identified, which distances the final data visualisation from the 'core' dataset, is the retrospective way data management often kicks in at the publication stage, attempting to make sense of essentially unstructured data at the end of the process.

  1. Is this a dominant issue for researchers?
  2. Would better data curation tools and resources form a useful solution? e.g. services or platforms offered earlier in the process by institutions, repositories or journals?
  3. Do technical solutions along the lines of Jupyter notebooks help to mitigate these issues, allowing more interactive links from data to final visualisation? Is widening their appeal and use beyond computational science thus a priority?
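To make question 3 concrete: the appeal of a notebook is that every step from raw data to figure is executable code, so the whole chain is auditable. A stdlib-only sketch with invented numbers (a real notebook would read a file and end with a plotting call):

```python
import csv
import io
import statistics

# Hypothetical raw data, as it might arrive from an instrument export.
raw = io.StringIO("sample,value\nA,1.2\nA,1.4\nB,1.9\n")

# Analysis step: group measurements by sample and take the mean.
by_sample = {}
for row in csv.DictReader(raw):
    by_sample.setdefault(row["sample"], []).append(float(row["value"]))
means = {sample: statistics.mean(vals) for sample, vals in by_sample.items()}

# Visualisation step (a plotting call in a real notebook); here just report it.
for sample, mean in sorted(means.items()):
    print(f"{sample}: {mean:.2f}")
```

Anyone rerunning the cells regenerates the figure from the raw file, which is exactly the data --> analysis --> visualisation audit trail discussed above.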
bmkramer commented 7 years ago

Thanks @npscience for pointing out Vega, didn't know that one! @Daniel-Mietchen: with the Wikimedia Graph extension, another one to add to the Wikimedia workflow? (referencing a separate discussion yesterday)

Further responding to/building on @npscience comment above:

1) Two other tools for possibly bridging the gap between 'clicking & coding':

2) Two systems for standard taxonomy of biomed/life science workflows (but mostly aimed towards experimental stage of the workflow):

Daniel-Mietchen commented 7 years ago

@bmkramer In terms of workflows, I would not add individual MediaWiki extensions, just mention that there are thousands of them that together cover all aspects of many research cycles. I had briefly mentioned (but not shown) one of them yesterday: https://www.mediawiki.org/wiki/Extension:Jmol .

bmkramer commented 7 years ago

@gpa-smith Some thoughts on this, following your 3 questions:

npscience commented 7 years ago

@bmkramer - Thank you, I'll explore these new tools and standards. (Edited March 8 to remove mention of bkramer, incorrect handle)

All - is it worth creating a map of this space? Or an ideal workflow to see where the gaps remain?

gpa-smith commented 7 years ago

@bmkramer - ISA framework is an interesting one; the journal Scientific Data uses ISA-Tab to generate the structured side of metadata for its Data Descriptor articles, which are focused around datasets as opposed to traditional articles that have data submitted as supporting material.

The ability to create machine-readable metadata for other article types at earlier stages, or at least to feed into something like the ISA framework at an end point, would be beneficial. We have talked about a similar process for integration between something simple like an Excel spreadsheet feeding into a JSON solution like Vega.
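The Excel-to-JSON idea could be as small as this sketch: export the sheet as CSV, then reshape the rows into the `values` array a JSON grammar like Vega expects (the column names and figures here are invented):

```python
import csv
import io
import json

# Stand-in for a CSV export of a simple spreadsheet.
csv_text = "country,value\nUK,2.9\nFR,2.7\n"

# Reshape rows into the list-of-records form Vega/Vega-Lite consumes.
values = [
    {"country": row["country"], "value": float(row["value"])}
    for row in csv.DictReader(io.StringIO(csv_text))
]

# Embed as the data block of a Vega/Vega-Lite specification.
data_block = {"data": {"values": values}}
print(json.dumps(data_block))
```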

The early collaborative working space is a useful area to look at developing, for example https://github.com/jupyter/colaboratory, which offers Google Drive integration.

npscience commented 7 years ago

Ok. Tomorrow's tasks, for me at least (feel free to add):

fionabradley commented 7 years ago

Does the http://www.nltk.org/ fit in here?

I agree with Bianca that easy tools for the non-coder are essential. Tableau Community is nice, but a suite of open source tools is ideal. I'm just learning Python because it's popular in the humanities and social sciences (along with R), but it will be a long time before I can do anything useful with it. :)
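NLTK would fit wherever the published data is text. It brings its own tokenisers and corpora, but the basic shape of a frequency analysis (the kind of first step NLTK automates) can be shown with the stdlib alone, using a toy sentence rather than real corpus data:

```python
from collections import Counter

# Toy "corpus": in practice this would be the text of a dataset or article.
text = "open data needs open tools and open workflows"

# Naive whitespace tokenisation plus a frequency count.
counts = Counter(text.split())
print(counts.most_common(2))
```

NLTK layers proper tokenisation, stemming, part-of-speech tagging, and ready-made corpora on top of this pattern.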

bmkramer commented 7 years ago

One other aspect of this as a workflow is integration in the writing process. Overleaf and Authorea both (in varying aspects) integrate with Jupyter notebooks, for example, and Authorea works with git-based versioning.

Integration with such a workflow would also allow publishers to stimulate/facilitate reproducible reporting, while not tying that aspect of manuscript preparation/submission to a locked-in, proprietary system*. With preprint services offering similar integrations, focus could be more on publications themselves than on publication venue.

Back to workflows, I also like Kieran Healy's take on the difference between the 'office based' and the 'engineering model' http://plain-text.co

*Elsevier at some point piloted executable papers (again, for computer science only), but then dropped the pilot: https://www.elsevier.com/physical-sciences/computer-science/executable-papers-improving-the-article-format-in-computer-science

npscience commented 7 years ago

Check out the Data Stack at https://blog.liip.ch/archive/2017/02/13/data-stack.html

Tools to consider:

npscience commented 7 years ago

Tasks:

- [ ] map out the basic workflow for a researcher, from data collection to publication, including steps for creating figures from data that are both interactive and reproducible
- [ ] populate the workflow with current tools; this requires:
  - [ ] knowledge of tools used by life scientists (analyse 101innovations data) -- @npscience doing this
  - [ ] understanding the input/output file types of each tool
  - [ ] is the tool non-proprietary? at least: can you output data and analysis script in open standards?
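For the "open standards" check in the task list, one practical test is whether the result table round-trips through plain CSV and JSON with no proprietary reader. A sketch with invented counts:

```python
import csv
import io
import json

# Hypothetical analysis result to be published alongside the figure.
result = [{"tool": "R", "count": 41}, {"tool": "SPSS", "count": 27}]

# CSV export (open, spreadsheet-friendly).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["tool", "count"])
writer.writeheader()
writer.writerows(result)
csv_out = buf.getvalue()

# JSON export (open, web- and Vega-friendly).
json_out = json.dumps(result)

print(csv_out.splitlines()[0])
print(json_out)
```

If a tool can emit both of these, its output survives outside the tool itself, which is the point of the checklist item.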

bkramer commented 7 years ago

Hi,

You have inadvertently cc'd me (Brian Kramer brian@mitchkram.com) on this thread. 

-b


npscience commented 7 years ago

What data analysis tools do life scientists use?

Source data: 101innovations DOI
Authors: B. Kramer, J. Bosman
What is this? 2015-2016 survey of tools used in the scholarly workflow

Aim: Identify the common tools used for data analysis by life scientists, to inform the workflow map (mentioned above).

Methods:

Results:

npscience commented 7 years ago

Side comments: Plot.ly is not easy to use. And slow.

HKLondon commented 7 years ago

6 high level workflows for publishing data in more traditional journals: https://docs.google.com/spreadsheets/d/1A_cFRnN6_j5bpUAFqv6Fb14xyJAtjKSg_jf-qCfriiY/edit?usp=sharing

npscience commented 7 years ago

@HKLondon Great! Here's the url for the workflow diagram [WIP] https://www.draw.io/#G0B_a2JekZMrW8Z0xRc3hoR1Baczg

HKLondon commented 7 years ago

Whilst most academic publishers can link to data, it seems very few can (easily) publish interactive datasets like the OECD: http://stats.oecd.org/index.aspx?DataSetCode=PDB_LV or integrate interactive figures within the HTML versions of articles (for many more examples, including interactive PDFs, see https://peerj.com/preprints/1594.pdf).

Some examples:

3D visualization (Elsevier)
Animated figures (Interactions)
Interactive graphic (F1000)
Interactive figure (Nature Chemistry)
Publisher produced interactive infographics (BMJ)
Crystallography figures (Journal of Applied Crystallography)
Interactive plots (Elsevier)

Might be interesting to survey publishers to find out what the stumbling blocks are: publishers slow to change, few researchers wanting to publish interactive items, the complexity of managing these items through the submission process, tagging issues in article XML/JATS files, problems with platform integrations, long-term archiving issues (including problems with submission of files to PubMed Central), etc.

pherterich commented 7 years ago

Remembered this a bit too late, but there was an RDA working group on publishing workflows; it might be a bit too generic compared to what you're interested in, though. http://doi.org/10.5281/zenodo.20308

npscience commented 7 years ago

Outstanding:

- [x] create interactive visualisation of the 'analysis tools that life scientists use' data

Files needed are at https://github.com/npscience/open-research-doathon

--> this is happening in my repo at: https://github.com/npscience/open-research-doathon/issues/2

Notes: really difficult for a novice to start using any of the above tools for visualisation....

rossmounce commented 7 years ago

@npscience your comments about plotly befuddle me!

In my experience plot.ly was great to go from data to interactive, configurable visualisations with rapidity. Especially for data layouts I wasn't familiar with e.g. choropleth maps. Ultimately I didn't find it quite had capability to do all the complicated fiddling necessary for "publication quality" figures - I had to dive back into R and do it 'the hard way'.

But for quick, interactive, exploratory data analysis I still find plotly very easy to use - definitely here to stay in my playbook.

npscience commented 7 years ago

@rossmounce noted, the more opinions the better, so thanks for chiming in. I think there's a huge gap in our literacy here; but I'm on the upward learning curve.

rossmounce commented 7 years ago

@npscience having said all that, I haven't tried Tableau, so maybe Tableau or other such services are even better than plot.ly. But from the standpoint of a user with experience of spreadsheet software, R and plot.ly (admittedly limited experience of the wide breadth of available options!), I can definitely see that plot.ly and web services like it have a niche / use-case. If R is one's base reference (as is the case for many biologists?), almost anything else is going to be "easier" and "quicker"!