ropensci / unconf17

Website for 2017 rOpenSci Unconf
http://unconf17.ropensci.org

Provenance Tooling #23

Closed · MilesMcBain closed this issue 7 years ago

MilesMcBain commented 7 years ago

I see this has been discussed a little at previous unconfs, and I fell down a bit of a rabbit hole of data diffing.

I'm motivated to start a discussion on this based on my own recent experience, where a team with the best of reproducibility intentions still ended up in a mighty tangle of dataset versions, modelling-result versions, and the matching up of the two.

Based on my own experience, the major things I would like out of a provenance tool are:

What about the rest of the group: what are your key features, and which are the most critical?

For more context, I've been collecting some thoughts about my dream tool in this repository: https://github.com/MilesMcBain/journalr/blob/master/Journalling_tool_proposal.Rmd

jsta commented 7 years ago

Great idea to use Dat! I would love to see greater adoption.

Apart from Dat, do you think it would be very difficult to support remote dataset validation across a variety of remote repository/datastore types? I wonder what ideas, if any, could be ported/incorporated from the datastorr package (https://github.com/ropenscilabs/datastorr).
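As a rough illustration of what repository-agnostic validation might look like: whichever store a file came from, you can check it against a recorded checksum after download. The function name and workflow below are hypothetical, using only base R's `tools::md5sum()`:

```r
# Sketch: validate a downloaded dataset against a recorded checksum,
# independent of which remote store it was fetched from.
# validate_dataset() is an illustrative name, not a real API.
validate_dataset <- function(path, expected_md5) {
  observed <- unname(tools::md5sum(path))
  if (!identical(observed, expected_md5)) {
    stop("Checksum mismatch: expected ", expected_md5,
         " but got ", observed)
  }
  invisible(TRUE)
}

# Usage: after fetching data.csv from any remote store
# validate_dataset("data.csv", recorded_md5)
```

The hard part a real tool would have to solve is where the expected checksum lives and how it is versioned alongside the data, which is where something like datastorr's release-based API could come in.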

noamross commented 7 years ago

This relates a bit to my thoughts on regularly built testing outputs in https://github.com/ropensci/unconf17/issues/5#issuecomment-288221510 . I'd think the provenance reports would be built and published in the same place.

MilesMcBain commented 7 years ago

@jsta I've seen datastorr, but I've not had data small enough to use it. A lot of thought has gone into the design of its API, though, and that could be useful here.

@noamross Yes, 'testing' that includes data validation and model diagnostics is very much something I want.
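A minimal sketch of what "tests for data" could mean in practice: validation checks on a data frame that run alongside ordinary unit tests. The column names and rules here are purely illustrative:

```r
# Sketch: data validation expressed as assertions, so it can run in
# the same automated pipeline as code tests. Columns are illustrative.
validate_frame <- function(df) {
  stopifnot(
    is.data.frame(df),
    !anyNA(df$id),          # identifiers must be complete
    all(df$weight > 0)      # domain rule: weights are positive
  )
  invisible(TRUE)
}
```

A provenance report could then record which validation rules a given dataset version passed, tying the "testing" and "provenance" threads of this discussion together.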

noamross commented 7 years ago

I would be curious if anyone has had experience using http://empiricalci.com/ and if it has solved any of these issues.

noamross commented 7 years ago

Just saw a talk on another such system, Pachyderm. The short version: it stores both datasets and analysis steps/scripts in versioned environments, and stores outputs with the provenance of the unique combination of data and script versions. Seems like it may make sense for the large-data case, maybe not for smaller projects.
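To make the idea concrete: the essence of that model is keying each output by the hashes of the exact data and script versions that produced it. A toy sketch in R (all names illustrative, not Pachyderm's actual API):

```r
# Sketch: an output record carrying the provenance of the unique
# combination of data version and script version that produced it.
provenance_record <- function(data_path, script_path, output) {
  list(
    data_md5   = unname(tools::md5sum(data_path)),
    script_md5 = unname(tools::md5sum(script_path)),
    created_at = Sys.time(),
    output     = output
  )
}

# Two records with identical data_md5 and script_md5 came from the same
# pipeline state; any change in either hash signals new provenance.
```

The appeal is that reproducibility questions ("which data and which code made this result?") reduce to comparing hashes rather than trusting file names or folder conventions.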

bzkrouse commented 7 years ago

Cool idea! I'm really interested in the concept of analysis record keeping, and would love a way to better organize the "story" of a team-based effort. I work on collaborative projects that take many paths over long periods of time, and it would be great to have a system that helps track and report the different decisions and attempted analyses as a project evolves. It would also be nice to have some sort of tagging system or metadata collection that can be integrated into existing workflows.

Something that may be related to your idea and worth exploring is the repo package.

MilesMcBain commented 7 years ago

Yes @bzkrouse, regarding analysis record keeping: this has to be the facet with the most potential. Anthony Goldbloom's talk from UseR2016 had a big influence on my thinking about the kinds of meta-analysis you can do when you have good records of an analysis evolving over time.

MilesMcBain commented 7 years ago

This issue is broad in scope and has not coalesced into anything tractable for the Unconf, so I am going to close it now and direct my effort to other issues.

I am still very interested in the two main things we discussed here: Data validation and analysis metadata capture. I look forward to discussing these with the Unconf crowd.