ropensci / ozunconf17

Website for 2017 rOpenSci Ozunconf
http://ozunconf17.ropensci.org/

best practice to reproducibly capture manual input #11

Open rgayler opened 6 years ago

rgayler commented 6 years ago

This is probably a component of this other issue.

Sometimes input to a process is necessarily manual. For example, I tend to use an augmented data dictionary to drive processing. I will add columns that indicate how I want each of the variables to be processed and often the contents of the columns are just results of my subjective choices.

I currently tend to have a version of the data dictionary per "iteration" of my thinking so that there is a record of each iteration and each iteration is consistent and can be re-run.
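
For concreteness, a minimal sketch of the kind of thing I mean (the file names and dictionary columns here are just illustrative, not my actual setup):

```r
library(dplyr)
library(readr)

# Each "iteration" of the data dictionary is a separate, versioned file.
dict <- read_csv("data_dictionary_v03.csv")
# illustrative columns: variable, keep (TRUE/FALSE), transform ("log", "factor", NA)

raw <- read_csv("raw_data.csv")

# Drop variables the dictionary flags for exclusion.
dat <- raw %>% select(all_of(dict$variable[dict$keep]))

# Apply the transform recorded for each retained variable.
for (i in seq_len(nrow(dict))) {
  v <- dict$variable[i]
  if (!dict$keep[i] || is.na(dict$transform[i])) next
  dat[[v]] <- switch(dict$transform[i],
                     log    = log(dat[[v]]),
                     factor = as.factor(dat[[v]]),
                     dat[[v]])
}
```

The point is that the subjective choices live in the dictionary file, so re-running the script against a given dictionary version reproduces that iteration exactly.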

I would be interested to discuss other/better ways of doing this.

njtierney commented 6 years ago

This might be related to Miles's Data analysis journal concept

rgayler commented 6 years ago

@njtierney

This might be related to Miles's Data analysis journal concept

I think somewhat related. The closest part of Miles' proposal seems to be the "journalled analysis". The analyses are manually chosen and the journalling captures some metadata about the analyses.

I was thinking more in terms of explicitly represented data. For example, I like to create data dictionaries with columns that drive some aspects of the data prep and analysis. I tend to use a spreadsheet as the data dictionary, because it provides an OK interface for editing the data dictionaries, and just keep multiple versions of the data dictionaries as they evolve.

Is it better to have some completely external process for modifying the data dictionaries and just use version control to track the changes, or would it be better to have some sort of explicit representation of the actions on the data dictionary that can be replayed for reproducibility?
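
To make the second option concrete, something like this is what I have in mind (a rough sketch; all the file, column, and function names are made up): each manual decision is recorded as a row in an edit log, and the working dictionary is always rebuilt by replaying that log over a machine-generated base dictionary.

```r
library(readr)

# Keep everything as character for simplicity of the sketch.
base_dict <- read_csv("data_dictionary_base.csv",  col_types = cols(.default = "c"))
edit_log  <- read_csv("data_dictionary_edits.csv", col_types = cols(.default = "c"))
# illustrative edit_log columns: variable, field, value, reason, date
#   "age",      "transform", "log", "heavily skewed",  "2017-10-10"
#   "postcode", "keep",      "no",  "too many levels", "2017-10-12"

# Replay the edits in order to reconstruct the current dictionary.
replay_edits <- function(dict, log) {
  for (i in seq_len(nrow(log))) {
    dict[dict$variable == log$variable[i], log$field[i]] <- log$value[i]
  }
  dict
}

current_dict <- replay_edits(base_dict, edit_log)
```

That way the record of *why* each choice was made travels with the choice itself, and version control only has to track the base dictionary and the edit log.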

njtierney commented 6 years ago

Along the lines of version control for data, there is the dat project, which @rdpeng might be able to elaborate on.

rdpeng commented 6 years ago

My recollection of the dat project is a bit different from what it actually is: dat seems to be primarily focused on data distribution via a peer-to-peer protocol.

rdpeng commented 6 years ago

Also, there was a long discussion of data diff/versioning in the 2015 rOpenSci Unconference (with no real resolution, IIRC): https://github.com/ropensci/unconf15/issues/19

rgayler commented 6 years ago

Also, I think schema evolution is an issue. I didn't see this mentioned in my very quick look at the dat documentation.


timchurches commented 6 years ago

@rdpeng

Also, there was a long discussion of data diff/versioning in the 2015 rOpenSci Unconference (with no real resolution, IIRC): ropensci/unconf15#19

That discussion from March 2015 predates the release of git LFS v1.0 in October 2015 and its subsequent maturation. I wonder if it would be worthwhile revisiting the data versioning, provenance, diffing, distribution and syncing issues in light of git LFS, since it integrates so nicely with git but permits a wide range of back-end storage options for the large data blobs (which can be encrypted, something that matters to me as a health researcher dealing with confidential patient data).