swcarpentry / DEPRECATED-bc

DEPRECATED: This repository is now frozen - please see individual lesson repositories.

What should we teach about provenance? #429

Open gvwilson opened 10 years ago

gvwilson commented 10 years ago

It used to be easy: when we taught version control with Subversion, we told people that if they put:

$Revision:$

in a file, and set the file's properties correctly, Subversion would automatically update that string every time the file was changed so that it read:

$Revision: 123$

(or whatever the revision number was). This worked in pretty much any text file, so they could get the version control system to keep track of files' provenance for them. In particular, you could do this in a program (I'll show it in Python, but it works in any language):

my_version = "$Revision: 123$"

def main(args):
    version = my_version.strip("$").split()[1]
    print "Results produced by my_program version", version
    do_calculations_and_print_output()

But now we're using Git, and that doesn't work, because Git identifies files using hashes of their contents, and if you modify a string in a file, its hash changes, and if that happens during a commit, it can rupture the spacetime continuum. @jiffyclub wrote a blog post a while back about a workaround, but it's Python-specific, and a bit clumsy compared to the old SVN way of doing things. What can/should we teach people about using the version control system to do these kinds of things?

synapticarbors commented 10 years ago

I haven't actually used it, but there is a nice-looking package called Sumatra that is designed to do provenance tracking, recording information from the VCS in a database when you run a program, script, etc. through its interface. I'm not advocating that we teach or recommend this particular implementation, but in case others haven't seen it, it may be useful.

However, taking a step back, even if Git added revision info to the actual file, like SVN, we would really need to talk about some simple form of logging. If the print statements are just going to stdout, then even if you have access to that information, there is no record tying the results to the run. Given revision/version information, one option is to talk about actually recording provenance information in the results file. That's great if you have control of the results format and file type. What happens if you're using Python or bash to script calling some 3rd-party program or module that writes its own files? Then you need to build your own logging infrastructure.

Maybe we can't teach an actual implementation, but we can speak generally about strategies and things to consider when developing an analysis or data generation pipeline.
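To make the "simple form of logging" idea above concrete, here is a minimal sketch (the function name and log format are made up, not from any particular tool): it appends the command line and the current commit hash to a run log, so each set of results can be tied back to a code state.

```python
import logging
import subprocess
import sys

def log_run(logfile, repo_dir="."):
    """Append this run's command line and commit hash to `logfile`."""
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir).strip().decode("ascii")
    logger = logging.getLogger("provenance")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(logfile)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    try:
        logger.info("command: %s", " ".join(sys.argv))
        logger.info("git commit: %s", commit)
    finally:
        logger.removeHandler(handler)
        handler.close()
```

The same log file could accumulate any other run metadata (parameters, input checksums, package versions) that the pipeline needs.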

karthik commented 10 years ago

A related note: I'm working on an R package called git2r (will probably rename the package at some point) that does exactly that. It allows a git hash to be embedded directly into a document as it's parsed (in this case through R). With the R + IPython integration now live, it should be easy to use elsewhere too.

https://github.com/ropensci/git2r

If you're one of the cool kids writing in markdown, you can throw this in the footnotes or the acknowledgements (`r markdown_link()`, bounded by single backticks) to get this:

> markdown_link()
[1] "[58c507](https://github.com/ropensci/git2r/commit/58c5075082d35e63312cf27425283e8e203b69c6)"

jiffyclub commented 10 years ago

It's also possible to get the HEAD commit hash by looking at files in the .git directory in Git repos. For example:

> cat .git/HEAD
ref: refs/heads/master
> cat .git/refs/heads/master
29c9fc8cfd5dc151dacf7c1e769d3b23eda549b5

That may not be something we want to try to explain to our learners, but it could lead to some more language-independent ideas.

stain commented 10 years ago

With git you can call git rev-parse HEAD to find the current commit. In many ways this is better than the SVN $Revision$ hack, as you will get the revision of the tree, not just that particular file.

Integrated in a python script, you can print out your own revision using subprocess.call (or retrieve it using subprocess.check_output from Python 2.7).

Example of a script that prints its own revision:

#!/usr/bin/env python

import sys, os, os.path, subprocess

def print_provenance():
    cwd = os.getcwd()
    print "directory:", cwd
    print "command:", ' '.join(sys.argv)
    # Temporarily change to the script's own directory, in case it
    # was invoked from a different working directory
    os.chdir(os.path.dirname(sys.argv[0]) or ".")
    try:
        print "git commit: ",
        sys.stdout.flush() # Ensure the label goes out before git's output
        subprocess.call(["git", "rev-parse", "HEAD"])
    finally:
        os.chdir(cwd) # back to the original directory

def main():
    print_provenance()
    # TODO: Do the actual work

if __name__ == "__main__":
    main()

Example use (invoked from outside the git working tree):

-bash-4.1$ 727/hello.py hello there
directory: /home/ssoiland
command: 727/hello.py hello there
git commit: 7911dfb753a72bd829df0d8c9dcb11a3c13e417f

You can simplify the git call if you assume that you run within the git tree; there is no need to do the directory change then. So, tutorial-wise, you could begin on the command line, then introduce subprocess, then add the directory handling.

If you use Python 2.7 or later you can make a function that uses subprocess.check_output() and .strip() to return the revision number and avoid the stdout.flush; this also makes it easier to carry the revision to places other than stdout.
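That check_output variant might look like this (a sketch only; the function name is made up, and passing `cwd=` to subprocess replaces the chdir dance entirely):

```python
import subprocess

def git_revision(path="."):
    """Return the current commit hash of the repository containing `path`."""
    out = subprocess.check_output(["git", "rev-parse", "HEAD"], cwd=path)
    return out.strip().decode("ascii")
```

Because the hash comes back as a value rather than going straight to stdout, it can be written into result files, log entries, or figure metadata as needed.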

abostroem commented 10 years ago

For a project I'm working on, we create calibration reference files which are distributed to the community. When we switched from SVN to git, we started creating git tags that tag a given commit with the reference file name and when it was made available to the community; we also put this tag in the header of the file.

jkitzes commented 10 years ago

+1 to teaching logging instead of embedding the revision number in the output. This information always seemed to me to be a characteristic of the entire output of a command (documents + aux files + figures etc.), not of a particular file/document, and thus it seems that it should live in a log file associated with all of the output. I can see at least two practical advantages as well:

  1. In a log file we can extract an arbitrary amount of information (as much as a pip freeze if we really wanted to) to record the exact state when we ran the command. Revision numbers are limited to, well, revision numbers.
  2. Teaching logging is a two-for-one in this case, since logging is useful for logging other information in addition to recording version numbers.

Either way, I think extracting the git commit hash is the right way to go, but we need to include a flag that indicates (at a minimum) whether the repo was clean at the time of the command. Otherwise the output won't match the file state at the recorded commit hash if there have been changes since the last commit.
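One way to get such a cleanliness flag (a sketch; the function name is made up) is `git status --porcelain`, whose output is empty exactly when nothing is modified, staged, or untracked:

```python
import subprocess

def repo_is_clean(path="."):
    """True when `git status --porcelain` reports nothing to commit."""
    out = subprocess.check_output(["git", "status", "--porcelain"], cwd=path)
    return out.strip() == b""
```

Recording the commit hash together with this flag makes it explicit whether the output really corresponds to the recorded commit.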

rgaiacs commented 10 years ago

@gvwilson Please correct me if I'm misunderstanding your question. You want the following behavior:

$ cat sample
Revision: 123
$ git commit -am 'Some changes'
$ cat sample
Revision: 124
$ git status
On branch master
nothing to commit, working directory clean

I believe that you can accomplish this behavior using a git hook. I can't write a proof of concept for it right now.
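For what it's worth, the core of such a post-commit hook might look like the sketch below (all names are made up). Note one caveat: because the file is rewritten after the commit, the working tree ends up dirty again, so getting the clean `git status` shown above would need something extra (e.g. a smudge/clean filter).

```python
import os
import re
import subprocess

def update_revision(repo_dir, filename):
    """Rewrite 'Revision: N' in `filename` to the repo's commit count.

    Intended to be called from a .git/hooks/post-commit hook.
    """
    count = subprocess.check_output(
        ["git", "rev-list", "--count", "HEAD"], cwd=repo_dir
    ).strip().decode("ascii")
    path = os.path.join(repo_dir, filename)
    with open(path) as f:
        text = f.read()
    with open(path, "w") as f:
        f.write(re.sub(r"Revision: \d+", "Revision: " + count, text))
```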

karthik commented 10 years ago

@jkitzes Nice, thanks for that heads up. I will modify the function to add a flag (or warning) in case the repo is not clean.

jkitzes commented 10 years ago

@karthik happy to help. On that note, here's another issue that I came across when I was trying to solve this same problem a few years ago (which I never did, as I couldn't think of a good solution to the issues below).

This problem arises if there is an actual substitution occurring in a file that's under version control itself (like a LaTeX source file), which is then compiled into some output that's supposed to have the version number/hash. In this case, the commit has to happen before the substitution (so that the hash is available), but once the substitution happens, the repo will be dirty again with the only change being to the revision text (since the substitution happened after the commit). This means that you effectively "re-dirty" your repo with every commit.

Question: how does Subversion deal with this? (Perhaps it calculates the revision number first, updates the files, then commits everything including the updated numbers.) AFAIK there's no way around this given how the git hashes are generated.

karthik commented 10 years ago

Excellent question, @jkitzes I've been thinking about it too for some time now. But thankfully grant writing, admin work, and other things keep me from focusing on any problem for too long. ;) Will update once I figure out a solution.

jiffyclub commented 10 years ago

Our solutions so far work for code by retrieving the repo state at runtime and embedding it in products (which may or may not be tracked). To do similar for something like LaTeX one would probably need to turn to a tool like dexy that allows you to run code and embed the output in your document source as part of the compilation process.

jkitzes commented 10 years ago

@jiffyclub, so when the repo state is embedded in a tracked product during a run, doesn't this dirty the repo again?

jiffyclub commented 10 years ago

What you want to avoid is having repo state embedded in the product generators so that it's impossible to log the state of the source. So long as the repo state is only embedded in end products I think you avoid the paradox. (But that raises the question of why you're versioning end products in the first place.)

rbeagrie commented 10 years ago

@stain You can call git rev-parse from outside the repository by specifying the --git-dir option. For example:

git --git-dir=/path/to/repository/.git rev-parse HEAD

Of course, this requires knowing the location of the .git directory ahead of time. If the file in question is in the top-level directory of the repository, you can do:

git_directory = os.path.join(os.path.dirname(sys.argv[0]), '.git')

If the file happens to be in a sub-directory though, this won't work. Offhand, I don't know how to get the location of the .git directory for a file at an arbitrary level in a repository, short of walking up the directory tree yourself. This wouldn't be too difficult, but I'm not sure it's necessarily a better solution than just cd'ing to the directory...
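The walk itself is only a few lines; a sketch (`find_git_dir` is a made-up name). Alternatively, `git rev-parse --show-toplevel`, run from anywhere inside the work tree, will report the repository's top-level directory for you:

```python
import os

def find_git_dir(start):
    """Walk up from `start` until a .git directory is found; None at the root."""
    path = os.path.abspath(start)
    while True:
        candidate = os.path.join(path, ".git")
        if os.path.isdir(candidate):
            return candidate
        parent = os.path.dirname(path)
        if parent == path:  # hit the filesystem root without finding .git
            return None
        path = parent
```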

FWIW, when I tackled this issue switching from svn to git, I found it a lot easier to log the current commit hash (plus clean or dirty state) of the repository in the program output than to store it directly in the program itself.

DamienIrving commented 10 years ago

Every time I run a script, a time stamp like the following is either added to the 'history' in the global attributes of the output file (if I'm dealing with a self-describing format like netCDF) or to a '.met' file in the case of a figure or other output that isn't self-describing:

Wed Apr 09 14:19:26 2014: python plot_map.py infile.nc outfile.png --imsize 10 (Git hash: d24ae9c)

This is generated as follows:

import os
import sys
from datetime import datetime
from git import Repo  # GitPython
REPO_DIR = os.path.join(os.environ['HOME'], 'path', 'to', 'repo')
MODULE_HASH = Repo(REPO_DIR).head.commit.hexsha
time_stamp = "%s: python %s (Git hash: %s)" % (datetime.now().strftime("%a %b %d %H:%M:%S %Y"), " ".join(sys.argv), MODULE_HASH[0:7])

The only thing I'm not sure about with this solution is how to get python to tell me which installation of python I used on my machine (i.e. instead of just writing 'python' in the time stamp, I'd like it to say if I used /usr/bin/python or /usr/bin/anaconda/bin/python and perhaps also the version of that python installation).

rbeagrie commented 10 years ago

I just found the script I used to call at the top of all my bash jobs. I got out of the habit of using this - should start again...

#!/bin/sh

echo "Checking git repositories"
echo `alias`
for dir in "$@"
do
    branch_sync=`git --git-dir $dir/.git --work-tree=$dir status | grep -P -o "(?<=# Your branch is )(ahead|behind).*by [0-9]* (commits|commit)"`
    hash=`git --git-dir $dir/.git log --pretty=format:'%h' -n 1`

    if [[ `git --git-dir $dir/.git --work-tree=$dir diff HEAD --cached --abbrev=40 --full-index --raw` != "" ]];
    then 
        echo "$dir is dirty, changes staged for commit";  

    elif [[ `git --git-dir $dir/.git --work-tree=$dir diff` != "" ]];
    then 
        echo "$dir is dirty, changes not staged for commit";  

    elif [[ `git --git-dir $dir/.git --work-tree=$dir ls-files --other --exclude-standard` != "" ]];
    then 
        echo "$dir is dirty, untracked files present";  

    else
        echo "--> $dir - $hash - CLEAN"
    fi

    if [[ $branch_sync != "" ]];
    then 
        echo " ($branch_sync)"
    fi

    echo ""
done

It's invoked with a list of repositories (in my case, they all live in /home/rab11/scripts, so I do ./script_status.sh /home/rab11/scripts/*).

Output looks like this:

Checking git repositories

--> /home/rab11/scripts/cufflinks - b06653d - CLEAN

/home/rab11/scripts/dotfiles is dirty, changes not staged for commit

--> /home/rab11/scripts/fusion-analysis - 8b74757 - CLEAN

--> /home/rab11/scripts/rna-chip-comparison - 385786c - CLEAN

/home/rab11/scripts/RnaChipSplicing is dirty, changes not staged for commit

/home/rab11/scripts/seqgi is dirty, changes not staged for commit
 (ahead of 'origin/master' by 5 commits)

rbeagrie commented 10 years ago

@DamienIrving I like the idea of doing this for python scripts - might copy a bit of your code for myself! The path to the current python executable is in sys.executable, and the python version is in sys.version - hope that helps!

jamespjh commented 10 years ago

My recommendation would be to include the git revision number in the executable or distribution during a build step, through a template file. So I would teach this when teaching scons/cmake/setuptools.

Actually changing the source-files themselves on every commit has turned out to cause merge headaches, a quick bit of research suggests, so isn't recommended any more, now that people are branching and merging more casually.
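As a sketch of that build-step idea (names made up): generate a small, untracked module at build time instead of editing tracked sources, then import it at run time.

```python
import subprocess

def write_version_module(out_path, repo_dir="."):
    """Write the current commit hash into a generated (untracked) module."""
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir
    ).strip().decode("ascii")
    with open(out_path, "w") as f:
        f.write('git_revision = "%s"\n' % commit)
```

Because the generated module never goes into version control, the commit hash lands in the build product without any tracked file changing on each commit.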

gvwilson commented 10 years ago

> Actually changing the source-files themselves on every commit has turned out to cause merge headaches, a quick bit of research suggests...

Can you please add a couple of links to what you found?

jamespjh commented 10 years ago

From http://stackoverflow.com/questions/7016300/git-revision-number-in-source-code-documentation:

> Suggestion: don't do it. Keyword expansion could be done with a gitattribute filter, as presented in "Git equivalent of subversion's $URL$ keyword expansion", but this introduces metadata within the data, which usually makes merges much more complex than they actually are. You can see in this (lengthy) answer on "What are the basic clearcase concepts every developer should know?" the whole debate on "Embedded Version Numbers - Good or Evil?". Unless you have a good merge manager in order to ignore those special values, you get a "Merge Hell". And with Git, as detailed in "How does Git solve the merging problem?", the merge is quite basic. No fancy merge manager.

The answer referenced above is http://stackoverflow.com/questions/645008/what-are-the-basic-clearcase-concepts-every-developer-should-know/645424#645424 and contains:

> **Early ClearCase Experiment with Triggers - and Merge Hell**
> Our project migrated later, when there was already some experience with CC. Partly because I insisted on it, we embedded version control information in the source files via a check-in trigger. This lasted a while - but only a while - because, as VonC states, it leads to merge hell. The trouble is that if a version with the tag /main/branch1/N is merged with /main/M from the common base version /main/B, the extracted versions of the files contain a single line which has edits in each version - a conflict. And that conflict has to be resolved manually, rather than being handled automatically.

blahah commented 10 years ago

It's not at all clear to me why it's useful to have a revision count inside a source file. The revision count is just a bad headache from svn.

Version control should be cleanly separated from code imho - let git keep track of provenance, which it does perfectly. Don't pollute code with metadata.

The only exception is the version number for the whole program, which is not the same as the sub-minor version revision. Users should only be seeing tagged releases, not using development code.

I think it's especially inadvisable to use a git hook or script to insert the metadata. The hook or script is very unlikely to follow the code everywhere it goes, which will lead to the metadata in the code becoming disconnected from the true provenance.

It's easy to see what modifications were made to a file with git log <filename>. If there is some reason why you need a revision count for a file, just use git shortlog <filename> to get it by user, or git rev-list HEAD --count <filename> to get a total.

jdblischak commented 10 years ago

Here is my simple solution for recording the provenance of my data analyses in R. I think it could be included in an R bootcamp as long as literate programming with knitr is also covered (though I'm not sure it will work on Windows).

I organize my project so that I have separate subdirectories for the code and data, which are separate git repositories. I run all my scripts from within the code subdirectory. At the top of each R Markdown file, I have the following lines:

Code version: `r system("git log -1 --format=oneline | cut -d' ' -f1", intern = TRUE)`

Data version: `r system("git --git-dir=../data/.git log -1 --format=oneline | cut -d' ' -f1", intern = TRUE)`

This prints the commit hash for the code and the data at the top of each analysis, which is similar to what others have suggested above. I post the rendered html of the analysis to my electronic science notebook and/or distribute it to collaborators. And if I or someone else wants an old version of a figure created, I can easily roll back both repositories. Also, I report the version of R and any loaded packages used in the session at the end of each file using sessionInfo.

This doesn't solve the more complicated problems of determining if the repo is clean or finding the path to the script's git directory if it isn't already known, but it is a start that should be accessible to learners after two days of learning R, shell, and git.

jamespjh commented 10 years ago

+1 for @jdblischak's solution. The thing that is useful is stamping the report with the revision, not stamping the code with the revision.

cranmer commented 10 years ago

Is the scope of this discussion just git-related provenance tracking? In terms of teaching provenance in a swcarpentry setting, I would also just try to give an overview of the concept. This discussion, while very informative, seems fairly narrow in the broader scope of tracking provenance. I imagine that in several situations people's workflows have components that are not easily under their control or not in git.

Potentially a good reference for an overview and some exposure to the broader issue of provenance is this http://www.w3.org/TR/2013/NOTE-prov-primer-20130430/#intuitive-overview-of-prov

mr-c commented 10 years ago

At https://github.com/ged-lab/khmer I use https://github.com/warner/python-versioneer + git tags to report an accurate version: either "v1.0" for a release, or "v0.8-105-g259a2a5" for the 105th commit after v0.8, at commit 259a2a5 (the "g" prefix marks it as a git hash).
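Those version strings are what `git describe --tags` produces under the hood; a minimal way to pick one up at run time (a sketch; the function name is made up):

```python
import subprocess

def describe_version(path="."):
    """Return `git describe --tags`: the tag name on a tagged commit,
    or <tag>-<N>-g<hash> for N commits past the nearest tag."""
    out = subprocess.check_output(["git", "describe", "--tags"], cwd=path)
    return out.strip().decode("ascii")
```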

DamienIrving commented 10 years ago

@rbeagrie Thanks! That makes for a pretty nice solution in Python...

import os
import sys
from datetime import datetime
from git import Repo  # GitPython
REPO_DIR = os.path.join(os.environ['HOME'], 'path', 'to', 'repo')
MODULE_HASH = Repo(REPO_DIR).head.commit.hexsha
time_stamp = "%s: %s %s (Git hash: %s)" % (datetime.now().strftime("%a %b %d %H:%M:%S %Y"), sys.executable, " ".join(sys.argv), MODULE_HASH[0:7])

stain commented 10 years ago

On 10 Apr 2014 16:50, "Kyle Cranmer" notifications@github.com wrote:

> Potentially a good reference for an overview and some exposure to the broader issue of provenance is this http://www.w3.org/TR/2013/NOTE-prov-primer-20130430/#intuitive-overview-of-prov

I'm glad someone found it useful!

Here are some slides I have about provenance:

http://www.slideshare.net/soilandreyes/20130321-what-can-provenance-do-for-me

(See pptx version for animations)

http://practicalprovenance.wordpress.com for more

Agreed, showing the general principles of provenance is a good thing. Tools consuming PROV are still in their early days, but you can at least make a diagram.. ;)