mggg / GerryChain

Use MCMC to analyze districting plans and gerrymanders

https://mggg.github.io/GerryChain/

Reproducible runs #255

Closed mrzv closed 5 years ago

mrzv commented 5 years ago

It would be useful in lots of situations to be able to have reproducible runs of the chain. This issue is to collect what needs to be done to have this:

[ ] Fixing/accessing/saving seed in the MarkovChain (#252)
[ ] Dictionaries/sets depend on hashing order
[ ] Storing version of GerryChain and all dependencies

What else?

Solving this might take care of #191.

mrzv commented 5 years ago

The dictionaries depending on hash order could be solved by replacing:

for k in d:
    print(k)

with:

for k in sorted(d):
    print(k)

zschutzman commented 5 years ago

There's also collections.OrderedDict, which has all the methods of a dictionary but remembers the order in which keys were added. OrderedDict.items() returns things in that same order every time, which solves this problem when the keys are not sortable, like graph vertices or pointers.

There's also an orderedset package on PyPI (https://pypi.org/project/orderedset/) which does the same thing for sets.
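To make the OrderedDict point concrete, here is a minimal sketch. The Node class is a hypothetical stand-in for a hashable but unsortable label; it is not anything from GerryChain or networkx.

```python
from collections import OrderedDict


class Node:
    """Hypothetical label type: hashable, but with no ordering defined."""

    def __init__(self, name):
        self.name = name


nodes = [Node("c"), Node("a"), Node("b")]

# sorted(nodes) would raise TypeError, since Node defines no "<".
# An OrderedDict keyed by the nodes still iterates in insertion order.
assignment = OrderedDict((node, i % 2) for i, node in enumerate(nodes))

visit_order = [node.name for node in assignment]
print(visit_order)  # ['c', 'a', 'b']
```

The iteration order here depends only on the order of insertion, so it is only as reproducible as the code that builds the dictionary.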

zschutzman commented 5 years ago

The general default behavior for networkx is to return dictionary views of collections of nodes. Nodes are usually labelled with either integers or strings, but can be labelled with anything hashable, including non-sortable things. So sorting the keys that networkx returns will typically be fine, but not when the user labels their nodes with some incomparable datatype.
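A rough sketch of what that failure mode looks like and one possible fallback. stable_order is a hypothetical helper, not part of networkx or GerryChain, and the fallback is only an assumption about what an acceptable ordering might be.

```python
def stable_order(labels):
    """Return labels in a deterministic order.

    Tries a plain sort first. If the labels are not mutually comparable,
    falls back to sorting by (type name, repr). The fallback is only
    deterministic when repr() is itself stable across runs, which rules
    out the default object repr (it embeds a memory address).
    """
    labels = list(labels)
    try:
        return sorted(labels)
    except TypeError:
        return sorted(labels, key=lambda x: (type(x).__name__, repr(x)))


int_labels = stable_order({3, 1, 2})
mixed_labels = stable_order({1, "a", (0, 1)})
print(int_labels)    # [1, 2, 3]
print(mixed_labels)  # [1, 'a', (0, 1)]
```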

maxhully commented 5 years ago

For the third bullet point, we might be more successful giving documentation/instructions on how to use tools like conda or pipenv for reproducible environments. I think versioning all of the user's dependencies might be out of scope for the GerryChain library.
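If a lightweight record is still wanted alongside a run, the standard library can snapshot what is installed. This is only a sketch using importlib.metadata (Python 3.8+); environment_snapshot is a hypothetical helper, not something GerryChain provides, and it does not replace a pinned conda or pipenv environment.

```python
import json
import platform
from importlib import metadata


def environment_snapshot():
    """Map each installed distribution name to its version, sorted by name."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            versions[name] = dist.version
    return {
        "python": platform.python_version(),
        "packages": dict(sorted(versions.items())),
    }


snapshot = environment_snapshot()
# Print just the start of the JSON; the full record could be saved with the run.
print(json.dumps(snapshot, indent=2)[:200])
```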

Storing the random seed in MarkovChain is a great idea, on top of the gerrychain.random stuff already discussed.
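To illustrate the seed-storing idea, here is a toy sketch. SeededWalk is a hypothetical class, not GerryChain's MarkovChain, and it uses the standard library's random module rather than gerrychain.random.

```python
import random


class SeededWalk:
    """Hypothetical stand-in for a chain that remembers its seed."""

    def __init__(self, total_steps, seed=None):
        # Draw a seed if none was given, so it can always be reported later.
        if seed is None:
            seed = random.SystemRandom().randrange(2**32)
        self.seed = seed
        self.total_steps = total_steps
        # Private generator, so other users of the global one cannot interfere.
        self._rng = random.Random(self.seed)

    def __iter__(self):
        state = 0
        for _ in range(self.total_steps):
            state += self._rng.choice([-1, 1])
            yield state


first = SeededWalk(total_steps=10)
replay = SeededWalk(total_steps=10, seed=first.seed)
print(first.seed, list(first) == list(replay))
```

Saving first.seed with the output is what makes the replay possible. Note that iterating the same object a second time continues from the generator's current state instead of restarting.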

I implemented Zach's idea of setting the random seed in our own gerrychain.random in #257 and it seems that this, along with fixing the PYTHONHASHSEED environment variable before running, is enough to make runs repeatable (at least in a small grid example). I added a test case in #257 to demonstrate this.

I think one of the next steps is to add a "How to make GerryChain runs repeatable" page in our docs describing this PYTHONHASHSEED thing and discuss how to make environments reproducible.
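As a quick check of the PYTHONHASHSEED part, something like the following can be used. It is a sketch with the standard library only; the child program and run_child are illustrative and unrelated to the test in #257.

```python
import os
import subprocess
import sys

# Child program: string hashes (and so set iteration order) can vary between
# interpreter launches unless PYTHONHASHSEED is fixed.
CHILD = "print(list({'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'}))"


def run_child(hash_seed):
    """Run CHILD in a fresh interpreter with PYTHONHASHSEED set."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


same_seed_a = run_child(42)
same_seed_b = run_child(42)
print(same_seed_a == same_seed_b)
```

The variable has to be set before the interpreter starts, for example `PYTHONHASHSEED=0 python run.py` from a shell (run.py being whatever script drives the chain). Setting os.environ inside an already running script does not change its hashing.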

zschutzman commented 5 years ago

I flipped through some Python documentation: as of Python 3.7, dict is guaranteed to preserve insertion order. This is not the case for set.
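One consequence is that on 3.7+ a plain dict can stand in for an ordered set. A small sketch (the district labels are made up):

```python
items = ["d3", "d1", "d2", "d1", "d3"]

# dict keys keep first-insertion order, so this deduplicates like set()
# while iterating in the same order on every run.
ordered_unique = list(dict.fromkeys(items))
print(ordered_unique)  # ['d3', 'd1', 'd2']

# Membership tests and set-style operations still work on the keys view.
seen = dict.fromkeys(items)
print("d1" in seen)  # True
print(sorted(seen.keys() & {"d1", "x"}))  # ['d1']
```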

maxhully commented 5 years ago

Yeah, setting the hash seed seems unavoidable.

mrzv commented 5 years ago

For the third bullet point it's worth investigating what's out there. Actually, this is true more broadly. There is a lot of work on reproducible (computational) science, and it makes sense to take advantage of existing tools.

Just as an example of an interesting solution: https://www.reprozip.org/

maxhully commented 5 years ago

I merged in my repeatable runs PR #257. Next I want to add documentation with instructions for repeatable runs and links to existing tools.

maxhully commented 5 years ago

Now that we have documentation too, I'll close this.