Closed mrzv closed 5 years ago
The dependence of dictionary iteration order on hashing could be solved by replacing:

    for k in d:
        print(k)

with

    for k in sorted(d):
        print(k)

There's also `collections.OrderedDict`, which has all the methods of a dictionary but remembers the order in which things were added. `OrderedDict.items()` returns things in that same order every time, which solves this problem when the keys are not sortable, like graph vertices or pointers.
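To make that concrete, here is a minimal sketch of the non-sortable case. The `Vertex` class is a hypothetical stand-in for a node label that is hashable but defines no ordering; it is not part of GerryChain or networkx.

```python
from collections import OrderedDict


class Vertex:
    """Hypothetical node label: hashable (by identity) but not orderable."""

    def __init__(self, name):
        self.name = name


a, b, c = Vertex("a"), Vertex("b"), Vertex("c")

d = OrderedDict()
d[c] = 3
d[a] = 1
d[b] = 2

# Iteration follows insertion order, independent of the keys' hash values.
names = [v.name for v in d]

# sorted(d) is not an option here, because Vertex defines no ordering.
try:
    sorted(d)
    sortable = True
except TypeError:
    sortable = False
```

Here `names` comes out as `['c', 'a', 'b']` on every run, while `sorted(d)` raises `TypeError`.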
There's also an `orderedset` package ([on PyPI here](https://pypi.org/project/orderedset/)) which does the same thing for sets.
The general default behavior for `networkx` is to return dictionary views of collections of nodes. Nodes are usually labelled with either integers or strings, but can be labelled with anything hashable, including non-sortable things. So typically sorting the keys of a `networkx` return value will be fine, but not if the user decides to label their nodes with some kind of incomparable datatype.
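One possible way to handle the incomparable case, sketched in plain Python without networkx: try the natural sort first, and fall back to a key built from the type name and `repr`. The helper name `stable_order` is made up for illustration. The fallback is only deterministic if the labels have a deterministic `repr` (the default `object` repr contains a memory address, so it would not qualify), and it orders numbers as strings, so treat this as a sketch rather than a recommendation.

```python
def stable_order(labels):
    """Return labels in a deterministic order.

    Uses the natural ordering when the labels are mutually comparable,
    otherwise falls back to sorting by (type name, repr).
    """
    labels = list(labels)
    try:
        return sorted(labels)
    except TypeError:
        # Mixed or unorderable labels: group by type, then order by repr.
        return sorted(labels, key=lambda x: (type(x).__name__, repr(x)))


comparable = stable_order({2, 1, 3})
mixed = stable_order({3, "b", 1, "a"})
```

With these inputs, `comparable` is `[1, 2, 3]` and `mixed` is `[1, 3, 'a', 'b']`, regardless of the hash seed.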
For the third bullet point, we might be more successful giving documentation/instructions on how to use tools like conda or pipenv for reproducible environments. I think versioning all of the dependencies a user relies on might be out of scope for the GerryChain library.
Storing the random seed in `MarkovChain` is a great idea, on top of the `gerrychain.random` stuff already discussed.
I implemented Zach's idea of setting the random seed in our own `gerrychain.random` in #257, and it seems that this, along with fixing the `PYTHONHASHSEED` environment variable before running, is enough to make runs repeatable (at least in a small grid example). I added a test case in #257 to demonstrate this.
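For anyone who wants to see the two ingredients in isolation, here is a rough sketch using only the standard library. The stdlib `random` module stands in for `gerrychain.random`, and the child program and the helper `set_order` are made up for illustration. `PYTHONHASHSEED` has to be set before the interpreter starts, which is why the sketch launches fresh subprocesses.

```python
import os
import random
import subprocess
import sys

# Child program: print the iteration order of a set of strings.
CHILD = "print(list({'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'}))"


def set_order(hash_seed):
    """Run CHILD in a fresh interpreter with PYTHONHASHSEED fixed."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Same hash seed -> same set iteration order across interpreter runs.
same_order = set_order(0) == set_order(0)

# Seeding the PRNG makes the random draws themselves repeatable.
random.seed(2018)
first = [random.random() for _ in range(3)]
random.seed(2018)
second = [random.random() for _ in range(3)]
```

From a shell, the equivalent would be something like `PYTHONHASHSEED=0 python run_chain.py`, where `run_chain.py` is a placeholder for your own script.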
I think one of the next steps is to add a "How to make GerryChain runs repeatable" page in our docs describing this `PYTHONHASHSEED` thing and discussing how to make environments reproducible.
I flipped through some Python documentation, and `dict()` maintains insertion order as of Python 3.7. This is not the case for `set()`.
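One consequence, assuming Python 3.7 or later: a plain `dict` can double as an insertion-ordered set when only the keys matter. A small sketch with made-up labels:

```python
# Hypothetical district labels with repeats, in the order they were seen.
seen = ["d3", "d1", "d2", "d1", "d3"]

# dict.fromkeys keeps the first occurrence of each key, in insertion order,
# so this deduplicates without depending on hash order the way set() does.
ordered_unique = list(dict.fromkeys(seen))
```

`ordered_unique` is `['d3', 'd1', 'd2']` on every run, whereas `list(set(seen))` may come out in a different order from one run to the next unless the hash seed is fixed.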
Yeah, setting the hash seed seems unavoidable.
For the third bullet point it's worth investigating what's out there. Actually, this is true more broadly. There is a lot of work on reproducible (computational) science, and it makes sense to take advantage of existing tools.
Just as an example of an interesting solution: https://www.reprozip.org/
I merged in my repeatable runs PR #257. Next I want to add documentation with instructions for repeatable runs and links to existing tools.
Now that we have documentation too, I'll close this.
It would be useful in lots of situations to be able to have reproducible runs of the chain. This issue is to collect what needs to be done to have this:
`MarkovChain` (#252)
What else?
Solving this might take care of #191.