terrapower / armi

An open-source nuclear reactor analysis automation framework that helps design teams increase efficiency and quality
https://terrapower.github.io/armi/
Apache License 2.0
224 stars 87 forks source link

Non-deterministic ordering of printed component attributes #438

Open keckler opened 2 years ago

keckler commented 2 years ago

I am writing ARMI cases programmatically, and then as each case is created, using the writeInputs method to print the blueprints file. I happened to notice that each time the blueprints file is written, the ordering of written component attributes (i.e. material, temperature, dimensions) changes, even when nothing has changed between the cases. Even if I just write the same case twice by executing my scripts twice in a row, the ordering may change.

It would be nice for the ordering to stay the same so that the written input files can be easily diff-ed against each other.

A quick look into ARMI makes me think that this is actually an issue with yamlize, but confirmation of that would be helpful. This seems odd, considering Python now enforces dictionaries to be insertion-ordered. But if it is something within ARMI, my feeling is that it would be beneficial to update, if possible.

youngmit commented 2 years ago

I totally agree that blueprint serialization should be deterministic. even if you dont get a round-trip identical output, the order should decay to some sort of canonical ordering, regardless of what it's based upon. I honestly dont know what's leading to the randomness here. one thing to try just to rule it out would be to try setting your PYTHONHASHSEED environment variable to something to see if that is affecting the order.

john-science commented 2 years ago

Oh, yeah, I much prefer deterministic code. The most obvious solution seems like output keys should be printed in alphabetical and numerical order.

Off the top of my head, I don't know where the problem is, but yamlize seems like an obvious place to start.

keckler commented 2 years ago

The PYTHONHASHSEED thing seems to have worked. If I export PYTHONHASHSEED=666 it becomes repeatable. If I use a different seed, the order changes, but stays repeatable.

Is there an easy way to add a PYTHONHASHSEED to ARMI so that when a user loads their venv the seed is consistent across all ARMI users? Ideally we would be deterministic across users, not just on one machine.

youngmit commented 2 years ago

This is something that we used to do, but stopped when we declared minimum sorry for python 3.6, which switched to deterministic dictionary ordering. However, sets may still be ordered stochastically.... 🤔 Might be that something is getting put into a set. It's that maybe something you are doing in your blueprint generation?

keckler commented 2 years ago

I do not use sets in my scripts that generate the blueprints.

But I did a quick scan through yamlize and found that it uses sets in a number of places:

Just briefly looking at those sets in yamlize, it seems likely to me that that is the source of the randomness.

john-science commented 2 years ago

I have opened a ticket for deterministic ordered in yamlize: https://github.com/SimplyKnownAsG/yamlize/issues/12

And I will probably do the work myself, I'm just not absolutely certain I see all the places changes would need to be made. Yet.

john-science commented 2 years ago

Hey, @keckler, can you send me an example or two of this problem?

You can do it via email, or whatever, so it doesn't have to be cleaned up for public consumption. But I want something to test against. The yamlize project does actually have a lot of tooling for keeping the ordering sound. (Those sets above aren't used in the ordering in the library.) So there are going to be some finer points in yamlize for me to sort out the problem.

Anyway, having working examples to test against would probably help speed up the process.

Thanks!

keckler commented 2 years ago

Yeah, I will send you something via email. Sorry about there not being one in the first place.

john-science commented 2 years ago

I have a unit test that demonstrates this bug:

https://github.com/john-science/armi/commit/b075b5719cc0ef9d555e863fe3c8cbf1ed396a56

Essentially, the test always fails, but it wouldn't if we had deterministic ordering.

In particular, the attributes of the given block are printed in a different random order every time I run the test.

john-science commented 2 years ago

Okay, so this is kind of our problem.

Essentially, Yamlize is really good about deterministically ordering elements before they get written out to the YAML file.

But we go out of our way to edit the items that go into the YAML file on the fly, after Yamlize does all of it's setup and organization. For instance, in all the methods called addDegreeOfFreedom() in suiteBuilder.py:

https://github.com/terrapower/armi/blob/49ae3fd7b1a2934dbf0f642d9429bb4b1c75ab99/armi/cases/suiteBuilder.py#L274-L277

I can say that I have made a TON of YAML files with our code testing this bug, and they are very largely deterministically ordered. You have to get 4 levels down into the data heirachy inside them to start seeing non-deterministic ordering.

I am looking to see if there is a low-time-cost way of solving this.

john-science commented 2 years ago

Okay, after more trial and error, I can't find a great solution to this problem.

I tried to access the inner workings of our three YAML libraries, to fix the problem at it's source: yamlize, 'pyyaml, and 'ruamel. But the problem appears to be that changing the YAML mid-stream is the issue.

I thought I could fix this by forcing the output YAML file to be in alphabetic order:

import yaml
f = open('file_path.yaml', 'r')
doc = yaml.load(f, Loader=yaml.FullLoader)
yaml.dump(doc, sort_keys=True)

or

from ruamel.yaml import YAML
yaml = YAML()
yaml.dump(custom_yaml_sorter(yamlString), sys.stdout)

But both of these features fail pretty hard in ARMI, because of our use of YAML anchors.

I have burnt too much time on this for now, so I'm going to put it down. I think the real solution is to change the order we do things, so the Yamlize object is formed after we make calls to things like addDegreeOfFreedom(), not before.

keckler commented 2 years ago

That is an interesting point about the anchors...

One point that I would bring up is that, as I said above, the PYTHONHASHSEED effectively fixed this for me. Is there a reason we aren't wanting to go that route? As Mitch said above, apparently we used to force the PYTHONHASHSEED.

john-science commented 2 years ago

It's an interesting point.

The major reason I discounted this idea to begin with is that several downstream projects (that import ARMI) don't want PYTHONHASHSEED to be set. For instance, if you are running Monte Carlo simulations, you might/probably want the full power of your pseduo-random-number generators.

Perhaps setting the seed in ARMI wouldn't affect downstream projects. Or perhaps we could make it so. But:

  1. I would need proof we can make that safe.
  2. For those downstream projects that don't want the seed set, the YAMLs would be non-deterministic again.