Testing Refactor - Githubissues

brittonsmith commented 2 months ago

I'm rewriting the PR description now that this is nearing a completed state.

This PR represents a significant refactor of the FluidContainer and the infrastructure for running and testing the example Python scripts. Some of this includes responses to earlier comments made here.

FluidContainer

changed all field names to match the field struct in the primary API. The old names are deprecated and a warning will be printed when using them.
refactored functionality for determining which fields are necessary based on parameter configuration
added deprecation warning for use of "mu" in favor of "mean_molecular_weight"
deprecated calculate_hydrogen_number_density and implemented a generalized calculate_nuclei_density method that accepts an element name
added calculate_cooling_rate method
added finalize_data method to return copies of data arrays with units added (as unyt_arrays)
refactored setup_fluid_container helper function to cover all density fields and added two initial "state" options (neutral and ionized)
minor refactors of evolve_constant_density and evolve_freefall to standardize the data they return

Testing and example script refactor

The original test_examples.py code allowed for minor alterations of the Python examples for testing via changing environment variables. I have replaced this with test_models.py, which allows one to define a grid of parameters and input variables over which to run and produce output for comparison. When the Python examples are run by the tester, it passes sys.argv arguments to denote the specific parameter and input configurations. The Python script then calls get_model_set to get a properly configured chemistry_data object and a dictionary of variables that I then bring into the runtime namespace with calls to globals(). We don't have to do it this way. We can also just explicitly set variables (e.g., metallicity = input_set["metallicity"]). In doing this, I have also altered the yt_grackle.py script to save output so we can test against that as well rather than just verifying that it runs without error.

As mentioned in the previous PR description, I've also added test_local_functions.py to do answer testing on the local functions using a grid of parameters and randomized, reasonable inputs. This both allows us to do explicit answer testing on these and also for others to have a base of stock test answers for doing the same for development with the primary API. This will generate and test from a json file which looks like the following: http://paste.yt-project.org/show/580/

Remaining to-dos

[x] I seem to have introduced some test failures in some of the unit tests that rely on FluidContainer and setup_fluid_container. I'll have a look at those.

Points for discussion

This introduces a lot more answer files to store in the grackle_data_files submodule. I suppose that isn't a problem itself, but I can already see a few complications.

The files are currently named using the indices of the parameter and input sets. If we change the range of parameters and inputs, these could get scrambled and would need to be regenerated. Maybe we could do something like what @mabruzzo suggested previously and give each testing set some sort of string name and save the files using that.
As is clear from the test failures, there seems to be some machine dependence for the answers. I am hesitant to simply loosen the tolerances until they pass. It may be better to move to a system like in enzo-dev where the CI generates its own answers by moving to a "gold standard" commit, then moves forward to the current version for actual comparison. Doing this would require significant modification of the circeci config script, so I'd like to have a discussion here about whether this is the right way to go before I do it.
We don't currently do this, but we could embed the grackle configuration into the yt dataset that is saved to verify the correct parameter set is being compared against. We could also embed some sort of format version number this way and add some assert to the testing.

mabruzzo commented 2 months ago

Hi Britton, this is looking fantastic! I have a lot of thoughts on this, but I'm a little limited on time right this moment.

I wanted to share a few initial thoughts, based on taking a glance at the outputs (I didn't look at your code yet). I'll come back later and add more.

Let me preface this by saying, I definitely don't want to bog down your efforts endless bike-shedding, especially since this is something we could change later without breaking user-facing code. Before going on, I do we think should add a top level json attribute like "format-version" in order to more easily facilitate changes.

Now, I want to pose an open-ended generic question (that we don't necessarily need to answer now): "How much do we care about specifying Grackle's precise state?"

The main motivating point behind this question is that the difference between the code_units instance used to initialize the chemistry_data_storage struct and the instance used when solving chemistry/calculating local derived quantities will slightly alter the results.
- The difference should be very tiny, but it may be important in certain cases
- But it may be important how precisely this format should be able to specify arbitrary states.
I could rephrase the question as: what is the "format's primary purpose?"
1. Is it to specify Grackle's input state? (And then report the results reported by grackle)
  - In this case, we are essentially defining a serialization format.
  - If we are going this route, it might be worth taking a quick peek at #197. That PR proposes a C-level function to try to serialize the precise state of grackle to an hdf5 file for debugging (It also happens to support dumping chemistry_data, code_units and grackle_version to json, but I'm not sure that will help much here).
  - I'm mostly referencing that other PR for your awareness (I think JSON is probably better for this than HDF5 for specifying test-inputs since JSON is human-readable understand)
2. To specify the inputs for well-posed physics question and let the test-implementor worry about mapping the physics to the Grackle's data structures.
Going with case 1 would involve a lot more up front work (and a lot more boiler plate code). It would also mean that we would need to make revisions if we alter Grackle's API. For those reasons it may be worth going with case 2. On the other hand, it is not clear which approach is more maintainable... (especially, if we make a large API change)
I think you could cast this question in terms of white-box vs black-box testing.

Finally, I wanted to share few other quick thoughts. A lot of this is spitballing (so I tried to label how strongly I feel about certain things -- from strong suggestion to an idea):

A couple thoughts on input field data:
- strong suggestion: to simplify the json parsing (especially in C/C++ test code), I think we should always put the data associated with each input field inside a 1D list (even if it is only a single element).
- potential idea: put all field inside a nested group within fields. For clarity, a given "input" json object, would contain nested object containing "density", "x-velocity", etc. (Not sure if this is necessary)
- suggestion: maybe record other information about the fields, including 'grid_dx' (could be omitted when not relevant), 'grid_rank', 'grid_dimension' (to make parsing easy, I would always make the last one hold 3 elements -- the valid elements are specified grid_rank)
- strong suggestion: to support C/C++ unit-tests (and anybody who wants to manually specify one of these json files), it would probably be better to use the grackle_field_data member-names instead of the names we use in FluidContainer (e.g. velocity_x vs x-velocity, e_density vs de, internal_energy vs energy, HII_density vs HII )
  - This would be a lot easier to do in the python layer (during test-generation), rather than requiring the C tests to handle this mapping.
  - This came up in PR #197. In that PR, I implemented part of the logic to handle the mapping (in 1 direction) here.
A couple miscellaneous thoughts:
- idea: rather than specifying redshift, maybe we should be specifying the current_scale_factor since that is what the underlying code uses... (unfortunately, this introduces a little ambiguity with respect to a_units). I think this is tied to the broader question I asked up above
- idea: Maybe we should be naming the scenarios (I don't know what to call the high-level json objects) and tests rather than putting them in lists. This could make it easier for humans to map between the stuff being loaded in code and what is found in the json file... (especially if errors occur).

My response clearly got a little out of hand! As I mentioned above, we definitely don't need to decide on everything now (especially if we label the file-version)

brittonsmith commented 2 months ago

Matthew, many thanks for these thoughts. With regard to your broader question, I think we should care quite a bit about knowing Grackle's state. I don't want to quantify that at the moment, but it seems a large number of the issues users encounter stem from improper units. When this code was all in Enzo, it wasn't such a dangerous proposition to initialize the chemistry with one set of units, throw them away, and then rely on the user to provide new (and likely different) but consistent ones. I'd prefer this discussion to minimally impact the rest of the work here as I think it will be easier to allow the format used here to change as this discussion evolves.

strong suggestion: to simplify the json parsing (especially in C/C++ test code), I think we should always put the data associated with each input field inside a 1D list (even if it is only a single element).

Sure, can do.

potential idea: put all field inside a nested group within fields. For clarity, a given "input" json object, would contain nested object containing "density", "x-velocity", etc. (Not sure if this is necessary)

suggestion: maybe record other information about the fields, including 'grid_dx' (could be omitted when not relevant), 'grid_rank', 'grid_dimension' (to make parsing easy, I would always make the last one hold 3 elements -- the valid elements are specified grid_rank)

I'm agnostic on these as I sort of expected anyone using the json file for their own purposes would have to do some by-hand additional setup as I'm not specifying the complete state, only the ones I deemed minimally necessary. If anything, I was hoping that this approach would validate or falsify the notion that what I have there are in fact the minimal inputs to supply. As well, I was wary of overcomplicating things by trying to classify the various forms of "input" so this could all be initialized in the most general way.

strong suggestion: to support C/C++ unit-tests (and anybody who wants to manually specify one of these json files), it would probably be better to use the grackle_field_data member-names instead of the names we use in FluidContainer (e.g. velocity_x vs x-velocity, e_density vs de, internal_energy vs energy, HII_density vs HII )

You're right on about this. In fact, I think we should be moving to consistent names over all the APIs. I'd be in favor (in not necessarily in this PR) of deprecating the Pygrackle-only names and phasing them out.

idea: rather than specifying redshift, maybe we should be specifying the current_scale_factor since that is what the underlying code uses... (unfortunately, this introduces a little ambiguity with respect to a_units). I think this is tied to the broader question I asked up above

My issue with this is I think user intuition is in redshift-space and not so much scale-factor. That said, the cooling tables themselves are evenly spaced in log(1 + z) so it may make sense to sample randomly from that to cover the data better.

idea: Maybe we should be naming the scenarios (I don't know what to call the high-level json objects) and tests rather than putting them in lists. This could make it easier for humans to map between the stuff being loaded in code and what is found in the json file... (especially if errors occur).

I considered this, but I concluded that the number of different value combinations we might like to test would vastly exceed our naming vocabulary. As well, the values are currently totally random and don't map to a particular scenario. There could be value in testing specific named situations, though.

That's all I've got for now. I'm happy to keep iterating on any of these points, but some I wholeheartedly agree with and will implement. I'll then continue on with something similar for tests of the Python examples. This is the main thing I'd like to have in place for finalizing Gen's PR.

ChristopherBignamini commented 2 months ago

Hi @brittonsmith , thank you very much for this PR. Given my limited knowledge of the Grackle code I'm not in the conditions of discussing the technical details of the proposed solution but if I correctly understand the general approach for dataset generation I think it is exactly what we need for an extensive testing of the code!

brittonsmith commented 2 months ago

@ChristopherBignamini, excellent. A high priority is to get something that works for you, so please do let me know if there are changes to the json format that can be helpful. Following Matthew's suggestions, I have made a few changes. The current version of the file can be see here: http://paste.yt-project.org/show/571/ Please, let me know how this works for you.

mabruzzo commented 1 month ago

I meant to take a look at this, but the day got away from me. A few quick thoughts (I'm happy to expand on it).

The files are currently named using the indices of the parameter and input sets. If we change the range of parameters and inputs, these could get scrambled and would need to be regenerated. Maybe we could do something like what @mabruzzo suggested previously and give each testing set some sort of string name and save the files using that

I like that, but I'm obviously a little biased

As is clear from the test failures, there seems to be some machine dependence for the answers. I am hesitant to simply loosen the tolerances until they pass. It may be better to move to a system like in enzo-dev where the CI generates its own answers by moving to a "gold standard" commit, then moves forward to the current version for actual comparison. Doing this would require significant modification of the circeci config script, so I'd like to have a discussion here about whether this is the right way to go before I do it.

I agree this is probably the way to go. I have my hesitations, but I suspect it's probably the most rigorous solution

With that said, I'm a little optimistic that we could reuse some machinery to make this easier. In particular, if we were able to merge in #208, and took some of the answer-testing machinery from Enzo-E[^1] (namely chunks of the conftest.py and answer_testing.py), I think the minimum amount of circleci machinery to get this working would be relatively simple. We could add a new job that installs dependencies, clones grackle, then simply runs

cd grackle
git tag tip

# build/install the gold-standard (this freshly builds Grackle)
git checkout ${GOLD_STANDARD}
pip install .

# run the gold-standard
cd src/python
pytest tests --answer-store --local-dir=./gold-standard-answers

# cleanup
pip uninstall pygrackle
cd -

# build/install the version of grackle we want to test
git checkout tip
pip install .

# run the tests
cd src/python
pytest tests --local-dir=./gold-standard-answers

(I could be wrong about the precise arguments -- I'm happy to help with that!) Alternatively, we could pull machinery from yt. The command line interface is very similar (I modeled Enzo-E's interface after yt), but I think their implementation is a little more idiomatic for pytest. I'm also happy to take a look at all of this and help.

We don't currently do this, but we could embed the grackle configuration into the yt dataset that is saved to verify the correct parameter set is being compared against. We could also embed some sort of format version number this way and add some assert to the testing.

That certainly would be great! But, would that take much time to implement?

[^1]: I don't think we wouldn't even need to include the license since you and I are the only people who have touched that code.

brittonsmith commented 1 month ago

@mabruzzo, this is now ready for review. Here is a summary of the changes:

incorporate generation of answer test answers from a gold standard into circleci
update documentation of test suite
add test_local_functions answer tests to test "calculate" functions with random fields and store values in json file
replace test_python_examples with test_models to test python example scripts with grids of runtime parameters
update test_code_examples to store/compare answers from code examples using a json file
standardize output of code examples to facilitate above
update FluidContainer to use same field names as main library field struct (i.e., "HI_density" instead of "HI") and deprecate old names
update all python tests and examples for above
refactor setup_fluid_container helper function to standardize python examples

Note: the grackle_data_files submodule temporarily points to my fork as I have remove the old test answer files. When this is ready to do. I propose we do the following. Once you approve the PR, I will merge grackle_data_file PR #10, then update this PR to point back to the main repo with the appropriate changeset. When the tests pass one last time, we merge.

I seem to have introduced a testing issue with the last few changes. I'll fix that now.

brittonsmith commented 1 month ago

Update: the failing test was introduced when I added the testing of the code example results to the test suite. This happened because I now import from pygrackle within test_code_examples.py (we didn't before), but pygrackle can't be built with Grackle compiled with openmp. I've fixed the issue by simply not having circleci run pytest tests/test_code_examples.py and have it just manually run the cxx_omp_example.

mabruzzo commented 1 month ago

Great! I started going through it earlier today and it looks great so far! I'll try to come back to it tomorrow (but probably not before the Grackle call)

brittonsmith commented 1 month ago

Final update: while going through issues, I realize we are in striking distance of closing off Issue #172 (and it's relevant), so I added documentation for installing development dependencies. With this PR, I believe we can also close Issue #171 and Issue #134.

brittonsmith commented 3 weeks ago

@mabruzzo, I believe I've responded to all your comments now. Here's a summary of changes:

removed individual test answer generation variables
added code to get the grackle_data_dir from pygrackle.utilities.data_path. This will work for an editable install. Otherwise, it will tell the user to set it with the GRACKLE_DATA_DIR environment variable. I realize you have other plans for managing data paths, so I didn't want to do anything too sophisticated here.
moved some pertinent paths to python/tests/testing_common.py
moved format version variables to relevant files
added much more explicit output for failed local functions tests

brittonsmith commented 3 weeks ago

Final update: I added some documentation for setting the grackle data path. I'm happy for all of this to be redone if/when you have a new solution.

brittonsmith commented 3 weeks ago

@mabruzzo, this might have been presumptuous, but I've now merged PR #10 and updated the submodule configuration here. I hadn't noticed that you approved some time ago. I'm not sure if you want to re-approve the latest changes, but if things look good you can merge.

mabruzzo commented 3 weeks ago

I'm merging in. Just a heads up: I'm working on a PR, to be posted on Monday that totally updates how we handle the data directory. This is going to totally break any code relying upon the behavior of GRACKLE_DATA_DIR that is introduced in this PR. But I don't think that will be a problem

grackle-project / grackle

Testing Refactor #215

FluidContainer

Testing and example script refactor

Remaining to-dos

Points for discussion