terrapower / armi

An open-source nuclear reactor analysis automation framework that helps design teams increase efficiency and quality
https://terrapower.github.io/armi/
Apache License 2.0
212 stars 82 forks source link

Isolating jagged arrays for numpy compatibility #1726

Open mgjarrett opened 2 weeks ago

mgjarrett commented 2 weeks ago

NOTE: This PR is in draft. The initial development work is done, but it should not be merged until it has been fully vetted and stress-tested.

What is the change?

Implement a JaggedArray class to interface between our jagged data structures (e.g., pin-level data within a block) and HDF5 database without using numpy to flatten the jagged array.

Why is the change being made?

The current code feeds jagged arrays to numpy and then uses numpy to flatten them for storing in the database and reconstitute them when reading data back from the database. Numpy no longer supports jagged arrays, so we are currently pinned to version 1.23.5 (the latest release is 1.26.4).

The current version of numpy we are using gives a VisibleDeprecationWarning about this:

.\armi\bookkeeping\db\database3.py:1361: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  data = numpy.array(layout.location)[layoutIndicesForType][

This change introduces a JaggedArray class that handles the packing into and unpacking out of 1D data structures for the HDF5 database. This allows us to use a current version of numpy.

There is a discussion about this topic: https://github.com/terrapower/armi/discussions/1669

Also, I found this open ticket (https://github.com/terrapower/armi/issues/246) after having done all of the work in this PR. I didn't realize how much others had already thought about how to work around this. Well, I guess I'm throwing my potential solution in the ring. As was noted many times in that ticket, matching the performance of numpy might be the main challenge. Hopefully, the amount of jagged data we have is small enough that the performance hit isn't too significant.

Notes about numpy compatibility

Using a current version of numpy, if you try to pass jagged data into numpy.array(), you get an error:

>>> a = [1, 2, 3]
>>> b = [4, 5, 6, 7]
>>> c = [0, 0, 1]
>>> d = numpy.array((a, c))
>>> import numpy
>>> d = numpy.array((a, c))
>>> e = numpy.array((b, d))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

You can get numpy to accept it if you pass in dtype=object, but this data type doesn't have the features needed to facilitate flattening the array to store in HDF5, i.e., array.flatten() doesn't do anything. So I don't know whether this would help us at all. I'm not 100% positive, but I think we'd still be in the same position of needing to implement the packing and unpacking routines ourselves.

>>> e = numpy.array((b, d), dtype=object)
>>> print(e)
[list([4, 5, 6, 7]) array([[1, 2, 3],
                           [0, 0, 1]])]
>>> e.flatten()
array([list([4, 5, 6, 7]), array([[1, 2, 3],
                                  [0, 0, 1]])], dtype=object)

Checklist

john-science commented 2 days ago

@mgjarrett I'm really excited about this PR. Holler if you need anything.