I would like to try these ideas out on a fork if that makes more sense, and
merge it later.
Currently, sympl assumes that the arrays inside the state dictionary are
instances of DataArray. While this made sense initially, I'm continually
coming up against performance issues (like https://github.com/mcgibbon/sympl/issues/43).
For instance,
get_numpy_array uses the .transpose() function of DataArray
which is very slow. I wrote an equivalent version of this function
which instead used the numpy version which is ~20-30% faster (and passes all tests).
accessing an attribute like .values or .dims involves multiple function calls since
they are properties which reference other properties internally.
creating a new DataArray has a huge __init__ overhead with all kinds of checks
which are really not necessary in our use case.
These issues really come to the front when writing models which work with a single
column of data, which currently is the major use-case for climt at least.
While it is desirable to keep the DataArray interface, it would be really helpful
downstream if sympl described an API which any array object must implement.
This will require some re-writing of internal code which assumes that the
arrays are DataArrays, but in the end will allow more performant array representations
like unyt to be used seamlessly in sympl components.
This might also require sympl to allow an implementing library to replace functions
like get_numpy_array with custom versions.
In general, it might be good to specify a number of functions that an implementing library
must provide which can replace the logic that currently resides within __call__ of any
sympl component. This will make it easy to add functionality without having to
build custom subclasses of the base sympl components, which is undesirable.
IMO this also makes sense since sympl is a framework, and it need
not be opinionated about what kind of arrays are used, or how the validation
of these arrays and their dimensions is done. sympl could register
callbacks based on the type of the input array formats and use them
for validation and reshaping.
I would like to try these ideas out on a fork if that makes more sense, and merge it later.
Currently, sympl assumes that the arrays inside the state dictionary are instances of
DataArray
. While this made sense initially, I'm continually coming up against performance issues (like https://github.com/mcgibbon/sympl/issues/43).For instance,
get_numpy_array
uses the.transpose()
function ofDataArray
which is very slow. I wrote an equivalent version of this function which instead used thenumpy
version which is ~20-30% faster (and passes all tests).accessing an attribute like
.values
or.dims
involves multiple function calls since they are properties which reference other properties internally.creating a new
DataArray
has a huge__init__
overhead with all kinds of checks which are really not necessary in our use case.These issues really come to the front when writing models which work with a single column of data, which currently is the major use-case for
climt
at least.While it is desirable to keep the
DataArray
interface, it would be really helpful downstream ifsympl
described an API which any array object must implement. This will require some re-writing of internal code which assumes that the arrays areDataArray
s, but in the end will allow more performant array representations likeunyt
to be used seamlessly insympl
components.This might also require
sympl
to allow an implementing library to replace functions likeget_numpy_array
with custom versions.In general, it might be good to specify a number of functions that an implementing library must provide which can replace the logic that currently resides within
__call__
of anysympl
component. This will make it easy to add functionality without having to build custom subclasses of the basesympl
components, which is undesirable.IMO this also makes sense since
sympl
is a framework, and it need not be opinionated about what kind of arrays are used, or how the validation of these arrays and their dimensions is done.sympl
could register callbacks based on the type of the input array formats and use them for validation and reshaping.