simphony / simphony-common

The native implementation of the Simphony cuds objects
BSD 2-Clause "Simplified" License
7 stars 5 forks source link

CUBA keyword default types: Consider changing to native Python instead of NumPy #308

Closed tuopuu closed 7 years ago

tuopuu commented 8 years ago

Question: Are there any compulsory reasons to specifically use NumPy arrays and NumPy data types to store values in the basic CUBA keys? This is defined in simphony/core/keywords.py and type casting of the values is done in simphony.cuds.meta.validation.cast_data_type(). An option would be to use native Python types (int, float, string, etc.) instead of NumPy types.

The reason I'm asking this is that we are planning to use PyYaml functionality to serialize CUDSComponents into a Yaml script file that is easy to read and edit. Using NumPy types, the default emitted Yaml of a CUBA.BOX object would look like this:


---
!!python/object:simphony.cuds.meta.box.Box
_data: !!python/object/new:simphony.cuds.meta.box._RestrictedDataContainer
  dictitems:
    ? !!python/object/apply:simphony.core.cuba.CUBA [129]
    : some_name
    ? !!python/object/apply:simphony.core.cuba.CUBA [134]
    : - !!python/object/apply:numpy.core.multiarray._reconstruct
        args:
        - &id001 !!python/name:numpy.ndarray ''
        - !!python/tuple [0]
        - b
        state: !!python/tuple
        - 1
        - !!python/tuple [3]
        - !!python/object/apply:numpy.dtype
          args: [f8, 0, 1]
          state: !!python/tuple [3, <, null, null, null, -1, -1, 0]
        - false
        - !!binary |
          AAAAAAAA8D8AAAAAAAAAQAAAAAAAAAhA
      - !!python/object/apply:numpy.core.multiarray._reconstruct
        args:
        - *id001
        - !!python/tuple [0]
        - b
        state: !!python/tuple
        - 1
        - !!python/tuple [3]
        - !!python/object/apply:numpy.dtype
          args: [f8, 0, 1]
          state: !!python/tuple [3, <, null, null, null, -1, -1, 0]
        - false
        - !!binary |
          AAAAAAAACMAAAAAAAADwPwAAAAAAAPC/
      - !!python/object/apply:numpy.core.multiarray._reconstruct
        args:
        - *id001
        - !!python/tuple [0]
        - b
        state: !!python/tuple
        - 1
        - !!python/tuple [3]
        - !!python/object/apply:numpy.dtype
          args: [f8, 0, 1]
          state: !!python/tuple [3, <, null, null, null, -1, -1, 0]
        - false
        - !!binary |
          AAAAAAAA8D8AAAAAAADwPwAAAAAAAPA/
    ? !!python/object/apply:simphony.core.cuba.CUBA [164]
    : some_description
_definition: A simple hexahedron box object
_uid: !!python/object:uuid.UUID {int: 269147399971988598877834418230928817611}

However, with the simple change of the data type of CUBA.VECTOR (currently numpy.float64) to native Python list of floats, the same script becomes:


---
!!python/object:simphony.cuds.meta.box.Box
_data: !!python/object/new:simphony.cuds.meta.box._RestrictedDataContainer
  dictitems:
    ? !!python/object/apply:simphony.core.cuba.CUBA [129]
    : some_name
    ? !!python/object/apply:simphony.core.cuba.CUBA [134]
    : [[1.0, 2.0, 3.0], [-3.0, 1.0, -1.0], [1.0, 1.0, 1.0]]
    ? !!python/object/apply:simphony.core.cuba.CUBA [164]
    : some_description
_definition: A simple hexahedron box object
_uid: !!python/object:uuid.UUID {int: 269147399971988598877834418230928817611}

Also, please, note that the Yaml script format shown above is subject to changes due to some other issues regarding numbering of the CUBA keys etc.

@kemattil, @ahashibon, @kitchoi, @mehdisadeghi, @roigcarlo please, I'd like to hear your opinions on this issue.

tuopuu commented 8 years ago

I'd like to add that it's perfectly fine to continue using NumPy for the datasets (Lattice, Particles, Mesh). They are currently stored separately into an HDF5 file.

kitchoi commented 8 years ago

The problem here is regarding the lack of readability for the serialised numpy arrays, and that the output yaml file can be loaded back and the meta classes can be restored, right?

One advantage of numpy array is that it is simple to check data type and shape, even if you have multidimensional arrays (2D, 3D, 4D,...). However, I do think the readability of the output yaml file is important that we can have the default arrays as simple python types (list,...).

Yet I don't think we should prevent the user from using numpy array if they wish to, this should be allowed:

>>> box.vector = array([[1, 2, 3], [3, 2, 1], [0, 1, 0]])
>>> box.vector
array([[1, 2, 3], [3, 2, 1], [0, 1, 0]])

But should we do some minimal data type checking:

>>> box.vector = [[1, 2, 3], [3, 'sssssss', 1], [0, 1, 0]]
# Should raise here
tuopuu commented 8 years ago

This issue is still relevant with the current state of the implementation of CUDS serialization. Numpy arrays transformed into a string format contain word 'array' and extra parentheses, that can't be automatically read back using PyYaml library methods.

tuopuu commented 7 years ago

YAML serialisation was implemented in #322