Proposal for improvements to the IDTxl Data object

daehrlich commented 1 year ago

Problem: An IDTxl Data object internally saves the data samples as a three-dimensional numpy array with dimensions (n_processes, n_samples, n_repetitions). Currently, there is no way of representing more complex processes with 1) higher dimensionality and 2) meta-information such as a continuous/discrete flag that need to be passed through to the estimator layer.

Proposal: Augment the representation of samples in the Data class by replacing the internal numpy array by an nparray subclass that

contains additional meta-data such as dimensionality information and continuous/discrete flags for the individual processes
implements process-wise slicing along the first dimension, i.e. if the data consists of three processes with two dimensions each, data[1:3] should return a container of shape (4, n_samples, n_repetitions) containing the first two processes
Can easily be transformed to a regular numpy array in the estimator layer

Requirements:

Full back-compatibility with existing implementation of algorithms: If the dimensions of all variables are set to 1, the proposed ndarray subclass should behave like a regular numpy array
The numpy subclass objects need to be unpacked to regular ndarrays in the estimators to use regular indexing and slicing on the first axis.
Minimal overhead: The implementation needs to ensure that data is not copied unnecessary and memory views are used whenever possible

Example of expected behaviour

a = NewArray([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]], process_dimensions=(1, 2, 1), continuous=(True, False, True))

# Process-level indexing. This shows how to get only the second process, which is two-dimensional\ a[1]\ --> NewArray[[3, 4, 5], [6, 7, 8]], process_dimensions=(2,), continuous=(False,))

# Process-level slicing\ a[:2]\ --> NewArray[[0, 1, 2], [3, 4, 5], [6, 7, 8]], process_dimensions=(1, 2), continuous=(True, False))

# Regular slicing along second and possibly array third dimension\ a[:, 1:]\ --> NewArray([[1, 2], [4, 5], [7, 8], [10, 11]], process_dimensions=(1, 2, 1), continuous=(True, False, True))

# Exposing the underlying numpy array (view) for estimation a.to_numpy()\ --> np.ndarray([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]])

Issues/Open Questions

numerical operations might return unexpected results when using the overwritten __getitem__ operator. Since algorithms are expected to only use slicing/indexing/reordering but not numerical operations, I suggest these should raise an error unless unpacked to a regular numpy array in the estimation layer

Comments This suggestion might be split into two: Just adding continuous/discrete meta information to the arrays without allowing for multi-dimensional processes should be easily achievable by a transparent plug-in replacement of the internal numpy arrays with no need for conversion in the estimator layer.

Note that this proposal is as of yet not final and subject to discussion.

pwollstadt commented 1 year ago

Hi @daehrlich, in general this sounds like a good idea. Thanks for typing this up. Maybe we should discuss if multidimensional processes are a common enough use case to include them into an updated data class. On the other hand, if the updated version is fully backwards-compatible I don't see why we should not implement it including multidimensional processes. Here I am not sure, I fully understood the proposal. Would the proposed implementation require to convert the data array to a regular numpy array at sections in the code that perform process-level slicing or numerical operations? Or should the new version work seamlessly with the current version of the toolbox?

daehrlich commented 1 year ago

Hi @pwollstadt, I submitted a draft of the data structures to the new branch dev_data_object. I slightly diverted from my original suggestion to subclass ndarrays and now instead propose to use composition instead of inheritance, which has the benefit of not exposing possibly misfunctioning operations (see Issues/Open Questions in original comment).

The proposed Process and ProcessSet objects should behave like a regular numpy array when slicing, hiding the fact that the processes might be multidimensional. A "Process" behaves like a one-dimensional numpy array, but has a hidden first axis that captures the dimensionality of the process, while a "ProcessSet" bundles several such Processes together which can be indexed process-wise on the first axis and allow regular slicing on the other axes. However, if IDTxl algorithms do other operations than slicing and rearranging (which I am currently not aware of but need to verify), then the integration of the new classes is more difficult.

Processes also carry a dictionary of additional properties which may be used in the estimator, like information whether a process is discrete or continuous, or the alphabet

In any case, the Processes and ProcessSets need to be unpacked to regular ndarrays for the estimators. This will require an additional unpacking step that needs to be included in the base Estimator, which is not yet included in the draft.

If you think multidimensional processes are not a priority at the moment but carrying properties for processes remains important (especially continuous/discrete info for mixed estimations), adding an additional attribute to a ndarray subclass might be the way to go as it should be completely backwards-compatible.

pwollstadt / IDTxl

Proposal for improvements to the IDTxl Data object #90