pwollstadt / IDTxl

The Information Dynamics Toolkit xl (IDTxl) is a comprehensive software package for efficient inference of networks and their node dynamics from multivariate time series data using information theory.
http://pwollstadt.github.io/IDTxl/
GNU General Public License v3.0
243 stars 77 forks source link

Proposal for improvements to the IDTxl Data object #90

Open daehrlich opened 1 year ago

daehrlich commented 1 year ago

Problem: An IDTxl Data object internally saves the data samples as a three-dimensional numpy array with dimensions (n_processes, n_samples, n_repetitions). Currently, there is no way of representing more complex processes with 1) higher dimensionality and 2) meta-information such as a continuous/discrete flag that need to be passed through to the estimator layer.

Proposal: Augment the representation of samples in the Data class by replacing the internal numpy array by an nparray subclass that

Requirements:

Example of expected behaviour

a = NewArray([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]], process_dimensions=(1, 2, 1), continuous=(True, False, True))

# Process-level indexing. This shows how to get only the second process, which is two-dimensional\ a[1]\ --> NewArray[[3, 4, 5], [6, 7, 8]], process_dimensions=(2,), continuous=(False,))

# Process-level slicing\ a[:2]\ --> NewArray[[0, 1, 2], [3, 4, 5], [6, 7, 8]], process_dimensions=(1, 2), continuous=(True, False))

# Regular slicing along second and possibly array third dimension\ a[:, 1:]\ --> NewArray([[1, 2], [4, 5], [7, 8], [10, 11]], process_dimensions=(1, 2, 1), continuous=(True, False, True))

# Exposing the underlying numpy array (view) for estimation a.to_numpy()\ --> np.ndarray([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]])

Issues/Open Questions

Comments This suggestion might be split into two: Just adding continuous/discrete meta information to the arrays without allowing for multi-dimensional processes should be easily achievable by a transparent plug-in replacement of the internal numpy arrays with no need for conversion in the estimator layer.

Note that this proposal is as of yet not final and subject to discussion.

pwollstadt commented 1 year ago

Hi @daehrlich, in general this sounds like a good idea. Thanks for typing this up. Maybe we should discuss if multidimensional processes are a common enough use case to include them into an updated data class. On the other hand, if the updated version is fully backwards-compatible I don't see why we should not implement it including multidimensional processes. Here I am not sure, I fully understood the proposal. Would the proposed implementation require to convert the data array to a regular numpy array at sections in the code that perform process-level slicing or numerical operations? Or should the new version work seamlessly with the current version of the toolbox?

daehrlich commented 1 year ago

Hi @pwollstadt, I submitted a draft of the data structures to the new branch dev_data_object. I slightly diverted from my original suggestion to subclass ndarrays and now instead propose to use composition instead of inheritance, which has the benefit of not exposing possibly misfunctioning operations (see Issues/Open Questions in original comment).

The proposed Process and ProcessSet objects should behave like a regular numpy array when slicing, hiding the fact that the processes might be multidimensional. A "Process" behaves like a one-dimensional numpy array, but has a hidden first axis that captures the dimensionality of the process, while a "ProcessSet" bundles several such Processes together which can be indexed process-wise on the first axis and allow regular slicing on the other axes. However, if IDTxl algorithms do other operations than slicing and rearranging (which I am currently not aware of but need to verify), then the integration of the new classes is more difficult.

Processes also carry a dictionary of additional properties which may be used in the estimator, like information whether a process is discrete or continuous, or the alphabet

In any case, the Processes and ProcessSets need to be unpacked to regular ndarrays for the estimators. This will require an additional unpacking step that needs to be included in the base Estimator, which is not yet included in the draft.

If you think multidimensional processes are not a priority at the moment but carrying properties for processes remains important (especially continuous/discrete info for mixed estimations), adding an additional attribute to a ndarray subclass might be the way to go as it should be completely backwards-compatible.