mobilise-d / mobgap

The Mobilise-D algorithm toolbox - Implemented in Python
https://mobgap.readthedocs.io
Apache License 2.0

index_cols may be wrongly defined #164

Closed rouzbeh closed 5 months ago

rouzbeh commented 5 months ago

When generating a dataset from existing data and running the healthy pipeline with something similar to:

from mobgap.consts import GRAV_MS2
from mobgap.data.base import ParticipantMetadata, RecordingMetadata
from mobgap.pipeline import MobilisedMetaPipeline, MobilisedPipelineHealthy
from mobgap.data import GaitDatasetFromData

gait_dataset = GaitDatasetFromData(
    {
        "18721": {
            "accelerometer": acc
        }
    },
    _sampling_rate_hz=frequency,
    _participant_metadata={
        subject_meta.subject_id: ParticipantMetadata(cohort="HA", height_m=2, sensor_height_m=1),
    },
    single_sensor_name="accelerometer",
    _recording_metadata={"18721": RecordingMetadata(
        measurement_condition= "free_living"
    )},
)

results = MobilisedPipelineHealthy().run(gait_dataset)

I get an error:

../../miniconda3/envs/actihealth/lib/python3.10/site-packages/mobgap/pipeline/_mobilised_pipeline.py:363: in run
    participant_metadata = datapoint.participant_metadata
../../miniconda3/envs/actihealth/lib/python3.10/site-packages/mobgap/data/_dataset_from_data.py:128: in participant_metadata
    self.assert_is_single(None, "participant_metadata")
../../miniconda3/envs/actihealth/lib/python3.10/site-packages/tpcp/_dataset.py:388: in assert_is_single
    if not self.is_single(groupby_cols):
../../miniconda3/envs/actihealth/lib/python3.10/site-packages/tpcp/_dataset.py:367: in is_single
    return len(self.groupby(groupby_cols)) == 1
../../miniconda3/envs/actihealth/lib/python3.10/site-packages/tpcp/_dataset.py:227: in groupby
    _ = grouped_ds.grouped_index
../../miniconda3/envs/actihealth/lib/python3.10/site-packages/tpcp/_dataset.py:186: in grouped_index
    groupby_cols = self._get_groupby_columns()
../../miniconda3/envs/actihealth/lib/python3.10/site-packages/tpcp/_dataset.py:198: in _get_groupby_columns
    return self.index.columns.to_list()
../../miniconda3/envs/actihealth/lib/python3.10/site-packages/tpcp/_dataset.py:36: in index
    self.subset_index = self._create_check_index()
../../miniconda3/envs/actihealth/lib/python3.10/site-packages/tpcp/_dataset.py:74: in _create_check_index
    index_1 = self.create_index()
../../miniconda3/envs/actihealth/lib/python3.10/site-packages/mobgap/data/_dataset_from_data.py:160: in create_index
    return pd.DataFrame(cols, columns=index_cols)
../../miniconda3/envs/actihealth/lib/python3.10/site-packages/pandas/core/frame.py:867: in __init__
    mgr = ndarray_to_mgr(
../../miniconda3/envs/actihealth/lib/python3.10/site-packages/pandas/core/internals/construction.py:336: in ndarray_to_mgr
    _check_values_indices_shape_match(values, index, columns)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

values = array([['10821']], dtype=object)
index = RangeIndex(start=0, stop=1, step=1)
columns = Index(['index_0', 'index_1', 'index_2', 'index_3', 'index_4'], dtype='object')

    def _check_values_indices_shape_match(
        values: np.ndarray, index: Index, columns: Index
    ) -> None:
        """
        Check that the shape implied by our axes matches the actual shape of the
        data.
        """
        if values.shape[1] != len(columns) or values.shape[0] != len(index):
            # Could let this raise in Block constructor, but we get a more
            #  helpful exception message this way.
            if values.shape[0] == 0 < len(index):
                raise ValueError("Empty data passed with indices specified.")

            passed = values.shape
            implied = (len(index), len(columns))
>           raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
E           ValueError: Shape of passed values is (1, 1), indices imply (1, 5)

../../miniconda3/envs/actihealth/lib/python3.10/site-packages/pandas/core/internals/construction.py:420: ValueError

I have traced this to the following line, which I think might be wrong.

https://github.com/mobilise-d/mobgap/blob/7ac4cb48689c530542e9d9cfc13f0524c737d902/mobgap/data/_dataset_from_data.py#L159

It seems like this should be changed to:

index_cols = [f"index_{i}" for i in range(len(cols))]

AKuederle commented 5 months ago

Good catch! This happens because your data identifier is only the subject id. In the examples that I tested, we always had a "multi-level" identifier (e.g. cohort/participant id). In that case the identifier is always a tuple, and taking the len of it makes sense. If it is a string, it doesn't. We need to differentiate between the "string" and the "tuple" case.
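
To illustrate the difference: `len` of a string counts characters, while `len` of a tuple counts identifier levels. That is consistent with the five index columns in the traceback for a five-character id. The sketch below shows one way the two cases could be normalised. `infer_index_cols` is a hypothetical helper written for this comment, not the actual mobgap implementation or the eventual fix.

```python
def infer_index_cols(identifiers):
    """Hypothetical helper: derive index rows and column names from dataset keys."""
    # Wrap plain-string ids in a one-element tuple so that len() counts
    # identifier levels instead of characters.
    rows = [(key,) if isinstance(key, str) else tuple(key) for key in identifiers]

    # All identifiers must agree on the number of levels.
    n_levels = {len(row) for row in rows}
    if len(n_levels) != 1:
        raise ValueError("All identifiers must have the same number of levels.")

    index_cols = [f"index_{i}" for i in range(n_levels.pop())]
    return rows, index_cols


print(len("18721"))      # 5: characters in the string
print(len(("18721",)))   # 1: levels in the tuple

rows, index_cols = infer_index_cols(["18721"])
print(rows, index_cols)
```

With this normalisation, a single-string id and a one-element tuple produce the same one-column index.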

For now, there are two workarounds you can use.

Option 1: Wrap the subject id in a one-element tuple:

from mobgap.consts import GRAV_MS2
from mobgap.data.base import ParticipantMetadata, RecordingMetadata
from mobgap.pipeline import MobilisedMetaPipeline, MobilisedPipelineHealthy
from mobgap.data import GaitDatasetFromData

gait_dataset = GaitDatasetFromData(
    {
        ("18721", ): {
            "accelerometer": acc
        }
    },
    _sampling_rate_hz=frequency,
    _participant_metadata={
        ("18721", ): ParticipantMetadata(cohort="HA", height_m=2, sensor_height_m=1),
    },
    single_sensor_name="accelerometer",
    _recording_metadata={("18721", ): RecordingMetadata(
        measurement_condition= "free_living"
    )},
)

results = MobilisedPipelineHealthy().run(gait_dataset)

Option 2: Actively override the index columns:

gait_dataset = GaitDatasetFromData(
    {
        "18721": {
            "accelerometer": acc
        }
    },
    _sampling_rate_hz=frequency,
    _participant_metadata={
        subject_meta.subject_id: ParticipantMetadata(cohort="HA", height_m=2, sensor_height_m=1),
    },
    single_sensor_name="accelerometer",
    _recording_metadata={"18721": RecordingMetadata(
        measurement_condition= "free_living"
    )},
    index_cols=["participant_id"],
)

I also think there is a misunderstanding about how the data should be structured. The dictionary expected for the data is nested: {identifier: {sensor_pos: data_as_df}}. Each DataFrame should then contain both the accelerometer and the gyroscope data.
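
A minimal sketch of that nested layout, using placeholder zeros instead of real recordings. The sensor position name "LowerBack", the column names, the sampling rate, and the sample count are assumptions made for illustration; please check the mobgap documentation for the exact names and units expected.

```python
import numpy as np
import pandas as pd

sampling_rate_hz = 100.0  # hypothetical value
n_samples = 500           # hypothetical recording length

# One DataFrame per sensor position, holding accelerometer AND gyroscope axes.
# The column names below are an assumption, not confirmed in this thread.
imu_columns = ["acc_x", "acc_y", "acc_z", "gyr_x", "gyr_y", "gyr_z"]
imu_df = pd.DataFrame(np.zeros((n_samples, len(imu_columns))), columns=imu_columns)

# Nested layout: {identifier: {sensor_pos: data_as_df}}
data = {
    ("18721",): {
        "LowerBack": imu_df,  # hypothetical sensor position name
    },
}

# Walk the structure to confirm each level is what we expect.
for identifier, sensors in data.items():
    for sensor_pos, df in sensors.items():
        print(identifier, sensor_pos, df.shape)
```

In a layout like this, the sensor position key (here "LowerBack") would be the value passed as single_sensor_name, rather than "accelerometer".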

Let me know if that works, and I will put in a fix for your use case in the next couple of days.