Closed mmcdermott closed 6 months ago
The recent updates across the EventStream project simplify data preprocessing. The outlier detection and normalization configurations are removed and replaced by new attributes for setting data bounds and standardization metrics. Metadata handling is also simplified, and the specific preprocessing model configurations are removed, which streamlines the overall data processing workflow.
Files | Change Summary |
---|---|
EventStream/data/config.py, EventStream/data/dataset_base.py, EventStream/data/dataset_polars.py, EventStream/data/time_dependent_functor.py | Updated to simplify preprocessing configurations and metadata handling; removed old outlier detection and normalization attributes, introduced new standardization metrics. |
configs/README.md, configs/dataset_base.yaml, configs/outlier_detector_config/stddev_cutoff.yaml | Updated YAML configuration files to reflect new data processing settings and removal of old configurations. |
tests/data/test_config.py, tests/data/test_dataset_base.py | Adjusted test structures to align with updated metadata schema and removed preprocessing configurations. |
configs/README.md (3)
Near line 1: Possible spelling mistake found. Context: ...nt Stream GPT package uses a mixture of yaml config files and structured configs. St... --- Near line 2: Possible spelling mistake found. Context: ...rein. Here, we will discuss the default yaml configs. ### Dataset Building Config ... --- Near line 45: Possible missing comma found. Context: ... the `measurements_per_dep_graph_level` option as that varies per dataset and cannot b...
EventStream/data/config.py (15)
117-117: Do not compare types, use `isinstance()` --- 125-125: Do not compare types, use `isinstance()` --- 407-407: Do not compare types, use `isinstance()` --- 409-409: Do not compare types, use `isinstance()` --- 411-411: Do not compare types, use `isinstance()` --- 625-625: Do not compare types, use `isinstance()` --- 631-631: Do not compare types, use `isinstance()` --- 649-649: Do not compare types, use `isinstance()` --- 649-649: Do not compare types, use `isinstance()` --- 905-905: Do not compare types, use `isinstance()` --- 907-907: Do not compare types, use `isinstance()` --- 913-913: Do not compare types, use `isinstance()` --- 1916-1916: Do not compare types, use `isinstance()` --- 1925-1925: Do not compare types, use `isinstance()` --- 1952-1952: Do not compare types, use `isinstance()`
EventStream/data/dataset_polars.py (1)
202-202: Do not compare types, use `isinstance()`
tests/data/test_dataset_base.py (1)
397-397: Do not compare types, use `isinstance()`
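The repeated `isinstance()` finding above can be illustrated with a minimal sketch. The classes `BaseConfig` and `DerivedConfig` and both helper functions are hypothetical and do not come from the EventStream code; they only show why an exact type comparison and `isinstance()` can disagree when subclasses are involved.

```python
class BaseConfig:
    """Hypothetical base class, used only for illustration."""


class DerivedConfig(BaseConfig):
    """Hypothetical subclass of BaseConfig."""


def is_config_by_type(obj) -> bool:
    # Exact type comparison: the pattern the linter flags.
    # It returns False for instances of subclasses.
    return type(obj) == BaseConfig  # noqa: E721


def is_config_by_isinstance(obj) -> bool:
    # isinstance() respects inheritance, so subclasses also match.
    return isinstance(obj, BaseConfig)


derived = DerivedConfig()
print(is_config_by_type(derived))        # exact-type check
print(is_config_by_isinstance(derived))  # inheritance-aware check
```

If an exact match is genuinely intended, `type(obj) is BaseConfig` states that intent explicitly; whether any of the flagged lines need exact matching is not something this summary says.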
configs/dataset_base.yaml (1)
`18-18`: The addition of `center_and_scale: True` aligns with the PR's objective to simplify data standardization methods.
configs/README.md (1)
`13-18`:
> :memo: **NOTE**
> This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [1-45]

Ensure that the documentation accurately reflects the removal of `normalizer_config: standard_scaler` and the addition of `center_and_scale`. This change should be explicitly mentioned to avoid confusion.
EventStream/data/time_dependent_functor.py (1)
`207-224`:
> :memo: **NOTE**
> This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [190-221]

The modifications in the `update_from_prior_timepoint` method to handle new metadata keys (`mean`, `std`, `thresh_large`, `thresh_small`) are correctly implemented. This change supports the PR's goal of updating metadata handling.
tests/data/test_dataset_base.py (1)
`475-478`: The addition of new columns (`mean`, `std`, `thresh_small`, `thresh_large`) to `empty_measurement_metadata` correctly reflects the changes in metadata structure. This update is necessary for the tests to align with the new data handling standards.
tests/data/test_config.py (1)
`186-189`: The addition of new columns (`mean`, `std`, `thresh_small`, `thresh_large`) to the measurement metadata in the test cases is correctly implemented. This ensures that the tests will accurately reflect the new data handling standards.
EventStream/data/dataset_base.py (6)
Line range hint `1-1`: Ensure the module docstring is updated to reflect the removal of dynamic preprocessing model fetching if this was part of the original module's functionality.

---

Line range hint `143-143`: Confirm removal of `_get_preprocessing_model` aligns with the new configuration settings and does not affect other dependent methods or functionalities.

---

Line range hint `143-143`: Ensure that the removal of dynamic preprocessing model fetching does not impact the functionality of `build_event_and_measurement_dfs`, especially in how it handles different schema types.

---

Line range hint `143-143`: Check if the lazy loading mechanisms in `subjects_df`, `events_df`, and `dynamic_measurements_df` are still appropriate without the dynamic preprocessing models.

---

Line range hint `143-143`: Verify that the dataset saving mechanism in the `save` method correctly handles the new configuration without dynamic preprocessing models.

---

Line range hint `143-143`: Confirm that the dataset loading mechanism in the `load` method is compatible with the new simplified preprocessing approach and correctly handles configurations.
EventStream/data/dataset_polars.py (5)
`114-117`: Update to `METADATA_SCHEMA` to include new fields for outlier handling.

---

`412-415`: Updated `get_metadata_schema` method to handle new metadata fields.

---

`1008-1020`: Refined outlier detection logic in `_fit_measurement_metadata` using standard deviation cutoffs.

---

`1181-1189`:
> :memo: **NOTE**
> This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [1184-1201]

Adjusted outlier detection and normalization steps in `_transform_numerical_measurement`.

---

`1859-1869`: Refactored `_denormalize` method to handle denormalization based on mean and standard deviation.
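As a rough illustration of how the metadata fields named in this review (`mean`, `std`, `thresh_small`, `thresh_large`) could fit together, here is a minimal, standard-library-only sketch. It is not the EventStream implementation: the names `MeasurementStats`, `fit_stats`, `transform`, `denormalize`, and the `stddev_cutoff` parameter are hypothetical, and both the bound formula (mean plus or minus a multiple of the standard deviation) and the choice to drop out-of-bounds values are assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Optional, Sequence


@dataclass
class MeasurementStats:
    """Hypothetical container mirroring the four metadata keys in the review."""

    mean: float
    std: float
    thresh_small: float
    thresh_large: float


def fit_stats(values: Sequence[float], stddev_cutoff: float = 3.0) -> MeasurementStats:
    # Assumption: bounds sit `stddev_cutoff` sample standard deviations
    # on either side of the mean. Requires at least two values.
    mu = mean(values)
    sigma = stdev(values)
    return MeasurementStats(
        mean=mu,
        std=sigma,
        thresh_small=mu - stddev_cutoff * sigma,
        thresh_large=mu + stddev_cutoff * sigma,
    )


def transform(value: float, stats: MeasurementStats) -> Optional[float]:
    # Assumption: values outside the bounds are dropped (returned as None).
    if value < stats.thresh_small or value > stats.thresh_large:
        return None
    if stats.std == 0:
        # Constant measurement: centering alone gives 0.0; skip the division.
        return 0.0
    # Center and scale using the fitted mean and standard deviation.
    return (value - stats.mean) / stats.std


def denormalize(z: float, stats: MeasurementStats) -> float:
    # Inverse of the center-and-scale step above.
    return z * stats.std + stats.mean


stats = fit_stats([1.0, 2.0, 3.0, 4.0, 5.0], stddev_cutoff=1.0)
z = transform(4.0, stats)
if z is not None:
    print(denormalize(z, stats))  # recovers the input up to float rounding
```

Storing the fitted statistics alongside the bounds means the inverse transform needs nothing but the metadata row, which is one plausible reason for replacing a fitted preprocessing model object with plain columns.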
Summary by CodeRabbit