mmcdermott / EventStreamGPT

Dataset and modelling infrastructure for modelling "event streams": sequences of continuous-time, multivariate events with complex internal dependencies.
https://eventstreamml.readthedocs.io/en/latest/
MIT License

Removed pre-processors, as only one option was still in use and these will be phased out with MEDS #97

Closed: mmcdermott closed this 6 months ago

mmcdermott commented 7 months ago

Summary by CodeRabbit

coderabbitai[bot] commented 7 months ago

Walkthrough

The recent updates across the EventStream project focus on simplifying and enhancing data preprocessing. Key changes include the removal of the outlier detection and normalization configurations, replaced by new attributes for setting data bounds and standardization metrics. The update also simplifies metadata handling and removes specific preprocessing model configurations, streamlining the overall data processing workflow.

Changes

| Files | Change Summary |
| --- | --- |
| `EventStream/data/config.py`, `EventStream/data/dataset_base.py`, `EventStream/data/dataset_polars.py`, `EventStream/data/time_dependent_functor.py` | Updated to simplify preprocessing configurations and metadata handling; removed old outlier detection and normalization attributes, introduced new standardization metrics. |
| `configs/README.md`, `configs/dataset_base.yaml`, `configs/outlier_detector_config/stddev_cutoff.yaml` | Updated YAML configuration files to reflect new data processing settings and removal of old configurations. |
| `tests/data/test_config.py`, `tests/data/test_dataset_base.py` | Adjusted test structures to align with updated metadata schema and removed preprocessing configurations. |
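
To make the shape of the change concrete, here is a minimal sketch of what the simplified numerical-preprocessing settings might look like. The names `center_and_scale`, `mean`, `std`, `thresh_small`, and `thresh_large` come from this review; the dataclass layout is a hypothetical illustration, not the actual `EventStream/data/config.py` definition.

```python
# Hypothetical sketch of the simplified settings described above; not the
# actual EventStream/data/config.py definitions.
from dataclasses import dataclass


@dataclass
class NumericalPreprocessingSketch:
    # Single flag replacing the removed normalizer/outlier-detector configs.
    center_and_scale: bool = True

    # Fitted statistics now stored directly in the measurement metadata
    # (column names from this review; types are assumed).
    mean: float | None = None
    std: float | None = None
    thresh_small: float | None = None  # lower bound for outlier removal
    thresh_large: float | None = None  # upper bound for outlier removal
```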

Recent Review Details

**Configuration used:** CodeRabbit UI
**Review profile:** CHILL

**Commits**
Files that changed from the base of the PR and between 9eead533a48107e392dad113f3ed4cffd4802b45 and a51e695cba93a6ea4436185ff6a0082e85b3b9a8.

**Files selected for processing (9)**
* EventStream/data/config.py (7 hunks)
* EventStream/data/dataset_base.py (1 hunks)
* EventStream/data/dataset_polars.py (8 hunks)
* EventStream/data/time_dependent_functor.py (2 hunks)
* configs/README.md (1 hunks)
* configs/dataset_base.yaml (2 hunks)
* configs/outlier_detector_config/stddev_cutoff.yaml (1 hunks)
* tests/data/test_config.py (14 hunks)
* tests/data/test_dataset_base.py (1 hunks)

**Files skipped from review due to trivial changes (1)**
* configs/outlier_detector_config/stddev_cutoff.yaml
Additional Context Used

**LanguageTool (3)**

configs/README.md:
* Near line 1: Possible spelling mistake found. Context: "...nt Stream GPT package uses a mixture of yaml config files and structured configs. St..."
* Near line 2: Possible spelling mistake found. Context: "...rein. Here, we will discuss the default yaml configs. ### Dataset Building Config ..."
* Near line 45: Possible missing comma found. Context: "... the `measurements_per_dep_graph_level` option as that varies per dataset and cannot b..."
**Ruff (17)**

All 17 findings flag the same rule: "Do not compare types, use `isinstance()`" (a sketch of the fix follows this list).

EventStream/data/config.py (15):
* Lines 117, 125, 407, 409, 411, 625, 631, 649 (twice), 905, 907, 913, 1916, 1925, 1952

EventStream/data/dataset_polars.py (1):
* Line 202

tests/data/test_dataset_base.py (1):
* Line 397
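
Since all of these findings flag the same pattern, the fix is mechanical. A generic before/after illustration (not code from this repository):

```python
# Flagged pattern vs. preferred fix for "Do not compare types".
def is_string_exact(value) -> bool:
    # Flagged by Ruff: direct type comparison, which also ignores subclasses.
    return type(value) == str


def is_string(value) -> bool:
    # Preferred: isinstance() handles inheritance and reads more clearly.
    return isinstance(value, str)


class MyStr(str):
    pass


print(is_string_exact(MyStr("x")))  # False: exact type comparison misses subclasses
print(is_string(MyStr("x")))        # True
```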
Additional comments not posted (16)
configs/dataset_base.yaml (1)
`18-18`: The addition of `center_and_scale: True` aligns with the PR's objective to simplify data standardization methods.
configs/README.md (1)
`13-18`:

> :memo: **NOTE**
> This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [1-45].

Ensure that the documentation accurately reflects the removal of `normalizer_config: standard_scaler` and the addition of `center_and_scale`. This change should be explicitly mentioned to avoid confusion.
EventStream/data/time_dependent_functor.py (1)
`207-224`:

> :memo: **NOTE**
> This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [190-221].

The modifications in the `update_from_prior_timepoint` method to handle the new metadata keys (`mean`, `std`, `thresh_large`, `thresh_small`) are correctly implemented. This change supports the PR's goal of updating metadata handling.
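
As a rough illustration only: carrying those four keys forward from a prior timepoint might look something like the sketch below. The key names come from this review; the dict-based metadata and the function itself are hypothetical, not the repository's `update_from_prior_timepoint`.

```python
# Hypothetical sketch: propagate fitted statistics from a prior timepoint's
# metadata. Key names are from the review; everything else is assumed.
CARRIED_KEYS = ("mean", "std", "thresh_large", "thresh_small")


def update_from_prior_timepoint_sketch(metadata: dict, prior: dict) -> dict:
    updated = dict(metadata)
    for key in CARRIED_KEYS:
        # Only fill in keys the current metadata does not already set.
        if prior.get(key) is not None:
            updated.setdefault(key, prior[key])
    return updated
```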
tests/data/test_dataset_base.py (1)
`475-478`: The addition of new columns (`mean`, `std`, `thresh_small`, `thresh_large`) to `empty_measurement_metadata` correctly reflects the changes in metadata structure. This update is necessary for the tests to align with the new data handling standards.
tests/data/test_config.py (1)
`186-189`: The addition of new columns (`mean`, `std`, `thresh_small`, `thresh_large`) to the measurement metadata in the test cases is correctly implemented. This ensures that the tests will accurately reflect the new data handling standards.
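
For context, an empty measurement-metadata frame carrying the four new columns could be constructed along these lines (a sketch assuming Polars and `Float64` dtypes; the actual test fixtures may differ):

```python
import polars as pl

# Sketch of an "empty measurement metadata" frame with the four new
# statistics columns named in this review. Dtypes are assumed.
empty_measurement_metadata = pl.DataFrame(
    schema={
        "mean": pl.Float64,
        "std": pl.Float64,
        "thresh_small": pl.Float64,
        "thresh_large": pl.Float64,
    }
)
print(empty_measurement_metadata.shape)  # (0, 4)
```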
EventStream/data/dataset_base.py (6)
* Line range hint `1-1`: Ensure the module docstring is updated to reflect the removal of dynamic preprocessing model fetching, if this was part of the original module's functionality.
* Line range hint `143-143`: Confirm that the removal of `_get_preprocessing_model` aligns with the new configuration settings and does not affect other dependent methods or functionality.
* Line range hint `143-143`: Ensure that the removal of dynamic preprocessing model fetching does not impact the functionality of `build_event_and_measurement_dfs`, especially in how it handles different schema types.
* Line range hint `143-143`: Check whether the lazy loading mechanisms in `subjects_df`, `events_df`, and `dynamic_measurements_df` are still appropriate without the dynamic preprocessing models.
* Line range hint `143-143`: Verify that the dataset saving mechanism in the `save` method correctly handles the new configuration without dynamic preprocessing models.
* Line range hint `143-143`: Confirm that the dataset loading mechanism in the `load` method is compatible with the new simplified preprocessing approach and correctly handles configurations.
EventStream/data/dataset_polars.py (5)
* `114-117`: Updated `METADATA_SCHEMA` to include new fields for outlier handling.
* `412-415`: Updated the `get_metadata_schema` method to handle the new metadata fields.
* `1008-1020`: Refined outlier detection logic in `_fit_measurement_metadata` using standard-deviation cutoffs.
* `1181-1189`:
  > :memo: **NOTE**
  > This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [1184-1201].

  Adjusted outlier detection and normalization steps in `_transform_numerical_measurement`.
* `1859-1869`: Refactored the `_denormalize` method to handle denormalization based on mean and standard deviation.
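
Taken together, those changes describe a center-and-scale pipeline with standard-deviation-based outlier bounds. A minimal Polars sketch of that logic follows; the cutoff formula, the `stddev_cutoff` parameter, and the function names are assumptions based on this review, not the repository's actual `_fit_measurement_metadata`, `_transform_numerical_measurement`, or `_denormalize` code.

```python
import polars as pl


def fit_metadata(df: pl.DataFrame, col: str, stddev_cutoff: float = 1.0) -> dict:
    # Fit standardization metrics plus stddev-based outlier bounds
    # (bound formula assumed: mean +/- stddev_cutoff * std).
    mean, std = df[col].mean(), df[col].std()
    return {
        "mean": mean,
        "std": std,
        "thresh_small": mean - stddev_cutoff * std,
        "thresh_large": mean + stddev_cutoff * std,
    }


def transform(df: pl.DataFrame, col: str, md: dict) -> pl.DataFrame:
    # Null out values beyond the bounds, then center and scale the rest.
    bounded = (
        pl.when(pl.col(col).is_between(md["thresh_small"], md["thresh_large"]))
        .then(pl.col(col))
        .otherwise(None)
    )
    return df.with_columns(((bounded - md["mean"]) / md["std"]).alias(col))


def denormalize(df: pl.DataFrame, col: str, md: dict) -> pl.DataFrame:
    # Invert center-and-scale: x = z * std + mean.
    return df.with_columns((pl.col(col) * md["std"] + md["mean"]).alias(col))


# Toy usage: the extreme value 300.0 falls outside the bounds and is nulled.
df = pl.DataFrame({"hr": [60.0, 72.0, 65.0, 300.0]})
md = fit_metadata(df, "hr")
normalized = transform(df, "hr", md)
restored = denormalize(normalized, "hr", md)
```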