mmcdermott / EventStreamGPT

Dataset and modelling infrastructure for modelling "event streams": sequences of continuous time, multivariate events with complex internal dependencies.
https://eventstreamml.readthedocs.io/en/latest/
MIT License

Support control for how events are aggregated/combined into compound events #111

juancq closed this issue 5 months ago

juancq commented 5 months ago

There are times when I want events falling on the same timestamp to be combined, and other times when I want them kept separate. Suppose I have the following configuration, where admissions are hospital admissions, each procedure has a procedure date, and procedures always occur during hospital admissions and never during emergency visits:

inputs: 
  subjects:
    ...
  admissions:
    input_df: 'admissions.parquet'
    ...
  emergency:
    input_df: 'emergency.parquet'
    ...
  procedures: # these only occur along with admission events
    input_df: 'procedures.parquet'
    ...

measurements: 
  ...
  dynamic:
    multi_label_classification: 
      admissions:
        - facility_type
        - acute_type
        ...
      emergency:
        - separation_code
        - triage_category
        ...
      procedures:
        - procedure_code

Suppose I have an admission event, an emergency event, and a procedure event occurring on the same timestamp. How can I indicate that the admission and the procedure can be merged into a compound event (of type ADMISSION_PROCEDURE) but that emergencies should be kept as a separate event? Ideally, my sequence of events would look like:

event_type           date                 ...  blah
u32                  datetime[μs]         ...  blah
...                  2010-03-15 00:00:00
EMERGENCY            2010-10-20 00:00:00
ADMISSION_PROCEDURE  2010-10-20 00:00:00

mmcdermott commented 5 months ago

Unfortunately, @juancq, within EventStreamGPT all events with exactly the same timestamp are merged into a single event. This is a firm design decision that is not going to change in future versions. However, there is a successor to ESGPT that supports this kind of use case. It will also enable seamless transitions from ESGPT datasets, scaling to much larger datasets, and more automated tooling for downstream use cases (e.g., automatic baseline pipelines). This is the MEDS framework, which is still in progress; you can read more about it here: https://github.com/mmcdermott/MEDS_polars_functions and https://github.com/Medical-Event-Data-Standard/meds

I'm also happy to connect via a call at some point to discuss your use cases and brainstorm ways to work within ESGPT, or how to transition to MEDS most effectively as it matures over the coming months.

I will say that the easy (if graceless) fix for this would be to add a bit of special-case code that adds a microsecond to the timestamps of the events you don't want merged. However, this will also break or diminish some downstream use cases, so I wouldn't necessarily recommend it.
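As a rough illustration of the microsecond workaround, here is a minimal standard-library sketch. It is not ESGPT code: the event tuples, `offset_sources`, and `merge_same_timestamp` are hypothetical names, and the grouping function is only a stand-in for the same-timestamp merging described above.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical raw events as (source, subject_id, timestamp) tuples.
events = [
    ("ADMISSION", 1, datetime(2010, 10, 20)),
    ("PROCEDURE", 1, datetime(2010, 10, 20)),
    ("EMERGENCY", 1, datetime(2010, 10, 20)),
]


def offset_sources(events, sources, offset=timedelta(microseconds=1)):
    """Shift the timestamp of every event whose source is in `sources`."""
    return [
        (src, subj, (ts + offset) if src in sources else ts)
        for src, subj, ts in events
    ]


def merge_same_timestamp(events):
    """Stand-in for merging: group events that share (subject, timestamp)."""
    groups = defaultdict(list)
    for src, subj, ts in events:
        groups[(subj, ts)].append(src)
    # Name each compound event after its sorted member sources.
    return {key: "_".join(sorted(srcs)) for key, srcs in sorted(groups.items())}


# Only EMERGENCY events are nudged, so they no longer collide with admissions.
merged = merge_same_timestamp(offset_sources(events, {"EMERGENCY"}))
for (subject_id, ts), event_type in merged.items():
    print(subject_id, ts.isoformat(), event_type)
```

In this sketch the shifted emergency event sorts after the admission on the same day, and its stored timestamp no longer matches the recorded time, which is one way downstream use cases that rely on exact timestamps could be affected. With a dataframe library, the same shift would be applied to the timestamp column of the relevant input before the dataset is built.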