justin13601 / ACES

ACES: Automatic Cohort Extraction System for Event-Streams
https://eventstreamaces.readthedocs.io/en/latest/
MIT License

What should we do about additional extracted data outside the scope of the MEDS label schema? #97

Open mmcdermott opened 2 months ago

mmcdermott commented 2 months ago

Options are:

  1. Write that info to an additional file.
  2. Write those columns anyways, and not be truly compliant.
  3. See if MEDS can expand the label_schema to include additional columns much as data does.

Oufattole commented 2 months ago

The use case I have in mind for additional data is that users may wish to extract windows of data for contrastive learning tasks. For example, I may wish to extract one window of data before an event (an inpatient admission, say) and one after it. For each window you need a start time and an end time, and that is it, right?

So the label_schema is:

label = pa.schema(
    [
        ("subject_id", pa.int64()),
        ("prediction_time", pa.timestamp("us")),
        ("boolean_value", pa.bool_()),
        ("integer_value", pa.int64()),
        ("float_value", pa.float64()),
        ("categorical_value", pa.string()),
    ]
)

The label schema seems intended for supervised tasks, where all you need is a prediction time and a label, so adding window information to it doesn't seem appropriate. I would advocate for an additional file of window-based data, using a struct per window as you did in v0.3.2 of ACES (I think).

Let's suppose for my example I use this config:

predicates:
    admission:
        code: ADMISSION

trigger: admission

windows:
    pre:
        start: null
        end: trigger
        start_inclusive: True
        end_inclusive: False
    post:
        start: pre.end
        end: null
        start_inclusive: True
        end_inclusive: True
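
To make the inclusivity flags concrete, here is a minimal plain-Python sketch of how this config would partition one subject's events. It assumes that `start: null` and `end: null` mean the window extends to the start and end of the subject's record, and `split_windows` is a hypothetical helper, not part of ACES.

```python
from datetime import datetime


def split_windows(event_times, trigger_time):
    """Split one subject's event timestamps into the `pre` and `post` windows."""
    # pre: [record start, trigger); end_inclusive: False excludes the trigger.
    pre = [t for t in event_times if t < trigger_time]
    # post: [trigger, record end]; start_inclusive: True includes the trigger.
    post = [t for t in event_times if t >= trigger_time]
    return pre, post


events = [
    datetime(2020, 1, 1, 8),
    datetime(2020, 1, 3, 12),  # the ADMISSION event, i.e. the trigger
    datetime(2020, 1, 5, 9),
]
pre, post = split_windows(events, trigger_time=events[1])
```

Under these assumptions the admission event itself lands in `post` and not in `pre`, so the two windows never overlap.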

Maybe you could store each window in one file with an event index, and the file schema would be:

window = pa.schema(
    [
        ("subject_id", pa.int64()),
        ("start_time", pa.timestamp("us")),
        ("end_time", pa.timestamp("us")),
    ]
)

mmcdermott commented 2 months ago

I don't think we need a formal pa schema for this extra information -- we can just use the old output format with the structs, right?
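
One possible reading of this, combined with option 1 above, is to keep the label file restricted to the label_schema columns and pass every other extracted column through untouched to a second file. The sketch below shows that split on plain dictionaries; `split_row`, `JOIN_KEYS`, and the `pre` struct are hypothetical names used only for illustration.

```python
# Column names taken from the label_schema quoted earlier in this thread.
LABEL_COLUMNS = (
    "subject_id",
    "prediction_time",
    "boolean_value",
    "integer_value",
    "float_value",
    "categorical_value",
)
# Assumed join keys, repeated in the extra file so rows can be matched up again.
JOIN_KEYS = ("subject_id", "prediction_time")


def split_row(row):
    """Return (label_part, extra_part) for one extracted row."""
    # Label part: exactly the schema columns, with None for any that are absent.
    label_part = {k: row.get(k) for k in LABEL_COLUMNS}
    # Extra part: the join keys plus every column outside the label schema.
    extra_part = {k: row[k] for k in JOIN_KEYS}
    extra_part.update({k: v for k, v in row.items() if k not in LABEL_COLUMNS})
    return label_part, extra_part


row = {
    "subject_id": 1,
    "prediction_time": "2020-01-03T12:00:00",
    "boolean_value": True,
    "pre": {"start": None, "end": "2020-01-03T12:00:00"},  # hypothetical window struct
}
label_part, extra_part = split_row(row)
```

With this split the extra columns need no formal schema of their own; whatever structs the extraction produces are simply carried along.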