Open mmcdermott opened 2 months ago
The use case I have in mind for additional data is that users may wish to extract windows of data for contrastive learning tasks. For example, I may wish to extract one window of data before an event and one after it (inpatient admissions, say). For each window you only need a start and end time, right?
So the label_schema is:
label = pa.schema(
    [
        ("subject_id", pa.int64()),
        ("prediction_time", pa.timestamp("us")),
        ("boolean_value", pa.bool_()),
        ("integer_value", pa.int64()),
        ("float_value", pa.float64()),
        ("categorical_value", pa.string()),
    ]
)
It seems the intention of the label schema is for supervised tasks where you just need a prediction time and label, so it doesn't seem appropriate to add window information to that. I would advocate for an additional file with window-based data using a struct per window as you used in v0.3.2 of aces (I think).
Let's suppose for my example I use this config:
predicates:
  admission:
    code: ADMISSION
trigger: admission
windows:
  pre:
    start: null
    end: trigger
    start_inclusive: True
    end_inclusive: False
  post:
    start: pre.end
    end: null
    start_inclusive: True
    end_inclusive: True
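To make the intended windows concrete, here is a minimal sketch in plain Python of what this config would select for one subject. It assumes that `start: null` means the subject's first event and `end: null` means their last event; the helper names `in_window` and `windows_for_subject` are hypothetical and are not part of ACES.

```python
from datetime import datetime


def in_window(ts, start, end, start_inclusive, end_inclusive):
    """Return True if ts falls inside the window under the given inclusivity flags."""
    after_start = ts >= start if start_inclusive else ts > start
    before_end = ts <= end if end_inclusive else ts < end
    return after_start and before_end


def windows_for_subject(events, trigger_code="ADMISSION"):
    """Build one pre/post window pair per trigger event.

    events: list of (timestamp, code) tuples for one subject, sorted by time.
    Assumption: a null boundary maps to the first or last event in the record.
    """
    if not events:
        return []
    record_start, record_end = events[0][0], events[-1][0]
    return [
        {
            "trigger_time": ts,
            "pre": {"start_time": record_start, "end_time": ts},
            "post": {"start_time": ts, "end_time": record_end},
        }
        for ts, code in events
        if code == trigger_code
    ]


events = [
    (datetime(2020, 1, 1), "LAB"),
    (datetime(2020, 1, 5), "ADMISSION"),
    (datetime(2020, 1, 9), "DISCHARGE"),
]
windows = windows_for_subject(events)
w = windows[0]

# pre: start inclusive, end exclusive, so the admission itself is left out.
pre_codes = [
    code for ts, code in events
    if in_window(ts, w["pre"]["start_time"], w["pre"]["end_time"], True, False)
]
# post: both ends inclusive, so the admission is the first event in the window.
post_codes = [
    code for ts, code in events
    if in_window(ts, w["post"]["start_time"], w["post"]["end_time"], True, True)
]
```

Each trigger yields exactly one start and end time per window, which is all the extra output would need to carry.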
Maybe you could store each window in one file with an event index, and the file schema would be:
window = pa.schema(
    [
        ("subject_id", pa.int64()),
        ("start_time", pa.timestamp("us")),
        ("end_time", pa.timestamp("us")),
    ]
)
I don't think we need a formal pa schema for this extra information -- we can just use the old output format with the structs, right?
Options are:
Extend label_schema to include additional columns, much as the data schema does.