
MEDS-DEV Benchmark evaluation

> [!CAUTION]
> This is a work-in-progress evaluation package for the MEDS-DEV benchmarking effort.

The meds-evaluation label schema has five mandatory fields:

  1. subject_id: The ID of the subject this prediction is about.
  2. prediction_time: The time at which the prediction is made for this subject.
  3. boolean_value: The ground truth boolean label for the prediction task.
  4. predicted_boolean_value: The predicted boolean label generated by a model.
  5. predicted_boolean_probability: The predicted probability, generated by a model, that the label is positive.

Models, when predicting this boolean_value label, are allowed to use all data about a subject up to and including the prediction_time.

The following pyarrow schema is expected by the meds-evaluation pipeline:

import pyarrow as pa

# Expected schema for the prediction files consumed by the meds-evaluation pipeline.
predicted_labels = pa.schema(
    [
        ("subject_id", pa.int64()),
        ("prediction_time", pa.timestamp("us")),
        ("boolean_value", pa.bool_()),
        ("predicted_boolean_value", pa.bool_()),
        ("predicted_boolean_probability", pa.float64()),
    ]
)

import datetime
from typing import TypedDict

# total=False means the type checker treats every key as optional.
PredictedLabel = TypedDict("PredictedLabel", {
    "subject_id": int,
    "prediction_time": datetime.datetime,
    "boolean_value": bool,
    "predicted_boolean_value": bool,
    "predicted_boolean_probability": float,
}, total=False)
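
For illustration only, a single record of this type might look like the following (the values are made up):

example_prediction: PredictedLabel = {
    "subject_id": 1,
    "prediction_time": datetime.datetime(2024, 1, 1, 12, 0),
    "boolean_value": True,
    "predicted_boolean_value": True,
    "predicted_boolean_probability": 0.91,
}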