Jeanselme opened 3 months ago
I think this is a great idea. More concretely, this could be realized with two parquet files: one with the model predictions/outputs, and one with the patient subgroup assignments at different prediction times, with the `categorical_value` field filled out indicating the subgroup a patient is in.
Nice part: Having a separate file for subgroup assignments means you can generate many different subgroup assignments and rerun your evaluation script changing only the path to the subgroup assignment file. This makes it easy to evaluate many different subgroups.
Complex, not-so-nice part: The model outputs have a `subject_id` and `prediction_time`, and you want the subgroup assignments parquet (which also has `subject_id` and `prediction_time`) to align with this, so you can just do a join on those two columns between the model outputs and the subgroup assignments. This may be challenging to deal with if they don't align. Do you want to do a polars `join_asof` operation (instead of a plain join), so that for each row in the model-outputs parquet you take the most recent prior row in the subgroup-assignments parquet, and enter a null subgroup assignment if there is no prior row for that subject?
Thoughts @Jeanselme @kamilest @mmcdermott @abinithago
Instead of using `join_asof`, we could also modify the ACES task YAML file that generated the task labels and define a subgroup predicate within it, so that you get both the task labels in the `boolean_value` field and the subgroup identities in the `categorical_value` field. With both fields present, we can use a plain join operation to calculate the task metrics for the subgroup of interest.
I would support having separate files, because when we extend the benchmark to support multi-class classification tasks we would need to reimplement a lot of the logic to deal with the overloaded `categorical_value` field.
I don't think we should put subgroup identities in the `categorical_value` field @abinithago -- for one, it sort of violates the assumptions of the schema, and for another, that data is actually independent of the task prediction step and can be stored once per dataset rather than on every extracted task. I think the idea of taking the single model output (predictions on all samples for a given task) and splitting it into separate files per subgroup via some kind of join operation, potentially as @Oufattole proposed, is likely the right way to go here. @kamilest, is that what you were suggesting as well, or did I misunderstand your comment?
I believe we were thinking more about saving the different subgroup identities in a separate file, as you are suggesting. This file would then be an input to the evaluation function that computes the different metrics. I don't think we should split the prediction files, as the computation of some metrics might need access to the different subgroups simultaneously.
For fairness metrics, we might want to pass an additional vector and stratify by it. I believe we can start with the simple group difference (when the given vector contains only two unique values) and all pairwise group differences (when there are more than two groups).
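One hypothetical way that pairwise logic could look (the function name and dict-based interface are illustrative, not an agreed design):

```python
from itertools import combinations

def group_differences(metric_by_group: dict[str, float]) -> dict[tuple[str, str], float]:
    """Absolute metric gaps between groups: a single gap when there are
    two groups, all pairwise gaps when there are more than two."""
    return {
        (a, b): abs(metric_by_group[a] - metric_by_group[b])
        for a, b in combinations(sorted(metric_by_group), 2)
    }

# Two groups -> one difference; three groups -> three pairwise differences.
two = group_differences({"A": 0.90, "B": 0.80})
three = group_differences({"A": 0.90, "B": 0.80, "C": 0.75})
```

The two-group case falls out of the general pairwise formulation for free, so no special-casing is needed.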