Closed: nalzok closed this issue 2 years ago
Hi nalzok,
Thank you for your question. In our paper we wanted to investigate how well models perform on different subgroups, so by assuming the attributes are uniformly distributed, all groups contribute equally to the overall performance. However, as you say, for many applications this is a simplification of what holds in practice (e.g. data for certain subgroups may be harder to collect). Our framework can simulate those evaluation settings by manipulating the validation/test distribution, and further analysis of those results could then determine the quality of the model across different subgroups. We did not explore this, but it would be an interesting study. I hope that is helpful.
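To make the aggregation concrete, here is a minimal sketch (not from the paper; the subgroup accuracies and target weights below are made up purely for illustration) of how uniform weighting compares to evaluating under a different target distribution over subgroups:

```python
import numpy as np

# Hypothetical per-subgroup accuracies (one entry per attribute subgroup).
subgroup_acc = np.array([0.92, 0.85, 0.78, 0.60])

# Uniform weighting: every subgroup contributes equally to the overall score.
uniform_overall = subgroup_acc.mean()

# A non-uniform target distribution over subgroups (made up here) simulates
# an evaluation where some subgroups are rarer than others.
target_dist = np.array([0.50, 0.30, 0.15, 0.05])
reweighted_overall = (target_dist * subgroup_acc).sum()

print(f"uniform:    {uniform_overall:.3f}")
print(f"reweighted: {reweighted_overall:.3f}")
```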
Olivia
Thanks for the response, Olivia. Is there any chance you could share the test accuracy for each subgroup, as opposed to the averaged overall accuracy? With that data we could study arbitrary joint distributions of attributes, e.g. the following setting
Hi, unfortunately we don't have this. However, the evaluation code dumps all the results (along with features and labels) into results.pkl at the end of training, so you can easily compute this for models that you train.
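For example, per-subgroup accuracy could be computed along these lines (a sketch only: the exact contents of results.pkl depend on the codebase, and the keys used below are assumptions; inspect your own dump first):

```python
import pickle
import numpy as np

# Load the dumped results (keys are assumed, not guaranteed).
with open("results.pkl", "rb") as f:
    results = pickle.load(f)

preds = np.asarray(results["predictions"])   # assumed key: model predictions
labels = np.asarray(results["labels"])       # assumed key: ground-truth labels
groups = np.asarray(results["attributes"])   # assumed key: subgroup id per example

# Accuracy within each subgroup, instead of the uniform average.
for g in np.unique(groups):
    mask = groups == g
    acc = (preds[mask] == labels[mask]).mean()
    print(f"subgroup {g}: accuracy = {acc:.3f} (n = {mask.sum()})")
```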
Great. Thank you!
Hi, I wonder why you assume that all attributes are uniformly distributed on the test set. Specifically, you wrote in section 2.2 that
Despite being desirable from a theoretical point of view, this does not seem very realistic. For example, the types of equipment may not be distributed equally across hospitals, the proportion of patients with a tumor is not necessarily 50%, and we are forced to consider pregnant men if "sex" and "pregnancy" are two of the attributes.