This pull request will:
compute the average for each category at run time for a given method and game combination
average these category averages together to create an overall score

The following items are added to acc_dict and f1_dict:
"overall_avg": the average over all probes
"<category>_avg": the average for one category, for example "agent_localization_avg", which is the average accuracy over every state variable classified as an agent_localization state variable
"across_categories_avg": the average of all category averages
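
As a rough illustration of the three kinds of entries described in this PR, here is a minimal sketch, not the code in this pull request. It assumes the score dictionary maps state-variable names to scores and that a separate mapping gives each state variable's category. The function name `add_category_averages`, the `category_of` mapping, and the sample variable and category names other than `agent_localization` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def add_category_averages(score_dict, category_of):
    """Return a copy of score_dict with the summary entries added.

    score_dict:  assumed mapping of state-variable name -> score (accuracy or F1).
    category_of: assumed mapping of state-variable name -> category name.
    """
    # Group the per-variable scores by category.
    by_category = defaultdict(list)
    for name, score in score_dict.items():
        by_category[category_of[name]].append(score)

    result = dict(score_dict)

    # "overall_avg": plain average over every probe, ignoring categories.
    result["overall_avg"] = mean(list(score_dict.values()))

    # "<category>_avg": one entry per category.
    category_avgs = {cat: mean(scores) for cat, scores in by_category.items()}
    for cat, avg in category_avgs.items():
        result[f"{cat}_avg"] = avg

    # "across_categories_avg": average of the category averages, so each
    # category counts equally regardless of how many variables it has.
    result["across_categories_avg"] = mean(list(category_avgs.values()))
    return result


# Hypothetical sample data: two agent_localization variables, one other.
acc_dict = {"agent_x": 1.0, "agent_y": 0.5, "score": 0.25}
category_of = {
    "agent_x": "agent_localization",
    "agent_y": "agent_localization",
    "score": "score_info",
}
summary = add_category_averages(acc_dict, category_of)
```

With this sample data, "overall_avg" is about 0.583 while "across_categories_avg" is 0.5. The two differ whenever categories contain different numbers of state variables, which is presumably why both are reported.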