Previously, graph_task_model.run_one_epoch() computed the epoch loss as (l_1/N_1 + l_2/N_2 + ...) * 1/(N_1 + N_2 + ...), i.e. it summed the per-batch mean losses and divided by the total sample count. The correct weighted average is (l_1/N_1 * N_1 + l_2/N_2 * N_2 + ...) * 1/(N_1 + N_2 + ...), which reduces to (l_1 + l_2 + ...) / (N_1 + N_2 + ...). The bug arises because compute_task_metrics must always return the per-sample (mean) loss, since the gradient is computed on that value.
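A minimal sketch of the two accumulation schemes, assuming l_i is a batch's total loss, N_i its size, and compute_task_metrics returns the per-sample mean l_i/N_i (the function names below are illustrative, not the actual implementation):

```python
def epoch_loss_wrong(mean_losses, batch_sizes):
    # Old behaviour: sum of per-batch mean losses divided by the
    # total sample count -- (l_1/N_1 + l_2/N_2 + ...) / (N_1 + N_2 + ...).
    return sum(mean_losses) / sum(batch_sizes)

def epoch_loss_fixed(mean_losses, batch_sizes):
    # Fixed behaviour: re-weight each per-batch mean by its batch size
    # before normalising -- (l_1/N_1 * N_1 + ...) / (N_1 + N_2 + ...).
    weighted = sum(m * n for m, n in zip(mean_losses, batch_sizes))
    return weighted / sum(batch_sizes)

# Hypothetical per-sample losses (l_i/N_i) and batch sizes (N_i):
mean_losses = [0.5, 0.25]
batch_sizes = [10, 30]
print(epoch_loss_wrong(mean_losses, batch_sizes))  # 0.01875
print(epoch_loss_fixed(mean_losses, batch_sizes))  # 0.3125
```

With unequal batch sizes the two formulas diverge noticeably; they coincide only when every batch has the same size.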