MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.
When computing our benchmark scores, we want to "ignore" runs on a base workload if the submission doesn't hit the target on the held-out workload. This is implemented here: https://github.com/mlcommons/algorithmic-efficiency/blob/c465e252c95521c223530b0523feaa38c6dd06e4/scoring/performance_profile.py#L322-L328

However, `variant_criteria_filter()` only checks for `np.inf` values (https://github.com/mlcommons/algorithmic-efficiency/blob/c465e252c95521c223530b0523feaa38c6dd06e4/scoring/performance_profile.py#L245-L257). Another invalid score that can occur is a `nan`, e.g. when a run goes OOM. In this case, the base workload score should also be ignored.

This PR fixes this issue. To properly do so, it also needs to load the list of held-out workloads (to drop all other workload variants that have only been computed for the baseline).
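A minimal sketch of the idea (function name hypothetical, not the actual code in `performance_profile.py`): treating both `np.inf` and `nan` as invalid can be done with a single `np.isfinite` check instead of comparing against `np.inf` only:

```python
import numpy as np

def is_invalid_score(score: float) -> bool:
    # An inf score means the held-out target was never hit; a nan score
    # occurs e.g. when the run OOMs. Both should cause the corresponding
    # base workload score to be ignored. np.isfinite catches both cases,
    # unlike an equality check against np.inf alone (which misses nan).
    return not np.isfinite(score)
```

For example, `is_invalid_score(np.nan)` and `is_invalid_score(np.inf)` are both `True`, while a regular finite score passes the filter.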