Submission checker: better modularize the code, and have better documentation of the expectation of results

nvzhihanj commented 6 months ago

WIth the increasing number of benchmarks and checks, we have found several issues with the submission checker (https://github.com/mlcommons/inference/blob/master/tools/submission/submission_checker.py):

The file itself is too long (>3700 in loc), which makes it hard to read and interpret
The logic of checking is not very well maintained, making it a bit hard to onboard new submitters
Some corner cases are hard to debug, due to the lack of clear warning/error message and exception handling.

We would like to propose several changes to the submission checker, including but not limited to:

Avoid duplicate code, and cache configs for old submission in separate files (because we don't need backward compatibility)
Code styles and structures improvement. For this @nv-alicheng can provide suggestions on adapting modern Python to the submission checker
Add good documentation and exception/error handling, so debugging for the users is easier.

One concrete example of uncaught error is: if users submit runs lower than 600 seconds, the submission checker will report ERROR in the log and return non-zero, but at the end of the log it will still say "the results look OK".

@pgmpablo157321 FYI. We are discussing how to better improve the submission_checker, and will continue to update in this thread.

arjunsuresh commented 5 months ago

Agree with old configs - when running the submission checker on old submissions it makes sense to use the old release of the submission checker. We probably need to maintain only the latest submission and the upcoming submission compatibility in the submission checker. This itself will reduce the submission checker size considerably and this PR does a first iteration for this.

This PR adds a GitHub action for the submission checker changes and runs it against the v4.0 results repository - once this is merged we can safely do any change in the submission checker.

One concrete example of uncaught error is: if users submit runs lower than 600 seconds, the submission checker will report ERROR in the log and return non-zero, but at the end of the log it will still say "the results look OK".

This issue should already be fixed in the master branch by this PR

nv-alicheng commented 5 months ago

Proposal for Submission Checker Refactor

MODEL_CONFIG is well over 1000 lines long, and is stored as an enormous nested dictionary. It is also extremely hard to figure out what the schema of this dictionary is without reading the contents.

Proposal:

MODEL_CONFIG should be stored in separate directories (each corresponding to an MLPINF version), each with a common file structure

Proposed structure:

schema.py
constants.py  # Maybe contains scenarios enum?
vX.Y/
    config.py
    models.py

MODEL_CONFIG should have a definitive schema that is easily understandable. The simplest way to do this I believe is with Python dataclasses
MODEL_CONFIG must be flexible enough to allow for small variations that may occur between MLPINF submission rounds
Proposed design:

schema.py

@dataclass
class MLPInferenceSubmission:
    version: str
    models: List[Model]  # Alternatively could be a Dict[str, Model]
    seeds: Seeds
    audit_test_seeds: Dict[AuditTest, Seeds] 
    ignored_errors: List[str]

@dataclass
class Model:
    name: str
    aliases: List[str]
    required_in: Dict[SystemCategory, List[Scenario]]
    optional_in: Dict[SystemCategory, List[Scenario]]
    accuracy_metric: AccuracyMetric  # Defined in constants.py, has regex field
    accuracy_target: float
    performance_sample_count: int
    scenario_configs: Dict[Scenario, ScenarioConfig]  # With this, it is easy to do things like override the result field for Llama2 since you can use a different string.
    audit_tests: List[AuditTest]

@dataclass
class ScenarioConfig:
    latency_constraint: int = 0
    min_query_count: int = 1
    result_field: ResultField = None

@dataclass
class Seeds:
    qsl: int
    sample_index: int
    schedule: int

@dataclass
class ResultField:
    id_: str
    raw_string: str

Example: v4.0:

models.py:

from ..schema import *
from ..scenarios import *

ResNet50 = Model(
    name="resnet",
    aliases=["resnet50"],
    required_in={SystemCategory.Datacenter: [Scenario.Offline, Scenario.Server],
                 SystemCategory.Edge: [Scenario.Offline, Scenario.MultiStream, Scenario.SingleStream]},
    optional_in=None,  # Can be omitted if dataclass definition uses purely kwargs
    accuracy_metric=AccuracyMetric.Accuracy,
    accuracy_target=76.46 * 0.99,
    performance_sample_count=1024,
    latency_constraints={Scenario.Server: 15000000},
    min_query_counts={Scenario.Offline: 1,
                      Scenario.Server: 270336,
                      Scenario.SingleStream: 1024,
                      Scenario.MultiStream: 270336},
    scenario_configs={
        Scenario.Offline: ScenarioConfig(min_query_count=1,
                                         result_field=ResultField("result_samples_per_second", "Samples per second")),
        Scenario.Server: ScenarioConfig(latency_constraint=15000000,
                                        min_query_count=270336,
                                        result_field=ResultField("result_scheduled_samples_per_sec", "Scheduled samples per second")),
        Scenario.SingleStream: ...},
    audit_tests=[AuditTest.Test01, AuditTest.Test04, AuditTest.Test05]
)

... etc for all models

config.py:

from ..schema import *
from ..scenarios import *
from .models import *

v4_0_SubmissionConfig = MLPInferenceSubmission(
    version="v4.0",
    models=[ResNet50, Retinanet, ...],
    seeds=Seeds(13281865557512327830, 198141574272810017, 7575108116881280410),
    audit_test_seeds={AuditTest.Test05: Seeds(2376919268182438552, 11176391829184272374, 3911940905271271337)},
    ignored_errors=[])

There is an enormous amount of checks and conditions that need to be satisfied, and the loop is quite hard to read and interpret given its size. I don't have a definite proposal for this since the code for actually checking is quite long and I haven't fully groked through the whole thing, but it can probably be made much simpler and readable by splitting each "phase" of the checker into individual files, and making each individual fail condition its own function (kind of like a PyTest file structure).

For example:

performance_dir_check.py:

class PerformanceDirCheck(Checker):
    def fail_if_missing_files(self, ctx):
        ...
    def fail_if_contains_loadgen_error(self, ctx):
        ...
    def fail_if_insufficient_performance_sample_count(self, ctx):
        ...

Where Checker is a class that has a method that runs all member methods that begin with "failif", and ctx contains the necessary information about the workload being checked.

mrmhodak commented 4 months ago

@pgmpablo157321 to have a look

arjunsuresh commented 4 months ago

PR removing the backward compatibility of the submission checker: https://github.com/mlcommons/inference/pull/1677

nv-ananjappa commented 4 months ago

We should get back to this issue after https://github.com/mlcommons/inference/pull/1677 is merged. That PR would simplify the submission checker a lot.

pgmpablo157321 commented 1 day ago

@nv-ananjappa @nvzhihanj @arjunsuresh @mrmhodak Proposal: Main modules:

Package checker: check that all the require files are there, check for forbidden files
Log checkers:
- Performance checker: check the content of the performance logs
- Accuracy checker: check the content of the performance logs
- Compliance checker: check the compliance tests
Result summarizer: Compute the summary of the results

Helper modules:

Log Parser
Configuration parser

Additional items:

Configuration files

arjunsuresh commented 1 day ago

From chatgpt :)

To modularize this script, I’ll refactor the file into logical components, grouping functions, constants, and class definitions into separate modules. Here's how the code can be broken down:

Folder Structure:

submission_checker/
│
├── __init__.py
├── main.py
├── config.py
├── utils.py
├── constants.py
├── checkers/
│   ├── __init__.py
│   ├── accuracy.py
│   ├── performance.py
│   ├── power.py
│   ├── system_desc.py
│   └── compliance.py

1. `config.py`

This module will contain the Config class that handles configuration for different versions of MLPerf and submission types.

# config.py

from .constants import MODEL_CONFIG

class Config:
    """Select config value by mlperf version and submission type."""

    def __init__(self, version, extra_model_benchmark_map, ignore_uncommited=False, skip_power_check=False):
        self.base = MODEL_CONFIG.get(version)
        self.extra_model_benchmark_map = extra_model_benchmark_map
        self.version = version
        self.models = self.base["models"]
        self.seeds = self.base["seeds"]
        self.test05_seeds = self.base["test05_seeds"]
        self.accuracy_target = self.base["accuracy-target"]
        self.accuracy_delta_perc = self.base["accuracy-delta-perc"]
        self.accuracy_upper_limit = self.base.get("accuracy-upper-limit", {})
        self.performance_sample_count = self.base["performance-sample-count"]
        self.latency_constraint = self.base.get("latency-constraint", {})
        self.min_queries = self.base.get("min-queries", {})
        self.required = None
        self.optional = None
        self.ignore_uncommited = ignore_uncommited
        self.skip_power_check = skip_power_check

    def set_type(self, submission_type):
        """Set required and optional scenarios based on the submission type."""
        if submission_type == "datacenter":
            self.required = self.base["required-scenarios-datacenter"]
            self.optional = self.base["optional-scenarios-datacenter"]
        elif submission_type == "edge":
            self.required = self.base["required-scenarios-edge"]
            self.optional = self.base["optional-scenarios-edge"]
        elif submission_type in ["datacenter,edge", "edge,datacenter"]:
            self.required = self.base["required-scenarios-datacenter-edge"]
            self.optional = self.base["optional-scenarios-datacenter-edge"]
        else:
            raise ValueError("Invalid system type")

    # Other methods of Config go here...

2. `constants.py`

This module will contain all constants such as MODEL_CONFIG, VALID_DIVISIONS, and REQUIRED_FILES.

# constants.py

MODEL_CONFIG = {
    "v4.0": {
        # Same structure you had in your original script
    },
    "v4.1": {
        # Another version configuration
    }
}

VALID_DIVISIONS = ["open", "closed", "network"]
VALID_AVAILABILITIES = ["available", "preview", "rdi"]

REQUIRED_PERF_FILES = ["mlperf_log_summary.txt", "mlperf_log_detail.txt"]
OPTIONAL_PERF_FILES = ["mlperf_log_accuracy.json"]

# Add all other constants from the original script here...

3. `utils.py`

This module will hold utility functions that are used throughout the codebase, such as directory operations and file listing.

# utils.py

import os

def list_dir(*path):
    path = os.path.join(*path)
    return [f for f in os.listdir(path) if os.path.isdir(os.path.join(path, f))]

def list_files(*path):
    path = os.path.join(*path)
    return [f for f in os.listdir(path) if os.path.isfile(os.path.join(path, f))]

def list_files_recursively(*path):
    path = os.path.join(*path)
    return [
        os.path.join(dirpath, file)
        for dirpath, dirs, files in os.walk(path)
        for file in files
    ]

# Add any other file/directory-related utility functions here...

4. `checkers/accuracy.py`

This module will contain the functions related to checking accuracy.

# checkers/accuracy.py

import os
import re
from ..utils import list_files
from ..constants import ACC_PATTERN, MAX_ACCURACY_LOG_SIZE

def check_accuracy_dir(config, model, path, verbose):
    """Check accuracy files and logs."""
    is_valid = False
    all_accuracy_valid = True
    acc = None
    result_acc = {}
    hash_val = None
    target = config.get_accuracy_target(model)
    acc_upper_limit = config.get_accuracy_upper_limit(model)

    # Parsing and checking the accuracy
    # The rest of the check_accuracy_dir function remains unchanged

    return is_valid, result_acc

5. `checkers/performance.py`

This module will handle performance checking functions.

# checkers/performance.py

from ..constants import MS_TO_NS, S_TO_MS
from log_parser import MLPerfLog

def check_performance_dir(config, model, path, scenario_fixed, division, system_json, has_power=False):
    """Check performance directory and logs."""
    is_valid = False
    rt = {}

    # Load MLPerf log and validate
    mlperf_log = MLPerfLog(os.path.join(path, "mlperf_log_detail.txt"))

    # Continue with performance check logic...

    return is_valid, rt

6. `checkers/power.py`

This module will deal with checking power directories and logs.

# checkers/power.py

from log_parser import MLPerfLog
import os

def check_power_dir(power_path, ranging_path, testing_path, scenario_fixed, power_res_ranging, power_res_testing, config):
    """Check power metrics from power logs."""
    is_valid = True
    power_metric = 0

    # Check and validate power files...

    return is_valid, power_metric

7. `checkers/system_desc.py`

This module will include functions for checking system descriptions.

# checkers/system_desc.py

import os
import json
from ..constants import SYSTEM_DESC_REQUIRED_FIELDS

def check_system_desc_id(fname, systems_json, submitter, division, version, skip_meaningful_fields_emptiness_check):
    """Check system description file."""
    is_valid = True

    # Perform system description validation...

    return is_valid

8. `checkers/compliance.py`

This module will handle compliance-related checks.

# checkers/compliance.py

import os
from ..utils import list_files

def check_compliance_dir(compliance_dir, model, scenario, config, division, system_json, name):
    """Check compliance test directories and logs."""
    compliance_perf_pass = True

    # Perform compliance checks...

    return compliance_perf_pass

9. `main.py`

Finally, the main module will orchestrate the checking process by utilizing all the modularized functions and components.

# main.py

import argparse
from config import Config
from utils import list_dir, list_files
from checkers.accuracy import check_accuracy_dir
from checkers.performance import check_performance_dir
from checkers.power import check_power_dir
from checkers.system_desc import check_system_desc_id
from checkers.compliance import check_compliance_dir

def get_args():
    """Parse commandline arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="submission directory")
    parser.add_argument("--version", default="v4.1", choices=["v4.0", "v4.1"], help="mlperf version")
    parser.add_argument("--submitter", help="filter to submitter")
    parser.add_argument("--csv", default="summary.csv", help="csv file with results")
    args = parser.parse_args()
    return args

def main():
    args = get_args()
    config = Config(args.version, "model_mapping.json")

    # Rest of the logic for running the checks using modularized functions...

if __name__ == "__main__":
    main()

Explanation:

Modularization: The file has been split into logical units for configuration, utilities, constants, and specific checkers (accuracy, performance, power, etc.).
Reuse: Each checker module focuses on a specific aspect, making the code easier to maintain and extend in the future.
Main Orchestration: main.py coordinates everything by calling the appropriate functions from the modules.

This modularization makes the codebase easier to manage, debug, and extend for future versions of the MLPerf submission checker.

mlcommons / inference