mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks
https://mlcommons.org/en/groups/inference
Apache License 2.0

Nvidia 4.1 inference code is giving segmentation fault for RTX 4090 (4.0 code works fine) #1847

Open arjunsuresh opened 2 months ago

arjunsuresh commented 2 months ago

Trying to run Nvidia v4.1 implementation for stable diffusion on RTX 4090.

(mlperf) arjun@mlperf-inference-arjun-x86-64-24944:/work$ make generate_engines RUN_ARGS="--benchmarks=stable-diffusion-xl --scenarios=Offline"
make: *** [Makefile:37: generate_engines] Segmentation fault (core dumped)
make download_model BENCHMARKS="stable-diffusion-xl"

ran successfully and produced the int8 model. Below are the custom configs used for 2x RTX 4090.

class SPR(OfflineGPUBaseConfig):
    system = KnownSystem.spr

    # Applicable fields for this benchmark are listed below. Not all of these are necessary, and some may be defined in the BaseConfig already and inherited.
    # Please see NVIDIA's submission config files for example values and which fields to keep.
    # Required fields (Must be set or inherited to run):
    gpu_batch_size = {'clip1': 32 * 2, 'clip2': 32 * 2, 'unet': 32 * 2, 'vae': 1}
    offline_expected_qps: float = 1.0
    precision: str = 'int8'
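
For context, these custom config classes normally live in configs/stable-diffusion-xl/Offline/custom.py in NVIDIA's harness and only take effect once registered; a minimal sketch of the usual registration pattern, assuming the decorator and enum names from NVIDIA's v4.x config layout (they may differ between versions):

# Sketch only: decorator/enum names are assumed from NVIDIA's v4.x config layout.
from . import *


@ConfigRegistry.register(HarnessType.Custom, AccuracyTarget.k_99, PowerSetting.MaxP)
class SPR(OfflineGPUBaseConfig):
    system = KnownSystem.spr
    # ... gpu_batch_size, offline_expected_qps and precision as in the snippet above ...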
Oseltamivir commented 1 month ago

Hi,

May I ask if it is possible to test this with an fp16/fp32 model instead of int8?

Specifically, change precision: str = 'fp16' in the custom config and change SDXLVAEBuilder in code/stable-diffusion-xl/tensorrt/builder.py to the following:

class SDXLVAEBuilder(SDXLBaseBuilder,
                     ArgDiscarder):
    """SDXL VAE builder class.
    """

    def __init__(self,
                 *args,
                 component_name: str,
                 batch_size: int,
                 model_path: PathLike,
                 **kwargs):
        # Build the VAE engine from the plain ONNX export in fp32 instead of the
        # quantized path; a weakly typed network lets TensorRT pick layer precisions.
        vae_precision = 'fp32'
        vae_path = model_path + "onnx_models/vae/model.onnx"
        strongly_typed = False
        super().__init__(*args,
                         model=VAE(name=component_name, max_batch_size=batch_size, precision=vae_precision, device='cuda'),
                         model_path=vae_path,
                         batch_size=batch_size,
                         strongly_typed=strongly_typed,
                         use_native_instance_norm=True,
                         **kwargs)

then make generate_engines RUN_ARGS="--benchmarks=stable-diffusion-xl --scenarios=Offline"
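
For reference, the precision change in the custom config is just the one field; a sketch based on the SPR class quoted above (everything else stays the same):

class SPR(OfflineGPUBaseConfig):
    system = KnownSystem.spr

    gpu_batch_size = {'clip1': 32 * 2, 'clip2': 32 * 2, 'unet': 32 * 2, 'vae': 1}
    offline_expected_qps: float = 1.0
    precision: str = 'fp16'  # was 'int8'; the VAE itself is forced to fp32 by the builder change above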

I was unable to build the int8 models on an RTX 3090, but the fp16/fp32 models built fine and I got them running.

arjunsuresh commented 1 month ago

That's great. But the same code didn't work for me - I'm still getting a segmentation fault with fp16/fp32.

Oseltamivir commented 4 weeks ago

That's strange... personally I think the error is due to the model not being quantized/exported properly, or the data not being preprocessed properly.

If you still have your command history, could you send all the make commands here? Thanks

arjunsuresh commented 4 weeks ago

@Oseltamivir For me, the line below is the culprit. Commenting it out makes it work for me - though I haven't checked everything. I'll also try on different systems.

https://github.com/mlcommons/inference_results_v4.1/blob/main/closed/NVIDIA/code/actionhandler/calibrate.py#L24
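
If it helps narrow it down, a minimal check is to import the suspected dependency on its own inside the same container; a sketch, assuming the offending line at calibrate.py#L24 is the nvmitten import (the mitten package), which the next comment also points at:

# Sketch of a minimal repro, run inside the same MLPerf container.
# Assumption: the culprit line at calibrate.py#L24 is the nvmitten import.
import faulthandler
faulthandler.enable()  # dump a Python traceback if a fatal signal (e.g. SIGSEGV) is raised

import nvmitten  # hypothetical culprit; if this alone segfaults, the harness code is not at fault
print("nvmitten imported from", nvmitten.__file__)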

Oseltamivir commented 4 weeks ago

In that case it might be an issue caused by mitten.

But the maintainers don't seem to respond to issues posted in the mitten repo. I opened an issue there 3 months ago but got no reply; I ended up having to email Yiheng to ask about their implementation of mitten.

arjunsuresh commented 4 weeks ago

Yes, it is. We'll come back to it; currently we are using the Nvidia v4.0 code to collect the inference results via GitHub Actions.