aten::_native_batch_norm_legit.no_stats_out Check failed (training == false): Portable kernels only support inference mode!

🐛 Describe the bug

This happens when I run executor_runner with my .pte model. It seems to fail on the torch.nn.BatchNorm2d op. Here is the trace:

./cmake-out/executor_runner --model_path 'model.pte'
I 00:00:00.007307 executorch:executor_runner.cpp:82] Model file model.pte is loaded.
I 00:00:00.007337 executorch:executor_runner.cpp:91] Using method forward
I 00:00:00.007341 executorch:executor_runner.cpp:138] Setting up planned buffer 0, size 81246208.
I 00:00:00.061899 executorch:executor_runner.cpp:161] Method loaded.
I 00:00:00.062504 executorch:executor_runner.cpp:171] Inputs prepared.
E 00:00:09.373685 executorch:op_native_batch_norm.cpp:200] Check failed (training == false): Portable kernels only support inference mode!
E 00:00:09.373724 executorch:method.cpp:1045] KernelCall failed at instruction 0:6733 in operator aten::_native_batch_norm_legit.no_stats_out: 0x12
E 00:00:09.373727 executorch:method.cpp:1051] arg 0 with type id 1
E 00:00:09.373729 executorch:method.cpp:1051] arg 1 with type id 1
E 00:00:09.373730 executorch:method.cpp:1051] arg 2 with type id 1
E 00:00:09.373732 executorch:method.cpp:1051] arg 3 with type id 5
E 00:00:09.373733 executorch:method.cpp:1051] arg 4 with type id 3
E 00:00:09.373734 executorch:method.cpp:1051] arg 5 with type id 3
E 00:00:09.373735 executorch:method.cpp:1051] arg 6 with type id 1
E 00:00:09.373737 executorch:method.cpp:1051] arg 7 with type id 1
E 00:00:09.373738 executorch:method.cpp:1051] arg 8 with type id 1
E 00:00:09.373746 executorch:method.cpp:1051] arg 9 with type id 9
F 00:00:09.373748 executorch:executor_runner.cpp:179] In function main(), assert failed (status == Error::Ok): Execution of method forward failed with status 0x12
Aborted (core dumped)

I am sure that I called model.eval() before exporting the model. Here is my export code:

import torch
from torch.export import export
from executorch.exir import to_edge

# Model is my own nn.Module (definition not shown here)
model = Model().eval()
sample = (torch.rand(1, 8, 512, 81),)
with torch.no_grad():
    aten_dialect = export(model, sample)
    edge_program = to_edge(aten_dialect)
    executorch_program = edge_program.to_executorch()
    with open("model.pte", "wb") as file:
        executorch_program.write_to_file(file)
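
One way to confirm which normalization ops end up in the exported program is to list them before writing the .pte file. The helper below is only a minimal sketch: `find_norm_nodes` is a hypothetical name, and it assumes nothing beyond an object with a `graph.nodes` iterable whose nodes carry `op`, `target`, and `name` attributes, as torch.fx graph modules do.

```python
def find_norm_nodes(graph_module, keywords=("batch_norm", "instance_norm")):
    """Return (node name, target string) pairs for normalization call sites.

    Hypothetical helper: relies only on `graph_module.graph.nodes`, where each
    node exposes `op`, `target`, and `name` (the torch.fx node interface).
    """
    hits = []
    for node in graph_module.graph.nodes:
        if node.op != "call_function":
            continue  # skip placeholders, outputs, get_attr, ...
        target = str(node.target)
        if any(key in target for key in keywords):
            hits.append((node.name, target))
    return hits


# Assumed usage with the export code above:
#   for name, target in find_norm_nodes(aten_dialect.graph_module):
#       print(name, target)
```

If an instance norm or a `no_stats` batch norm variant shows up in the list, that would point at a layer without running statistics rather than at an ordinary BatchNorm2d in eval mode.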

Versions

Collecting environment information...
PyTorch version: 2.5.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-2ubuntu1~18.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.24.0
Libc version: glibc-2.31

Python version: 3.10.0 (default, Mar  3 2022, 09:58:08) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.15.0-54-generic-x86_64-with-glibc2.31
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7K62 48-Core Processor
Stepping:            0
CPU MHz:             2595.114
BogoMIPS:            5190.22
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat

Versions of relevant libraries:
[pip3] executorch==0.4.0
[pip3] numpy==1.21.3
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] torch==2.5.0
[pip3] torchaudio==2.5.0
[pip3] torchsr==1.0.4
[pip3] torchvision==0.20.0
[pip3] triton==3.1.0
[conda] executorch                0.4.0                    pypi_0    pypi
[conda] numpy                     1.21.3                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
[conda] nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
[conda] torch                     2.5.0                    pypi_0    pypi
[conda] torchaudio                2.5.0                    pypi_0    pypi
[conda] torchsr                   1.0.4                    pypi_0    pypi
[conda] torchvision               0.20.0                   pypi_0    pypi
[conda] triton                    3.1.0                    pypi_0    pypi

I found the forward function of the _BatchNorm module in torch.nn:

class _BatchNorm(_NormBase):
    def __init__(
        self,
        num_features: int,
        eps: float = 1e-5,
        momentum: float = 0.1,
        affine: bool = True,
        track_running_stats: bool = True,
        device=None,
        dtype=None
    ) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super().__init__(
            num_features, eps, momentum, affine, track_running_stats, **factory_kwargs
        )

    def forward(self, input: Tensor) -> Tensor:
        self._check_input_dim(input)

        ...

        # Decide whether the mini-batch stats should be used for normalization rather than the buffers.
        # Mini-batch stats are used in training mode, and in eval mode when buffers are None.
        if self.training:
            bn_training = True
        else:
            bn_training = (self.running_mean is None) and (self.running_var is None)

        # Buffers are only updated if they are to be tracked and we are in training mode. Thus they only need to be
        # passed when the update should occur (i.e. in training mode when they are tracked), or when buffer stats are
        # used for normalization (i.e. in eval mode when buffers are not None).
        return F.batch_norm(
            input,
            # If buffers are not to be tracked, ensure that they won't be updated
            self.running_mean
            if not self.training or self.track_running_stats
            else None,
            self.running_var if not self.training or self.track_running_stats else None,
            self.weight,
            self.bias,
            bn_training,
            exponential_average_factor,
            self.eps,
        )

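This line may cause the problem above: bn_training = (self.running_mean is None) and (self.running_var is None). Even in eval mode, it sets bn_training to True when the module has no running_mean and running_var buffers.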
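
The decision can be isolated into a few lines of plain Python. This is a simplified sketch of the quoted logic, not the PyTorch implementation; `uses_batch_stats` is a hypothetical name introduced only for illustration.

```python
def uses_batch_stats(training, running_mean, running_var):
    """Mirror of the `bn_training` decision quoted above (simplified sketch)."""
    if training:
        return True
    # In eval mode, batch statistics are still used when no buffers exist.
    return running_mean is None and running_var is None


# eval mode with tracked buffers: normalizes with the stored statistics
print(uses_batch_stats(False, [0.0], [1.0]))  # False
# eval mode without buffers (track_running_stats=False): batch statistics
print(uses_batch_stats(False, None, None))    # True
```

So calling model.eval() is not enough to get training == false in the kernel call when a normalization layer carries no running statistics.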

It seems that aten::instance_norm, which has no running_mean and running_var buffers, is treated as training mode.

I originally thought the problem was in the torch.nn.BatchNorm2d op, but now I think it comes from aten::instance_norm.
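
Why a buffer-free instance norm looks like training mode can be seen from what the op computes: each (sample, channel) plane is normalized with its own mean and variance, so there are no stored statistics to fall back on. The plain-Python sketch below illustrates this under stated assumptions: `instance_norm_plane` is a hypothetical name, and the biased variance and eps of 1e-5 are assumed to match the nn.InstanceNorm2d defaults. As far as I can tell, ATen expresses this as a batch norm call that uses input statistics, which would explain the batch norm operator name in the error.

```python
import math


def instance_norm_plane(plane, eps=1e-5):
    """Normalize one (sample, channel) plane with its own statistics."""
    values = [v for row in plane for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in row] for row in plane]


out = instance_norm_plane([[1.0, 2.0], [3.0, 4.0]])
flat = [v for row in out for v in row]
print(abs(sum(flat)) < 1e-9)  # True: the output has zero mean
```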

pytorch / executorch

aten::_native_batch_norm_legit.no_stats_out Check failed (training == false): Portable kernels only support inference mode! #6632
