microsoft / mssql-docker

Official Microsoft repository for SQL Server in Docker resources
MIT License
1.72k stars 759 forks source link

Container keeps crashing with "Stack Overflow" #890

Closed fuero closed 2 months ago

fuero commented 2 months ago

The container image crashes continuously with "Stack Overflow"

OS: Debian Bookworm Image: mcr.microsoft.com/mssql/server:2022-CU13-ubuntu-22.04@sha256:c4369c38385eba011c10906dc8892425831275bb035d5ce69656da8e29de50d8 Kernel: 6.1.0-22-amd64 Tested as well with 6.6.13+bpo, 6.7.12+bpo - same result, with the -cloud variants as well. k3s version v1.30.2+k3s1 (aa4794b3) Tested with an empty database Same deployment runs fine on Debian 11 with kernel

2024-07-17 13:07:05.52 Server      Setup step is FORCE copying system data file 'C:\templatedata\model_replicatedmaster.mdf' to '/var/opt/mssql/data/model_replicatedmaster.mdf'.
2024-07-17 13:07:05.55 Server      Setup step is FORCE copying system data file 'C:\templatedata\model_replicatedmaster.ldf' to '/var/opt/mssql/data/model_replicatedmaster.ldf'.
2024-07-17 13:07:05.55 Server      Setup step is FORCE copying system data file 'C:\templatedata\model_msdbdata.mdf' to '/var/opt/mssql/data/model_msdbdata.mdf'.
2024-07-17 13:07:05.58 Server      Setup step is FORCE copying system data file 'C:\templatedata\model_msdblog.ldf' to '/var/opt/mssql/data/model_msdblog.ldf'.
2024-07-17 13:07:05.65 Server      Microsoft SQL Server 2022 (RTM-CU13) (KB5036432) - 16.0.4125.3 (X64)
May  1 2024 15:05:56 
Copyright (C) 2022 Microsoft Corporation
Developer Edition (64-bit) on Linux (Ubuntu 22.04.4 LTS) <X64>
2024-07-17 13:07:05.65 Server      UTC adjustment: 0:00
2024-07-17 13:07:05.65 Server      (c) Microsoft Corporation.
2024-07-17 13:07:05.65 Server      All rights reserved.
2024-07-17 13:07:05.65 Server      Server process ID is 688.
2024-07-17 13:07:05.65 Server      Logging SQL Server messages in file '/var/opt/mssql/log/errorlog'.
2024-07-17 13:07:05.65 Server      Registry startup parameters:
 -d /var/opt/mssql/data/master.mdf
 -l /var/opt/mssql/data/mastlog.ldf
 -e /var/opt/mssql/log/errorlog
2024-07-17 13:07:05.66 Server      SQL Server detected 1 sockets with 32 cores per socket and 32 logical processors per socket, 32 total logical processors; using 32 logical processors based on SQL Server licensing. This is an informational message; no user action is required.
2024-07-17 13:07:05.66 Server      SQL Server is starting at normal priority base (=7). This is an informational message only. No user action is required.
2024-07-17 13:07:05.66 Server      Detected 103052 MB of RAM. This is an informational message; no user action is required.
2024-07-17 13:07:05.66 Server      Using conventional memory in the memory manager.
2024-07-17 13:07:05.66 Server      Detected pause instruction latency: 87 cycles.
2024-07-17 13:07:05.66 Server      SQL Server detected the following NUMA node configuration (NUMA Node number 0, Processor Group number 0, CPU Mask 0x00000000ffffffff).
2024-07-17 13:07:05.67 Server      Page exclusion bitmap is enabled.
2024-07-17 13:07:05.77 Server      Buffer pool extension is not supported on Linux platform.
2024-07-17 13:07:05.77 Server      Buffer Pool: Allocating 16777216 bytes for 16097280 hashPages.
2024-07-17 13:07:05.92 Server      Default collation: SQL_Latin1_General_CP1_CI_AS (us_english 1033)
2024-07-17 13:07:05.92 Server      Automatic soft-NUMA was enabled because SQL Server has detected hardware NUMA nodes with greater than 8 physical cores.
2024-07-17 13:07:06.02 Server      Buffer pool extension is already disabled. No action is necessary.
2024-07-17 13:07:06.43 Server      Installing Client TLS certificates to the store.
2024-07-17 13:07:06.44 Server      CPU vectorization level(s) detected:  SSE SSE2 SSE3 SSSE3 SSE41 SSE42 AVX AVX2 POPCNT BMI1 BMI2
2024-07-17 13:07:06.51 Server      Successfully initialized the TLS configuration. Allowed TLS protocol versions are ['1.0 1.1 1.2']. Allowed TLS ciphers are ['ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-ECDSA-AES128-SHA:AES256-GCM-SHA384:AES128-GCM-SHA256:AES256-SHA256:AES128-SHA256:AES256-SHA:AES128-SHA:!DHE-RSA-AES256-GCM-SHA384:!DHE-RSA-AES128-GCM-SHA256:!DHE-RSA-AES256-SHA:!DHE-RSA-AES128-SHA'].
2024-07-17 13:07:06.54 Server      Query Store settings initialized with enabled = 1,
2024-07-17 13:07:06.56 Server      The maximum number of dedicated administrator connections for this instance is '1'
2024-07-17 13:07:06.57 Server      Node configuration: node 0: CPU mask: 0x00000000000000ff:0 Active CPU mask: 0x00000000000000ff:0. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.
2024-07-17 13:07:06.57 Server      Node configuration: node 1: CPU mask: 0x000000000000ff00:0 Active CPU mask: 0x000000000000ff00:0. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.
2024-07-17 13:07:06.57 Server      Node configuration: node 2: CPU mask: 0x0000000000ff0000:0 Active CPU mask: 0x0000000000ff0000:0. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.
2024-07-17 13:07:06.57 Server      Node configuration: node 3: CPU mask: 0x00000000ff000000:0 Active CPU mask: 0x00000000ff000000:0. This message provides a description of the NUMA configuration for this computer. This is an informational message only. No user action is required.
2024-07-17 13:07:06.64 Server      Using dynamic lock allocation.  Initial allocation of 2500 Lock blocks and 5000 Lock Owner blocks per node.  This is an informational message only.  No user action is required.
2024-07-17 13:07:06.64 Server      Lock partitioning is enabled.  This is an informational message only. No user action is required.
2024-07-17 13:07:06.68 Server      CLR version v4.0.30319 loaded.
2024-07-17 13:07:06.70 Server      In-Memory OLTP initialized on standard machine.
2024-07-17 13:07:06.79 Server      [INFO] Created Extended Events session 'hkenginexesession'
2024-07-17 13:07:06.80 Server      Database Instant File Initialization: enabled. For security and performance considerations see the topic 'Database Instant File Initialization' in SQL Server Books Online. This is an informational message only. No user action is required.
ForceFlush is enabled for this instance. 
2024-07-17 13:07:06.81 Server      Total Log Writer threads: 4. This is an informational message; no user action is required.
2024-07-17 13:07:06.82 Server      clwb is selected for pmem flush operation.
2024-07-17 13:07:06.82 Server      Software Usage Metrics is disabled.
2024-07-17 13:07:06.84 spid44s     [1]. Feature Status: PVS: 0. CTR: 0. ConcurrentPFSUpdate: 1. ConcurrentGAMUpdate: 1. ConcurrentSGAMUpdate: 1, CleanupUnderUserTransaction: 0. TranLevelPVS: 0
2024-07-17 13:07:06.84 spid44s     Starting up database 'master'.
2024-07-17 13:07:06.88 spid44s     There have been 256 misaligned log IOs which required falling back to synchronous IO.  The current IO is on file /var/opt/mssql/data/master.mdf.
{
    "reason": "0x00000006",
    "processName": "sqlservr",
    "pid": "9",
    "instanceId": "f5f32f6d-acf7-4760-a37d-30de55e2667b",
    "crashId": "669a5348-cb9e-4179-ba54-a4b07043289d",
    "threadState": "0x00007f7e97b5e630",
    "threadId": "292",
    "libosThreadId": "0x45c",
    "buildStamp": "9da4a512b2a09348c40cfe9ff7f072a7fa572ee5a1a90da8524b94fc4dd72186",
    "versions": [
        {
            "name": "",
            "buildId": "9da4a512b2a09348c40cfe9ff7f072a7fa572ee5a1a90da8524b94fc4dd72186",
            "buildType": "release",
            "src": "bbb8f17",
            "date": "Thu May 02 18:15:24 GMT 2024",
            "buildNum": "267480"
        },
        {
            "name": "PAL",
            "buildId": "302de0ccaa18002bf3c5926b620d648e3e8b5dc97e796364460a633caad33a25",
            "buildType": "release",
            "src": "a9fa1e3",
            "date": "Thu May 02 18:11:21 GMT 2024",
            "buildNum": "267480"
        }
    ],
    "message": "Stack Overflow",
    "address": "0x3fff9561e08b",
    "stackTrace": [
        "file://package6/windows/system32/sqlpal.dll+0x0000000000209C12",
        "file://package6/windows/system32/sqlpal.dll+0x0000000000208156",
        "file://package6/windows/system32/sqlpal.dll+0x0000000000208272",
        "file://package6/windows/system32/sqlpal.dll+0x000000000021E08B",
        "file://package6/windows/system32/sqlpal.dll+0x000000000021E283",
        "file://package6/windows/system32/sqlpal.dll+0x00000000003A0569"
    ],
    "stackModules": [
        "file://package6/windows/system32/sqlpal.dll=be2c87c739c4b608f1f03f3e59877de11,D:\\dbs\\sh\\9h43\\0502_103623\\LibOS\\Windows\\Kernel\\Init\\regular\\obj\\amd64\\sqlpal.pdb"
    ],
    "last_errno": "22",
    "last_errno_text": "Invalid argument",
    "distribution": "Ubuntu 22.04.4 LTS",
    "processors": "32",
    "total_memory": "135071617024",
    "timestamp": "Wed Jul 17 13:07:40 2024"
}
fuero commented 2 months ago

So after a "little" more digging, I found the cause - the volume mount's block size.

This message was the dead giveaway, masquerading as an information level message:

There have been 256 misaligned log IOs which required falling back to synchronous IO.  The current IO is on file /var/opt/mssql/data/master.mdf.

see Technet on this issue.

My Kubernetes cluster's CSI driver provisioned block devices with 16K block size, but SQL Server apparantly insists on block sizes <= 4K.

Reconfiguring the CSI driver for this particular PVC made the error vanish.