pg9182 / northstar-dedicated

Docker image for the Northstar dedicated server.
https://ghcr.io/pg9182/northstar-dedicated
zlib License
101 stars 17 forks source link

Servers intermittently failing to start #29

Closed pg9182 closed 2 years ago

pg9182 commented 2 years ago

Logs from instance finer-dormammu (457ea7c023e6):

--- snip ---
2022-05-11T22:05:54.485600351Z  nswrap: 3db9f64ef7797935daa47aba87436a10c7ce41d6a83ac378c364e22b2dc1ffc0
2022-05-11T22:05:54.485627286Z  nswrap: 
2022-05-11T22:05:54.485631616Z  nswrap: config
2022-05-11T22:05:54.485635274Z  nswrap:   PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2022-05-11T22:05:54.485638842Z  nswrap:   HOME=/home/northstar
2022-05-11T22:05:54.485642319Z  nswrap:   USER=(null)
2022-05-11T22:05:54.485645757Z  nswrap:   HOSTNAME=457ea7c023e6
2022-05-11T22:05:54.485649464Z  nswrap:   DISPLAY=xvfb
2022-05-11T22:05:54.485653175Z  nswrap:   WINEPREFIX=/home/northstar/.wine
2022-05-11T22:05:54.485656906Z  nswrap:   WINEDEBUG=(null)
2022-05-11T22:05:54.485660459Z  nswrap:   WINESERVER=(null)
2022-05-11T22:05:54.485664104Z  nswrap: 
2022-05-11T22:05:54.485667546Z  nswrap: system info:
2022-05-11T22:05:54.485671227Z  nswrap:   kernel: Linux 457ea7c023e6 5.4.0-84-generic #94-Ubuntu SMP Thu Aug 26 20:27:37 UTC 2021 x86_64
2022-05-11T22:05:54.485675010Z  nswrap:   processor: 4 cores
2022-05-11T22:05:54.485678715Z  nswrap:   memory: 8348397568 total, 4034752512 free, 2465792 shared, 30863360 buffer
2022-05-11T22:05:54.485682327Z  nswrap:   swap: 0 total, 0 free
2022-05-11T22:05:54.485686022Z  nswrap: 
2022-05-11T22:05:54.485859802Z  nswrap: using WINEDEBUG='fixme-secur32,fixme-bcrypt,fixme-ver,err-wldap32'
2022-05-11T22:05:54.485869053Z  nswrap: starting xvfb
2022-05-11T22:05:54.546217460Z  nswrap: xvfb started on display :0 with pid 16
2022-05-11T22:05:54.546255533Z  nswrap: starting wine
2022-05-11T22:05:54.546259554Z  nswrap: warning: xvfb terminated: exited with status 1
2022-05-11T22:05:54.546227555Z  (EE) 
2022-05-11T22:05:54.546295759Z  Fatal server error:
2022-05-11T22:05:54.546299809Z  (EE) Cannot write display number to fd 15
2022-05-11T22:05:54.546302400Z  (EE) 
2022-05-11T22:05:54.661885133Z  002c:err:wineboot:process_run_key Error running cmd L"C:\\windows\\system32\\winemenubuilder.exe -a -r" (2).
2022-05-11T22:05:55.588632860Z  [*] Loading stubs
2022-05-11T22:05:55.591654056Z  [*]   Loading GFSDK_SSAO.win64.dll
2022-05-11T22:05:55.594523452Z  [*]   Loading GFSDK_TXAA.win64.dll
2022-05-11T22:05:55.597491307Z  [*]   Loading d3d11.dll
2022-05-11T22:05:55.603497809Z  [*] Loading tier0.dll
2022-05-11T22:05:55.675831768Z  [*] Loading Northstar
2022-05-11T22:05:55.749746925Z  [2022-05-11 22:05:55.718] [info] Profile was not found in command line arguments. Using default: R2Northstar
2022-05-11T22:05:55.760466891Z  0024:err:kerberos:kerberos_LsaApInitializePackage no Kerberos support, expect problems
2022-05-11T22:05:55.789040898Z  0024:err:winediag:ntlm_check_version ntlm_auth was not found or is outdated. Make sure that ntlm_auth >= 3.0.25 is in your path. Usually, you can find it in the winbind package of your distribution.
2022-05-11T22:05:55.795982524Z  0024:err:ntlm:ntlm_LsaApInitializePackage no NTLM support, expect problems
2022-05-11T22:05:55.814486202Z  [22:05:55] [info] Enabling hook GetCommandLineAHook
2022-05-11T22:05:55.831579948Z  [22:05:55] [info] Enabling hook LoadLibraryExAHook
2022-05-11T22:05:55.843374783Z  [22:05:55] [info] Enabling hook LoadLibraryAHook
2022-05-11T22:05:55.857997772Z  [22:05:55] [info] Enabling hook LoadLibraryExWHook
2022-05-11T22:05:55.873518540Z  [22:05:55] [info] Enabling hook LoadLibraryWHook
2022-05-11T22:05:55.899822143Z  [22:05:55] [info] Command line: "Z:\tmp\ns2660680541\NorthstarLauncher.exe" -dedicated -port 37015 +setplaylist private_match
2022-05-11T22:05:55.906329919Z  0024:fixme:msvcp:_Locinfo__Locinfo_ctor_cat_cstr (00000000007CF100 1 C) semi-stub
2022-05-11T22:05:55.908433477Z  [22:05:55] [info] NorthstarLauncher version: 1.7.0.0
--- snip ---
2022-05-11T22:05:57.528425232Z  [22:05:57] [info] Enabling hook D3D11CreateDeviceHook
2022-05-11T22:05:58.001662738Z  d3d11: D3D11CreateDevice: initializing d3d11 stub for northstar (github.com/R2Northstar/NorthstarStubs)
2022-05-11T22:05:58.006131020Z  0024:fixme:ntdll:NtQuerySystemInformation info_class SYSTEM_PERFORMANCE_INFORMATION
2022-05-11T22:05:58.021129823Z  [22:05:58] [info] MountVPK vpk/client_frontend.bsp
2022-05-11T22:05:58.137268477Z  0024:fixme:keyboard:LoadKeyboardLayoutW name L"00000409", flags 0, semi-stub!
2022-05-11T22:05:58.235654189Z  0118:err:winediag:nodrv_CreateWindow Application tried to create a window, but no driver could be loaded.
2022-05-11T22:05:58.235705862Z  0118:err:winediag:nodrv_CreateWindow L"Make sure that your X server is running and that $DISPLAY is set correctly."
2022-05-11T22:05:58.251625650Z  nswrap: northstar exited with status 1
2022-05-11T22:05:58.251677207Z  nswrap: waiting for children to exit

Recently, this has been occurring more frequently.

pg9182 commented 2 years ago

cc @L1ghtman2k

pg9182 commented 2 years ago

This appears to be caused by a race condition when closing the displayfd.

In Xorg, it writes the display number, then a newline: https://sourcegraph.com/github.com/mirror/xserver@7b7170ecd636ae1110622e2430549f79598750ca/-/blob/os/connection.c?L189:9.

void
NotifyParentProcess(void)
{
#if !defined(WIN32)
    if (displayfd >= 0) {
        if (write(displayfd, display, strlen(display)) != strlen(display))
            FatalError("Cannot write display number to fd %d\n", displayfd);
        if (write(displayfd, "\n", 1) != 1)
            FatalError("Cannot write display number to fd %d\n", displayfd);
        close(displayfd);
        displayfd = -1;
    }
    if (RunFromSmartParent) {
        if (ParentProcess > 1) {
            kill(ParentProcess, SIGUSR1);
        }
    }
    if (RunFromSigStopParent)
        raise(SIGSTOP);
#endif
}

In nswrap, we close it after reading the display number: https://github.com/pg9182/northstar-dedicated/blame/57bdcff2bd7f6c80191a8dfb4f2647cca3d21374/src/nswrap/nswrap.c#L289.

To fix this, I'm probably going to add another read and/or a delay before closing the FD. As a more permanent fix which doesn't depend on implementation details of Xorg, I'll look into using the "smart parent" SIGUSR1 logic with a signalfd.