jjohnsnaill / SubworldLibrary

Dimensions in Terraria, made easy.
44 stars 24 forks source link

Pipes fail to open on Linux #36

Closed rehashedsalt closed 2 months ago

rehashedsalt commented 8 months ago

Issue Summary

Possibly related to #32, #33, #34. Using Subworld Lib v2.2.2. My execution environment is built from this repo and is a Docker container with no special privileges. Port 7777 is forwarded out.

Upon attempting to use the Spatial Disk to warp to the Observatory Hyperborea in TSA, I get this trace from server.log, after which my client is stuck on "Saving Map Data: 100%" and does not proceed to the subworld:

[14:14:14.558] [Main Thread/DEBUG] [tML]: Process.Start (UseShellExecute = True): "/terraria-server/dotnet/6.0.14/dotnet" tModLoader.dll -server -showserverconsole -world "/terraria/ModLoader/Worlds/World.wld" -subworld "StarsAbove_Observatory"
[14:14:14.571] [.NET ThreadPool Worker/WARN] [tML]: Silently Caught Exception: 
System.Net.Internals.SocketExceptionFactory+ExtendedSocketException: Connection refused /tmp/CoreFxPipe_StarsAbove_Observatory_IN
   at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
   at System.IO.Pipes.NamedPipeClientStream.TryConnect(Int32 timeout, CancellationToken cancellationToken)
   at System.IO.Pipes.NamedPipeClientStream.ConnectInternal(Int32 timeout, CancellationToken cancellationToken, Int32 startTime)
   at System.IO.Pipes.NamedPipeClientStream.Connect(Int32 timeout)
   at SubworldLibrary.SubserverLink.ConnectAndProcessQueue()
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()

Console mentions this:

Cannot assign requested address /tmp/CoreFxPipe_StarsAbove_Observatory_IN (see server.log for full trace)

The pipe mentioned does exist:

srwxr-xr-x 1 terraria terraria    0 Dec 22 14:14 CoreFxPipe_StarsAbove_Observatory_IN
srwxr-xr-x 1 terraria terraria    0 Dec 22 14:14 CoreFxPipe_StarsAbove_Observatory_OUT

From what it seems to me based on the timeline, the pipe is created, but the host server immediately assumes the subworld server will be able to handle connections the moment the process forks, gets a conn refused, and blows up. The time between Process.Start() forking and the worker thread exploding is only a few milliseconds.

Enabled Mods

jjohnsnaill commented 8 months ago

Subservers blowing up if the pipe doesn't work should be ok, it's vital to them actually working lol Why does it output Cannot assign requested address when the address exists/gets assigned? I'm not a Linux user, so I have no information about what makes this happen.

rehashedsalt commented 8 months ago

Subservers blowing up if the pipe doesn't work should be ok, it's vital to them actually working lol

The master is blowing up here, to be clear -- that's not an exception thrown by the subserver. Or at least I don't believe it is as it shows up too fast.

Why does it output Cannot assign requested address when the address exists/gets assigned? I'm not a Linux user, so I have no information about what makes this happen.

Socket creation is just a filesystem operation, but socket connection is a separate OS syscall, so their presence isn't a representation of the actual network state.

Worth nothing that these are what's called Unix Domain Sockets -- TL;DR: they're on the filesystem instead of operating over IP, and as such have slightly different behavior. This is relevant because "Cannot assign requested address" is the human-readable string for EADDRNOTAVAIL, but that error is only for IP sockets.

It might imply that the .NET pipe is misconfigured somehow.

If it helps any, I was able to get a full stacktrace for that error:

[14:21:58.269] [.NET ThreadPool Worker/WARN] [tML]: Silently Caught Exception: 
System.Net.Internals.SocketExceptionFactory+ExtendedSocketException: Cannot assign requested address /tmp/CoreFxPipe_StarsAbove_Observatory_IN
   at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
   at System.IO.Pipes.NamedPipeClientStream.TryConnect(Int32 timeout, CancellationToken cancellationToken)
   at System.IO.Pipes.NamedPipeClientStream.ConnectInternal(Int32 timeout, CancellationToken cancellationToken, Int32 startTime)
   at System.IO.Pipes.NamedPipeClientStream.Connect(Int32 timeout)
   at SubworldLibrary.SubserverLink.ConnectAndProcessQueue()
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()

The fact that this matches #34's stacktrace and that . is very definitely a legal character for any filename on Linux, including a socket, I'd wager that this is the same issue.

If you'd like a test environment to play with, just let me know -- I can replicate my setup on an isolated Hetzner box for you.

cc @aleforge

rehashedsalt commented 8 months ago

After some back-and-forth with AleForge, they're thinking it's a wide Linux-affecting bug:

The main problem with this issue is that I think there are some differences in the mod's code that only work with the Windows version of DotNet that the mod author hasn't addressed.

Me: Have any of your users had success using this mod? Them: Only when they do not visit a subworld.

I was also able to replicate this issue using just the client and the bundled server. Start client -> Multiplayer -> Host -> Warp to subworld via TSA -> Same crash, logs in ~/.local/share/Steam/steamapps/common/tModLoader/tModLoader-Logs/Old.

jjohnsnaill commented 8 months ago

The master is blowing up here, to be clear -- that's not an exception thrown by the subserver. Or at least I don't believe it is as it shows up too fast.

Oh right, missed the Process.Start. Still sort of applies to the main server though; should've added this a while ago but I'll make it forcefully return the clients to the main world if the pipe goes down.

"Cannot assign requested address" is the human-readable string for EADDRNOTAVAIL, but that error is only for IP sockets.

It's frustrating how little documentation I can find on how this stuff works! Nothing seems to be out of the ordinary with the pipe configuration.

rehashedsalt commented 8 months ago

It's frustrating how little documentation I can find on how this stuff works!

Absolutely! I tried poring over the .NET documentation for that pipe for a while and came up with nothing at all. No nuances into how the pipe functions across operating systems, no source for the reference implementation... Hell, it'd be great if the stacktrace threw anything more than "it broke".

Nothing seems to be out of the ordinary with the pipe configuration.

Yeah, you're not passing fancy args into the constructor or doing anything silly with it, so I'm not sure what the deal is.

Once the holidays have cooled down, I might see about getting a dev environment up to do some testing, but it's been years now since I've touched tML.

kenx00x commented 7 months ago

yeah got the same issue

CreepyCre commented 4 months ago

Tried having a look at this issue as well. I know fairly little about C# so take my guesstimates with a grain of salt.

The way NamedPipeClientStream and NamedPipeServerStream are utilized for the "[...]_IN" pipe seems a bit odd to me. From my limited understanding/ guesstimations NamedPipeServerStream will offer a pipe for a NamedPipeClientStream to attach to. So logically, you would want your NamedPipeServerStream to be awaiting a connection before a NamedPipeClientStream tries to attach. In the case of the "[...]_IN" pipe the NamedPipeClientStream seems to try to connect before a NamedPipeServerStream is spun up, resulting in the error message documented up above. My guess is the Windows implementation somehow doesn't care about the exact order of operations whereas the Linux implementation blows up.

I tried correcting this by simply swapping the NamedPipeClientStream and NamedPipeServerStream in the [...]_IN case, which seemingly fixes the error messages. I still didn't get the subworld to work (this is with Stars Above), though that might just be another mod incomatibility/ unrelated issue with the modpack I am running.

Another huge oversight that prevents looking into this issue further is the fact that when a subworld is spun up its process will take over the server.log archiving the one the parent process was writing to in the process which prevents looking into log messages produced by the parent server.

Should you need a linux server to run tests on I would be open to help out, otherwise there is always the option of running a linux container using Docker for Windows.

rehashedsalt commented 4 months ago

Another huge oversight that prevents looking into this issue further is the fact that when a subworld is spun up its process will take over the server.log archiving the one the parent process was writing to in the process which prevents looking into log messages produced by the parent server.

The bug should absolutely be fixed, but just for insight's sake, you can access the old server logs by looking at file descriptors. /proc/$PID/fd/{0,1,2} will still hold the stale FD of the old log file on the master server. You can less them or whatever to view the overwritten log.

CreepyCre commented 4 months ago

@rehashedsalt thanks for the tip. I can now confirm that my guess is correct, since there are no more error messages produced in either logs. In conclusion the order of operations should be:

  1. Create NamedPipeServerStream instances in the parent process, start threads that call WaitForConnection. (Or possibly use BeginWaitForConnection? Probably the better choice if it works the way I think it works.)
  2. Launch subworld process.
  3. Create NamedPipeClientStream instances on subworld process, call Connect on them.

This should ensure the pipes the subworld process tries to connect to are available at the appropriate time.

CreepyCre commented 4 months ago

My suggested fix doesn't actually work, my test was flawed since #40 caused mods to not actually load on the subserver.

jjohnsnaill commented 3 months ago

God damn it, there were no resources mentioning that this blows up specifically on Linux. Thank you so much for figuring this out! Will close this issue with the release of 2.2.2.2.