nwnxee / unified

Binaries available under the Releases tab on Github
https://nwnxee.github.io/unified
GNU General Public License v3.0

Fatal error: Segmentation fault (Unknown cause, sometimes upon switching areas) #1664

Open GMXanther opened 1 year ago

GMXanther commented 1 year ago

This happens in the midst of area transition. Currently the cause is unknown.

NWNX Signal Handler:

NWNX 8193.35-40 (b419e42) has crashed. Fatal error: Segmentation fault (11). Please file a bug at https://github.com/nwnxee/unified/issues

   Backtrace:
     /nwn/nwnx/NWNX_Core.so(_ZN7NWNXLib8Platform13GetStackTraceB5cxx11Eh+0x3b) [0x7f3ad8509b6b]
     /nwn/nwnx/NWNX_Core.so(nwnx_signal_handler+0xac) [0x7f3ad84b9b7c]
     /lib/x86_64-linux-gnu/libc.so.6(+0x3bf90) [0x7f3ad7fb7f90]
     /nwn/nwnx//NWNX_Player.so(+0xdb68) [0x7f3ad3cfcb68]
     ./nwserver(_ZN11CNWSMessage31SendServerToPlayerGameObjUpdateEP10CNWSPlayerji+0x1a1) [0x55714ba52d11]
     ./nwserver(_ZN21CServerExoAppInternal32UpdateClientGameObjectsForPlayerEP10CNWSPlayerim+0xb4) [0x55714ba7ae14]
     ./nwserver(_ZN21CServerExoAppInternal23UpdateClientGameObjectsEi+0x80) [0x55714ba7b0e0]
     ./nwserver(_ZN21CServerExoAppInternal8MainLoopEv+0x7ea) [0x55714ba8386a]
     ./nwserver(main+0x13a5) [0x55714b7ed535]
     /lib/x86_64-linux-gnu/libc.so.6(+0x2718a) [0x7f3ad7fa318a]
     /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f3ad7fa3245]
     ./nwserver(_start+0x2a) [0x55714b7f188a]

Shad000w commented 1 year ago

Finally.

I reported this almost 2 months ago, but because it is me, nobody even looked into it and they just told me it is related to whatever I am doing in my own nwnx. Maybe now the wonderful nwn-ee developers will finally look into this?

https://github.com/Beamdog/nwn-issues/issues/491

This crash has already happened 20 times on my server. It does not seem to be NWNX-related. The crash seems to be random; I was not able to reproduce it on demand even though I know the same thing you know: it happens during an area transition. However, it happens right after the transition is done, not in the middle of it, since the area OnEnter script has already fired.

It most likely happens when entering an instanced area where someone else already is (not conclusive). Across the 20 crash reports, there is nothing in any of my logging that points to the culprit. The player characters differ, and I see nothing in common between the characters that have caused this crash (except that many of them went to the same instanced area).

No NWScript script fires every time this happens, i.e. the crash happens outside of the NWScript "queue". Most of the time the last script is a random creature heartbeat, but sometimes it was a GUI event or an area enter (which wasn't related).

NWNX reports that the last function that fires is CanLevelUp, which makes sense as the function SendServerToPlayerGameObjUpdate is checking it. I don't hook any other function that this function uses, and the only thing I do with CanLevelUp is return 0 if a player is possessing a familiar or has a certain variable set. Neither was the case for any of the players who crashed it: in all cases the original function's value was returned (which was 0, as they were at max level).

Of course one cannot exclude the possibility that it really is NWNX-specific, but now that it has also crashed for an nwnx-unified user, that chance is much smaller. And even if it is caused by changing the player character in some way, it is obvious that the crash happens because of a missing pointer validity check inside the function.

Daztek commented 1 year ago

Nobody looked at it because nwn-issues is for basegame issues without nwnx or other third party things injected and because there's no clear easy repro.

Shad000w commented 1 year ago

Nobody looked at it because nwn-issues is for basegame issues without nwnx or other third party things injected and because there's no clear easy repro.

So you automatically disregarded it. Great. Really shows your professionalism to let this go into the stable version just because of that.

Even if it really is NWNX-related, which is just a theory (an NWNX crash looks different), I just wasn't able to reproduce it to prove whether it is a standard game issue or not. It takes 5 minutes to check "Did we change anything in the function it crashes in?".

Because if you did change that function, then the chance is pretty high, don't you think?

Would it be possible to send the source code of this function? I only have symbols for 1.69 to work with, so if the function is calling something new I wouldn't be able to hook it and check where it crashes.

GMXanther commented 1 year ago

I would appreciate a little less hostility in my issue thread, thanks... I've already been scared off enough from suggesting things or really talking much at all at this point. I'm thankful for any info about where the crash is coming from so I can try to avoid doing that to my players in the future, that's all.

mtijanic commented 1 year ago

@Shad000w: First off, the beamdog/nwn-issues repo has this line in the readme:

Are you reporting for a persistent world that is using NWNX? Please report NWNX-related bugs with them: https://github.com/nwnxee/unified. We will not address NWNXEE bugs.

Your issue should have been closed without comment based on that alone. Any additional feedback you received, including this post, is a pure courtesy.

Additionally, when you get a crash with NWNX running, it will also direct you to this repo. In the original post of this issue, it says:

Please file a bug at https://github.com/nwnxee/unified/issues

I noticed that in your private nwnx_patch.so you went through the trouble of removing that bit of instructions. Since the instructions are meant for you, you're free to remove them if you think you don't need them, but it looks to me like you do.

Because NWNX modifies the game in all sorts of unsupported ways, it is not feasible to even consider these crashes as possible basegame issues. If you believe they are, please provide a repro without any unexpected binary modifications in effect. Or provide concrete explanation of the bug with proof that it happens in basegame too.

Second, even for this project, we cannot and will not look at issues that involve closed-source, out-of-tree plugins (IIUC, nwnx_patch.so is not even a plugin, but a custom fork of nwnx_core!?). This has nothing to do with you personally, please don't play the victim. No one looks at who reported the issue, we look at the logs only.

Lastly, and I'm sad that this has to be spelled out, but NWNX is developed by unpaid volunteers, and no one owes you anything. We welcome (and look at) all bug reports, but there is no guarantee that anyone will actually spend their valuable time working on any particular issue. Any abuse of the goodwill from these volunteers will result in a ban.


@GMXanther - thank you for the bug report. As explained above, we can't provide any guarantees on when this will be looked at. Going by historic data from previous such crashes, I'd say a month or two, unless it starts showing up more frequently or it turns out to be obvious.

Either way, just the existence of the bug report is valuable info, so please report any other issues you hit too. :heart:

Thanks

niv commented 1 year ago

I would appreciate a little less hostility in my issue thread, thanks... I've already been scared off enough from suggesting things or really talking much at all at this point. I'm thankful for any info about where the crash is coming from so I can try to avoid doing that to my players in the future, that's all.

Hello!

Please don't feel discouraged. You're very welcome to engage on this forum and should under no circumstances feel scared away. We'll deal with the hostility, not you.

But we also want to help with this issue. To that end, some questions:

GMXanther commented 1 year ago

Hello and thank you both for jumping in with the understanding messages, I appreciate all your time and effort in all of what you do here! Luckily the crash hasn't reoccurred so far, despite some similar circumstances on the server.

Shad000w commented 1 year ago

Your issue should have been closed without comment based on that alone. Any additional feedback you received, including this post, is a pure courtesy.

How kind of you.

I noticed that in your private nwnx_patch.so you went through the trouble of removing that bit of instructions. Since the instructions are meant for you, you're free to remove them if you think you don't need them, but it looks to me like you do.

Lol, why would I do that? Why would I replace the text in the crash handler that says to report at "nwnx-unified" with "beamdog"? I forked it before you added this.

(IIUC, nwnx_patch.so is not even a plugin, but a custom fork of nwnx_core!?).

Correct. I say plugin, but it is actually a fork of nwnx-unified without plugin functionality: it is one monolithic library that does everything, with no modularity besides ini switches. I don't see how it is relevant though. I mean, are you saying that because I am not using the official nwnx-unified I am not welcome here at all and will not receive any support? Fair enough I guess, that explains a lot.

No one looks at who reported the issue, we look at the logs only.

I am sorry, but this is very hard to believe. I have a dozen issues on the Beamdog bug tracker that:

And for years my issues have remained totally ignored. Once, I logged on to the nwnx-unified Discord a few years after I left, searched the history, and saw another user reporting the same base NWN bug with traps there, and you discussing it. Some time later that user found my bug report on the Beamdog tracker and mentioned that there is a module where it can be easily tested. And both you and niv wrote that you never looked at that module (it had already been 2 years since I posted it) and even bragged about that like it's some virtue lol.

As far as I know, you were already on the Beamdog team back then. And even if you weren't, you are now both officially maintainers of NWN-EE and nwnx-unified. So can you explain to me why you are treating my bug reports this way?

I mean, look at this issue itself. Again, the same people now work on both NWN-EE and nwnx-unified. So you read my crash issue on the Beamdog bug tracker. Why didn't you write something like "hey man, you should have reported it on the nwnx-unified bug tracker, not here, please do that and we will look at it", or just simply look into it anyway even though I reported it in the wrong place, as you say? You admit that you read it, so why did you completely disregard it? Why not ask me the same questions you asked GMXanther here?

And btw, I don't use per-player VFX at all. I looked at NWNX_Player.nss, and the only things I have in my nwnx_patch.so are GetBicFileName and the show/hide progress bar function. The former is read-only; the latter was not used for the characters that crashed the server. And no players remained in game. I have screenshots from one of the players where he gets into the area but the whole area is black.

Lastly:

Because NWNX modifies the game in all sorts of unsupported ways, it is not feasible to even consider these crashes as possible basegame issues.

I disagree. There is clearly some unhandled null pointer. That is simply a bug whether it can be triggered with base game or not. Sure, NWNX can be used to do unexpected stuff, but I see no difference in this exact case with NWNX and classic custom content. Would you say the same if I added a custom prestige class that triggers this? And if no, why is it any different?

Anyway, I didn't formulate it well in my previous post. I suppose I didn't raise a question so I am now asking properly.

1) Did the NWN-EE patch team modify the culprit function in .35?
2) If yes, is it possible to get the raw source for this function?

I am not a professional like you guys, but I know how to track down and fix things. If you are not able or willing to fix it yourselves, I can do it for you if you lend me a hand.

Cjreek commented 1 year ago

Don't forget to put on your tinfoil hat. They're trying to get you!

But to be honest I couldn't fault anyone for ignoring your requests if this is how you write them. Every message is just flat out hostile and (passive) aggressive towards the people you're trying to get help from.

niv commented 1 year ago

@Shad000w: You have been consistently confrontational with us and our various projects for the past 5+ years. As a consequence, you are now blocked from posting on this forum. (This decision can be revisited in the medium future, at the team's discretion.)

You can keep using NWNX itself as long as you adhere to the license terms.

niv commented 1 year ago

Hello and thank you both for jumping in with the understanding messages, I appreciate all your time and effort in all of what you do here! Luckily the crash hasn't reoccurred so far, despite some similar circumstances on the server.

  • NWNX version: Yes, the standard build 8193.35.40 run through Docker with unified:latest. No plugin modifications, but there is heavy usage of NWNX functions in scripts.
  • Looping VFX: Yes, I did apply some of these to players during a DM interaction, via NWNX_Player_ApplyLoopingVisualEffectToObject. However, these were not active during the transition where the crash happened; they should have been removed well before that point, when the same function ran again to remove them. The looping VFX was a short-term effect that was removed by calling the same function again via DelayCommand a few seconds later. If some data from the VFX was retained after the point where it was dismissed, that would be strange and does seem like a potential culprit.
  • Player becoming invalid: This shouldn't be the case, the second time I saw this crash it was on my local test server so there was no reason it would've disconnected me.

Thanks for the clarification. It was a hunch about the player exiting and I'm not surprised it didn't track.

The one thing I can tell you here is that going by the presented stacktrace, it's definitely crashing inside the LoopingVFX code. From reading the code, it's not clear why it's happening.

If the crashes are frequent, then one way to test/workaround this would be to simply not use any of it. The code that handles it hooks at first call, so once it is used even once, the hook will always run. Since the reported crash is inside the hook, never calling it should have no chance of triggering this particular crash. Unfortunately, that doesn't help chase it down, just to avoid it (and you lose the functionality, of course).
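To illustrate the "hooks at first call" behaviour described above, here is a minimal C++ sketch. It is not NWNX's actual implementation: every name in it (`EngineTick`, `ApplyLoopingEffect`, `HookedUpdate`, and so on) is hypothetical, and a plain function pointer stands in for the real hooking mechanism. It only shows why a server that never calls the plugin function never runs the hook body at all.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an engine function that a plugin can intercept.
using UpdateFn = void (*)(int playerId);

// Records which code path ran, purely for illustration.
std::vector<std::string> g_log;

void OriginalUpdate(int /*playerId*/) { g_log.push_back("original"); }

// Slot the "engine" calls every tick; it starts out pointing at the original.
UpdateFn g_updateSlot = OriginalUpdate;
bool g_hookInstalled = false;

// Hypothetical hook body: plugin logic first, then the original function.
void HookedUpdate(int playerId)
{
    g_log.push_back("hook");
    OriginalUpdate(playerId);
}

// Hypothetical plugin API. The hook is installed lazily on first use and
// stays installed afterwards, even if the effect is later removed.
void ApplyLoopingEffect(int /*playerId*/)
{
    if (!g_hookInstalled) {
        g_updateSlot = HookedUpdate;
        g_hookInstalled = true;
    }
    // ... a real plugin would record the effect for the player here ...
}

// Hypothetical per-tick engine call.
void EngineTick(int playerId) { g_updateSlot(playerId); }
```

In this sketch, calling `EngineTick` before any `ApplyLoopingEffect` call runs only `OriginalUpdate`; after the first `ApplyLoopingEffect` call, every later tick goes through `HookedUpdate`.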

I'd ask you to turn on full coredumps (e.g. ulimit -c unlimited), but running inside Docker complicates that a bit. We're talking on Discord to figure out how to accomplish that easily. Please note: If you do generate a full coredump, don't share it with anyone except official developers: it contains a full memory copy of absolutely everything inside your gameserver process, including personal/private and module data.
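For reference, `ulimit -c` adjusts the process's `RLIMIT_CORE` resource limit. A rough in-process counterpart, sketched here in C++ with the POSIX `getrlimit`/`setrlimit` calls, raises the soft limit up to whatever hard limit is already in place. The function name `RaiseCoreLimitToHardMax` is made up for this example, and this is not something NWNX is stated to do. Unlike `ulimit -c unlimited`, it cannot raise the hard limit, so if the container caps the hard limit at 0, core dumps stay disabled.

```cpp
#include <sys/resource.h> // getrlimit, setrlimit, RLIMIT_CORE (POSIX)

// Raises the soft core-file size limit to the current hard limit.
// Returns true on success, false if either system call fails.
bool RaiseCoreLimitToHardMax()
{
    struct rlimit limit {};
    if (getrlimit(RLIMIT_CORE, &limit) != 0)
        return false;

    // An unprivileged process may always raise its soft limit up to the
    // hard limit; raising the hard limit itself needs extra privileges.
    limit.rlim_cur = limit.rlim_max;
    return setrlimit(RLIMIT_CORE, &limit) == 0;
}
```

Resource limits are inherited across `fork` and `exec`, so a small launcher could in principle call this before starting the server binary. Whether a core file is actually written also depends on the system's core pattern and the working directory being writable.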

Someone will get back to this ticket. We'd appreciate any updates on testing or further crash reports.

Daztek commented 1 year ago

@GMXanther You can try enabling the new NWNX_TWEAKS_FIX_AUTOMAP_CRASH tweak, I'm 99% sure that'll fix it.

GMXanther commented 1 year ago

After running for a while with NWNX_TWEAKS_FIX_AUTOMAP_CRASH enabled, I can confirm this particular crash seems to have stopped happening. However, we have a new one in its place, and I'm not sure whether it's related or not. This crash happens occasionally in scenarios when a creature's AI script causes it to use DestroyObject on itself or other creatures. (An example would be a boss that deletes its adds when it is in the process of death. It's not simultaneous, it's just part of the script that leads to the boss dying.)

  NWNX Signal Handler: 
 ==============================================================
  NWNX 8193.35-40 (b419e42) has crashed. Fatal error: Segmentation fault (11).
  Please file a bug at https://github.com/nwnxee/unified/issues
 ==============================================================

   Backtrace:
     /nwn/nwnx/NWNX_Core.so(_ZN7NWNXLib8Platform13GetStackTraceB5cxx11Eh+0x3b) [0x7fcffe0f7b6b]
     /nwn/nwnx/NWNX_Core.so(nwnx_signal_handler+0xac) [0x7fcffe0a7b7c]
     /lib/x86_64-linux-gnu/libc.so.6(+0x3bf90) [0x7fcffdba5f90]
     /lib/x86_64-linux-gnu/libc.so.6(__libc_free+0x1a) [0x7fcffdc02cda]
     ./nwserver(_ZN15CServerAIMaster15DeleteEventDataEjPv+0x99) [0x563a08f502c9]
     ./nwserver(_ZN15CServerAIMaster11UpdateStateEv+0x1c3) [0x563a08f511b3]
     ./nwserver(_ZN21CServerExoAppInternal8MainLoopEv+0x7dc) [0x563a08f6685c]
     ./nwserver(main+0x13a5) [0x563a08cd0535]
     /lib/x86_64-linux-gnu/libc.so.6(+0x2718a) [0x7fcffdb9118a]
     /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7fcffdb91245]
     ./nwserver(_start+0x2a) [0x563a08cd488a]
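
The frame names in these backtraces are Itanium C++ ABI mangled symbols. As a reading aid, the following C++ sketch turns them back into readable signatures using `abi::__cxa_demangle`, which GCC and Clang provide in `<cxxabi.h>`. The helper name `Demangle` is made up for this example; it assumes a GCC- or Clang-based toolchain.

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)
#include <cstdlib>   // std::free
#include <memory>
#include <string>

// Returns the readable form of an Itanium-ABI mangled symbol, or the
// input unchanged when it is not a valid mangled name.
std::string Demangle(const std::string& mangled)
{
    int status = 0;
    // __cxa_demangle returns a malloc'ed buffer that the caller must free.
    std::unique_ptr<char, void (*)(void*)> readable(
        abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status),
        std::free);
    return (status == 0 && readable) ? std::string(readable.get()) : mangled;
}
```

For example, `Demangle("_ZN15CServerAIMaster15DeleteEventDataEjPv")` yields `CServerAIMaster::DeleteEventData(unsigned int, void*)`, the frame just above `__libc_free` in the trace. The `c++filt` command-line tool performs the same translation.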