n42n / n3n

Peer to Peer VPN

edge crashes on Windows, memory access violation (LZO decompression?) #33

Open aarojun opened 1 month ago

aarojun commented 1 month ago

Some other users and I are experiencing frequent crashes on n3n-edge 3.3.4 due to an access violation during lzo1x decompression. I'm unsure how to reproduce this for further debugging. If the decompression method is unsafe, I suppose one user sending corrupt packets could make a community momentarily unusable.

Issue https://github.com/ntop/n2n/issues/1165 could be related. We may test whether using an older version of n2n (pre 3.1.1) for edges on communities changes this behavior.

WinDbg !analyze -v result:


```
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************

KEY_VALUES_STRING: 1

    Key  : AV.Fault
    Value: Read

    Key  : Analysis.Elapsed.mSec
    Value: 479

    Key  : Analysis.IO.Other.Mb
    Value: 0

    Key  : Analysis.IO.Read.Mb
    Value: 0

    Key  : Analysis.IO.Write.Mb
    Value: 0

    Key  : Analysis.Init.Elapsed.mSec
    Value: 7206

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 81

    Key  : Failure.Bucket
    Value: INVALID_POINTER_READ_c0000005_n3n-edge.exe!Unknown

    Key  : Failure.Hash
    Value: {c9829084-8ed0-c0cc-a3a7-5d8477630a4e}

    Key  : Timeline.OS.Boot.DeltaSec
    Value: 25183

    Key  : Timeline.Process.Start.DeltaSec
    Value: 520

    Key  : WER.OS.Branch
    Value: ni_release

    Key  : WER.OS.Version
    Value: 10.0.22621.1

    Key  : WER.Process.Version
    Value: 3.0.0.10

FILE_IN_CAB: n3n-edge.exe.4000.dmp

NTGLOBALFLAG: 0

APPLICATION_VERIFIER_FLAGS: 0

CONTEXT: (.ecxr)
rax=000000820b5fd6f8 rbx=0000000000000069 rcx=0000000000000003
rdx=000000820b5f2828 rsi=000000820b5fcefd rdi=000000820b5fd700
rip=00007ff64e824c48 rsp=000000820b5fc728 rbp=000000820b5fc8fc
 r8=000000820b5fd6d0  r9=000000820b5fc798 r10=000000820b5fced0
r11=000000820b5f2828 r12=0000000000000000 r13=0000000000000008
r14=0000000066477365 r15=000000820b5fc8f0
iopl=0 nv up ei pl zr ac po nc
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010256
n3n_edge+0x34c48:
00007ff6`4e824c48 488b2a mov rbp,qword ptr [rdx] ds:00000082`0b5f2828=????????????????
Resetting default scope

EXCEPTION_RECORD: (.exr -1)
ExceptionAddress: 00007ff64e824c48 (n3n_edge+0x0000000000034c48)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 0000000000000000
   Parameter[1]: 000000820b5f2828
Attempt to read from address 000000820b5f2828

PROCESS_NAME: n3n-edge.exe

READ_ADDRESS: 000000820b5f2828

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s.

EXCEPTION_CODE_STR: c0000005

EXCEPTION_PARAMETER1: 0000000000000000

EXCEPTION_PARAMETER2: 000000820b5f2828

IP_ON_STACK:
+0
00000082`0b5fe53e 3b903f90b335 cmp edx,dword ptr [rax+35B3903Fh]

FRAME_ONE_INVALID: 1

STACK_TEXT:
00000082`0b5fc728 00000082`0b5fe53e : 00000000`00000069 00000082`0b5fd6d0 00000082`0b5fc8fc 00000243`38c67ab0 : n3n_edge+0x34c48
00000082`0b5fc730 00000000`00000069 : 00000082`0b5fd6d0 00000082`0b5fc8fc 00000243`38c67ab0 00000082`0b5fd6d0 : 0x00000082`0b5fe53e
00000082`0b5fc738 00000082`0b5fd6d0 : 00000082`0b5fc8fc 00000243`38c67ab0 00000082`0b5fd6d0 00007ff6`4e81a104 : 0x69
00000082`0b5fc740 00000082`0b5fc8fc : 00000243`38c67ab0 00000082`0b5fd6d0 00007ff6`4e81a104 00007ff6`4e841f3f : 0x00000082`0b5fd6d0
00000082`0b5fc748 00000243`38c67ab0 : 00000082`0b5fd6d0 00007ff6`4e81a104 00007ff6`4e841f3f 00000082`0b5fe53e : 0x00000082`0b5fc8fc
00000082`0b5fc750 00000082`0b5fd6d0 : 00007ff6`4e81a104 00007ff6`4e841f3f 00000082`0b5fe53e 00000000`00000079 : 0x00000243`38c67ab0
00000082`0b5fc758 00007ff6`4e81a104 : 00007ff6`4e841f3f 00000082`0b5fe53e 00000000`00000079 00000000`00000002 : 0x00000082`0b5fd6d0
00000082`0b5fc760 00007ff6`4e841f3f : 00000082`0b5fe53e 00000000`00000079 00000000`00000002 00000000`00000000 : n3n_edge+0x2a104
00000082`0b5fc768 00000082`0b5fe53e : 00000000`00000079 00000000`00000002 00000000`00000000 00000243`38c67ab0 : n3n_edge+0x51f3f
00000082`0b5fc770 00000000`00000079 : 00000000`00000002 00000000`00000000 00000243`38c67ab0 00000082`0b5fced0 : 0x00000082`0b5fe53e
00000082`0b5fc778 00000000`00000002 : 00000000`00000000 00000243`38c67ab0 00000082`0b5fced0 00000000`00000000 : 0x79
00000082`0b5fc780 00000000`00000000 : 00000243`38c67ab0 00000082`0b5fced0 00000000`00000000 00000082`0b5fc8f0 : 0x2

STACK_COMMAND: ~0s; .ecxr ; kb

SYMBOL_NAME: n3n_edge+34c48

MODULE_NAME: n3n_edge

IMAGE_NAME: n3n-edge.exe

FAILURE_BUCKET_ID: INVALID_POINTER_READ_c0000005_n3n-edge.exe!Unknown

OS_VERSION: 10.0.22621.1

BUILDLAB_STR: ni_release

OSPLATFORM_TYPE: x64

OSNAME: Windows 10

FAILURE_ID_HASH: {c9829084-8ed0-c0cc-a3a7-5d8477630a4e}

Followup: MachineOwner
---------
```

hamishcoleman commented 1 month ago

What leads you to believe this is a problem in the LZO? Do you have verbose logs from a crash?

aarojun commented 1 month ago

Sorry, the above was a poor example. I had a string of crashes where the failure bucket referenced lzo1x decompress. Sometimes this happened a moment after a peer joined a community or formed a connection. I'll try to reproduce with verbose logs. The core problem may still be something else.

hamishcoleman commented 1 month ago

Thanks! More data will certainly help dig into this.

Are you using the binaries from the release or compiling your own? (I want to check if the reported addresses should match up with the symbols we have)

NiKola-UE commented 1 month ago

@aarojun, it's good that you managed to run any N3N application on Windows at all, because I couldn't. I downloaded both, unblocked them, but nothing happens. Antimalware and antivirus tools didn't come up (I currently use Avira and Malwarebytes Anti-Malware), and even Windows Defender didn't flag anything as suspicious, but nothing happens, even when I run them as administrator. Why, I really don't know.

hamishcoleman commented 1 month ago

@NiKola-UE we test n3n on Windows regularly to prove that it is definitely working. You should create a new ticket and describe your situation in that new ticket so we can work on figuring out what is happening for you.

NiKola-UE commented 1 month ago

OK, I'll try again. I'm using Windows 11, but I don't know if that has anything to do with it. If there are problems again, I will open a new issue for it, although I have already said everything here. Maybe I missed something after all...

aarojun commented 1 month ago

@NiKola-UE in case it wasn't clear, it's not a drop-in replacement for n2n: edge.exe must be called with "edge.exe start" instead of just double-clicking, config files are located in user\n3n\, and the command syntax has changed from n2n. See the docs if these are causing issues. For me it generally runs without issue on Windows 11, just with different syntax (which is often more readable) compared to n2n. But I recommend starting a new issue.

For the memory access violation crashes I had the following crash today but don't have more details as I didn't have the debugger running and couldn't reproduce it on a short timeframe.

The vast majority of these crashes have referenced !lzo1x_decompress which is why I brought it up in the opening comment.


```
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************

KEY_VALUES_STRING: 1

    Key  : AV.Fault
    Value: Write

    Key  : Analysis.CPU.mSec
    Value: 359

    Key  : Analysis.Elapsed.mSec
    Value: 527

    Key  : Analysis.IO.Other.Mb
    Value: 0

    Key  : Analysis.IO.Read.Mb
    Value: 3

    Key  : Analysis.IO.Write.Mb
    Value: 0

    Key  : Analysis.Init.CPU.mSec
    Value: 46

    Key  : Analysis.Init.Elapsed.mSec
    Value: 7986

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 79

    Key  : Failure.Bucket
    Value: INVALID_POINTER_WRITE_c0000005_n3n-edge-v3.3.4.exe!lzo1x_decompress

    Key  : Failure.Hash
    Value: {c22f5744-d13e-e417-97fa-1dac132180ce}

    Key  : Timeline.OS.Boot.DeltaSec
    Value: 134897

    Key  : Timeline.Process.Start.DeltaSec
    Value: 194

    Key  : WER.OS.Branch
    Value: ni_release

    Key  : WER.OS.Version
    Value: 10.0.22621.1

FILE_IN_CAB: n3n-edge-v3.3.4.exe.24020.dmp

NTGLOBALFLAG: 0

APPLICATION_VERIFIER_FLAGS: 0

CONTEXT: (.ecxr)
rax=000000c834200006 rbx=0000000000000074 rcx=0000000000007450
rdx=000000c8341fe18c rsi=0000000000000000 rdi=000000c83420656e
rip=00007ff682c74d43 rsp=000000c8341fc958 rbp=0000000000000e8a
 r8=000000c8341fd900  r9=000000c8341fc9c8 r10=000000c8341fd100
r11=000000c8341ff07c r12=0000000000007458 r13=00000000000007a0
r14=0000000066491ffa r15=000000c8341fcb20
iopl=0 nv up ei pl nz na pe nc
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010200
n3n_edge_v3_3_4!lzo1x_decompress+0x1e3:
00007ff6`82c74d43 488970f8 mov qword ptr [rax-8],rsi ds:000000c8`341ffffe=????????????????
Resetting default scope

EXCEPTION_RECORD: (.exr -1)
ExceptionAddress: 00007ff682c74d43 (n3n_edge_v3_3_4!lzo1x_decompress+0x00000000000001e3)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 0000000000000001
   Parameter[1]: 000000c834200000
Attempt to write to address 000000c834200000

PROCESS_NAME: n3n-edge-v3.3.4.exe

WRITE_ADDRESS: 000000c834200000

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s.

EXCEPTION_CODE_STR: c0000005

EXCEPTION_PARAMETER1: 0000000000000001

EXCEPTION_PARAMETER2: 000000c834200000

STACK_TEXT:
000000c8`341fc958 00007ff6`82c6a104 : 00007ff6`82c91f3f 000000c8`341fe76e 00000000`00000084 00000000`00000002 : n3n_edge_v3_3_4!lzo1x_decompress+0x1e3
000000c8`341fc990 00007ff6`82c49bbe : 01daa96b`cf808048 0000023a`ea860468 00000000`00000000 00000000`00000000 : n3n_edge_v3_3_4!transop_decode_lzo+0x34
000000c8`341fc9e0 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : n3n_edge_v3_3_4!process_udp+0xa6e

STACK_COMMAND: ~0s; .ecxr ; kb

FAULTING_SOURCE_LINE: /home/runner/work/n3n/n3n/src/minilzo.c

FAULTING_SOURCE_FILE: /home/runner/work/n3n/n3n/src/minilzo.c

FAULTING_SOURCE_LINE_NUMBER: 5468

FAULTING_SOURCE_CODE:
No source found for '/home/runner/work/n3n/n3n/src/minilzo.c'

SYMBOL_NAME: n3n_edge_v3_3_4!lzo1x_decompress+1e3

MODULE_NAME: n3n_edge_v3_3_4

IMAGE_NAME: n3n-edge-v3.3.4.exe

FAILURE_BUCKET_ID: INVALID_POINTER_WRITE_c0000005_n3n-edge-v3.3.4.exe!lzo1x_decompress

OS_VERSION: 10.0.22621.1

BUILDLAB_STR: ni_release

OSPLATFORM_TYPE: x64

OSNAME: Windows 10

FAILURE_ID_HASH: {c22f5744-d13e-e417-97fa-1dac132180ce}

Followup: MachineOwner
---------
```

I believe the build here is the Windows x64 binary from https://github.com/n42n/n3n/commit/51eb3d7b9419ec129e04ea65018191ea65997a5a https://github.com/hamishcoleman/n3n/actions/runs/9088706213/artifacts/1503386924

Not sure why I had that on the edge, I'll default to the current release binary for future testing. But this may go on hold for me depending on how much use n3n is seeing

hamishcoleman commented 1 month ago

I suspect that the memory access issue is not in the lzo library, but in the pointers that have been handed to the lzo.
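
To illustrate the distinction drawn above, here is a minimal sketch of a decoder that treats the caller's output capacity as a hard limit. The run-length format, the function name `toy_decode_checked` and the error codes are hypothetical; this is not n3n or LZO code. The same idea is what miniLZO's `lzo1x_decompress_safe` offers over `lzo1x_decompress`: the caller passes the size of the destination buffer in `dst_len`, and the decoder returns an error instead of writing past it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy format: a sequence of (count, byte) pairs.
 * This is NOT the LZO format; it only illustrates the bounds check. */
enum {
    DECODE_OK = 0,
    DECODE_INPUT_TRUNCATED = -1,
    DECODE_OUTPUT_OVERRUN = -2
};

/* Decode src into dst.
 * On entry, *dst_len is the capacity of dst in bytes.
 * On success, *dst_len is set to the number of bytes written.
 * On failure, *dst_len is left unchanged. */
static int toy_decode_checked(const uint8_t *src, size_t src_len,
                              uint8_t *dst, size_t *dst_len)
{
    size_t cap = *dst_len;
    size_t in = 0;
    size_t out = 0;

    while (in < src_len) {
        /* Each record needs two input bytes. */
        if (src_len - in < 2)
            return DECODE_INPUT_TRUNCATED;

        size_t count = src[in];
        uint8_t value = src[in + 1];
        in += 2;

        /* Refuse a run that would not fit in the remaining space.
         * out never exceeds cap, so cap - out cannot underflow. */
        if (count > cap - out)
            return DECODE_OUTPUT_OVERRUN;

        memset(dst + out, value, count);
        out += count;
    }

    *dst_len = out;
    return DECODE_OK;
}
```

With an 8-byte output buffer, the input `{200, 'x'}` returns `DECODE_OUTPUT_OVERRUN` rather than writing 200 bytes. The check only helps if the capacity the caller passes in is itself correct, which is the point about the pointers and lengths handed to the decoder.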

eebssk1 commented 1 month ago

Related: https://github.com/ntop/n2n/issues/1165

hamishcoleman commented 1 month ago

Hi @eebssk1, do you have any logs or dumps or any way to reproduce this issue that you can share?

eebssk1 commented 1 month ago

> Hi @eebssk1, do you have any logs or dumps or any way to reproduce this issue that you can share?

I currently do not use n2n much. I have set the MTU to 1280, and I'm not getting the problem for now. It seems to happen more often if the network is unstable or a game is sending bursts of many small packets. I'll share more information the next time this happens.

[Anyway, I added an extra check for corrupted data so it logs the incident and silently continues, preventing the crash. As I said, I haven't seen the incident log yet.]

hamishcoleman commented 1 month ago

If your extra check is able to prevent the crash, are you able to share what you added?

eebssk1 commented 1 month ago

> If your extra check is able to prevent the crash, are you able to share what you added?

Nah, it's just a quick and dirty hack to prevent the crash by checking if the struct data is garbage: https://github.com/ntop/n2n/issues/1165#issuecomment-2041234363

According to that stack trace and memory dump, the function crashes at the end, so I added a check at the beginning to skip the entire function if the data is corrupt.
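
As a rough illustration of that kind of guard (not the actual patch, which is only available via the link above), the sketch below validates a descriptor before any processing and drops the packet with a log line if it looks corrupt. The struct `pkt_view`, the limit `PKT_MAX_LEN` and both function names are hypothetical and do not come from n2n or n3n.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical descriptor; not the real n2n/n3n structure. */
struct pkt_view {
    const uint8_t *data;   /* start of the payload */
    size_t len;            /* bytes of payload claimed to be valid */
    size_t capacity;       /* size of the underlying buffer */
};

/* Assumed upper bound on any buffer this code should ever see. */
#define PKT_MAX_LEN 2048u

/* Return true only if every field is internally consistent. */
static bool pkt_view_is_sane(const struct pkt_view *p)
{
    return p != NULL
        && p->data != NULL
        && p->len > 0
        && p->len <= p->capacity
        && p->capacity <= PKT_MAX_LEN;
}

/* Returns 0 if the packet was accepted, -1 if it was dropped. */
static int handle_packet(const struct pkt_view *p)
{
    /* Guard at the top: skip the whole function on corrupt input. */
    if (!pkt_view_is_sane(p)) {
        fprintf(stderr, "dropping packet: descriptor failed sanity check\n");
        return -1;
    }

    /* Real processing (decompression, forwarding) would go here. */
    return 0;
}
```

A guard like this avoids the crash and records that something went wrong, but it does not explain how the descriptor became corrupt in the first place.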

hamishcoleman commented 1 month ago

That is unfortunate, since I cannot see any specific buffer overflow in the code, nor am I able to reproduce any similar crash with heavy testing on Linux.

So, without more data or a reproducible case, I doubt we can make any progress here.

eebssk1 commented 1 month ago

> That is unfortunate, since I cannot see any specific buffer overflow in the code, nor am I able to reproduce any similar crash with heavy testing on Linux.
>
> So, without more data or a reproducible case, I doubt we can make any progress here.

I never get any problems on Linux; it seems to happen only on Windows. Anyway, when it happens I'll try to collect as much data as possible.

hamishcoleman commented 1 month ago

I will be trying to set up a better Windows test environment, but it is a lot more difficult than testing on Linux.

NiKola-UE commented 3 weeks ago

I would just like to complete what I already mentioned here (it does not need a new issue): in my case, it is more and more certain that antiviruses interfere with and block N3N and similar tools, preventing them from starting, working, or even being installed, by falsely recognizing them as potentially dangerous, suspicious, unwanted, fraudulent, harmful, etc. The instructions, tutorials and guides should mention this, and should also explain how to set up and configure it properly for those of us without advanced technical knowledge, who know little or nothing about programming. That is complex and demanding to write, but it would still be helpful and useful. Thank you.

eebssk1 commented 3 weeks ago

In my case, I never have any problem with zstd compression. If I switch to lzo, it crashes in ~30 seconds even without active connections. I'm currently still stuck with n2n (though I did additional work to make it build with mingw and OpenSSL 3.2).

hamishcoleman commented 3 weeks ago

@NiKola-UE It is very hard to include any clear statements about antivirus products as - pretty much by design - they are unclear on exactly how to bypass them. I would hope that there are clear log entries from any antivirus software making it obvious that they have blocked something.

hamishcoleman commented 3 weeks ago

@eebssk1 Can you outline what is keeping you on n2n? Perhaps there is something we can merge into n3n to make a migration easier?

eebssk1 commented 3 weeks ago

> @eebssk1 Can you outline what is keeping you on n2n? Perhaps there is something we can merge into n3n to make a migration easier?

The Windows GUI maker @happyntec does not have any interest in supporting n3n. So I made my own GUI, but it is still designed around the original n2n to ensure compatibility with many of my friends. I may add n3n support to it later, but that is unlikely to happen in the near future since n2n is currently stable enough for us.

However, if you could make n3n (in a new branch) a drop-in replacement that carries only bug fixes and no API/CLI breakage, that would be really appreciated.

eebssk1 commented 3 weeks ago

> I would just like to complete what I already mentioned here (it does not need a new issue): in my case, it is more and more certain that antiviruses interfere with and block N3N and similar tools, preventing them from starting, working, or even being installed, by falsely recognizing them as potentially dangerous, suspicious, unwanted, fraudulent, harmful, etc. The instructions, tutorials and guides should mention this, and should also explain how to set up and configure it properly for those of us without advanced technical knowledge, who know little or nothing about programming. That is complex and demanding to write, but it would still be helpful and useful. Thank you.

Do you have any explicit source indicating that AV/FW software blocks n2n? Mine is compiled by myself, UPX compressed and then self-signed (as is my GUI). Neither my friends' AV nor mine blocks it currently.

hamishcoleman commented 3 weeks ago

n3n is deliberately cleaning up the API and CLI, so there will not be the compatibility you are looking for.

I'd say that n2n is the unstable one: the number of stability bugs that have been found and fixed in n3n is only increasing.

It is a pity that none of these GUI systems were contributed to the repo, otherwise it would have been possible to forward port them along with the other n3n changes.

NiKola-UE commented 2 weeks ago

To conclude: I used Avast earlier, whose adware and false-positive alarms are really annoying, and now I use Avira, which, at least in the latest version, automatically blocks everything that looks like that, which I will have to deal with in detail myself. Maybe antiviruses and similar programs can sometimes cause these and similar crashes, which is what this issue primarily deals with, but I don't know that...

hamishcoleman commented 2 weeks ago

@NiKola-UE It does sound like we should have a different ticket to track reports of issues with antivirus software. Perhaps you could create one and add the logs and event messages generated by your antivirus software, so that they can be examined to see whether there are steps that can be taken to help avoid triggering them.

NiKola-UE commented 2 weeks ago

You're probably right. I will do so and record everything nicely when I have more time, but I think it's best to open a separate issue that will deal with it.