ntop / n2n

Peer-to-peer VPN
GNU General Public License v3.0

coredump for the latest version on openwrt #991

Closed galaxyskyknight closed 2 years ago

galaxyskyknight commented 2 years ago

It happened after a mwan3 re-apply and a firewall reload. Somehow the edge exited and dumped core. It should not behave this way.

Sun May  8 01:58:49 2022 kern.info kernel: [ 1229.539617] traps: edge[327] trap stack segment ip:41a686 sp:7ffe09873030 error:0 in edge[403000+80000]
Sun May  8 01:58:49 2022 daemon.info avahi-daemon[6009]: Interface n2n.IPv6 no longer relevant for mDNS.
Sun May  8 01:58:49 2022 daemon.info avahi-daemon[6009]: Leaving mDNS multicast group on interface n2n.IPv6 with address fe80::290:10ff:fe00:2.
Sun May  8 01:58:49 2022 daemon.info : 12[KNL] interface n2n deactivated
Sun May  8 01:58:49 2022 daemon.info : 14[KNL] fe80::290:10ff:fe00:2 disappeared from n2n
Sun May  8 01:58:49 2022 daemon.info : 10[KNL] 10.0.0.2 disappeared from n2n
Sun May  8 01:58:49 2022 daemon.info : 12[KNL] interface n2n deleted
galaxyskyknight commented 2 years ago

Also I see the following log: Sun May 8 02:07:13 2022 daemon.info n2n[546]: WARNING: sendto(x.x.x.x:7681) failed (1) Operation not permitted

It seems the firewall tries to deny n2n access to the peer. But in any case, the n2n code should not run into a core dump. I guess there must be some unhandled exception in the socket or I/O handling in this special case.

There are 3 additional core dump traps from the kernel:

[ 1229.539617] traps: edge[327] trap stack segment ip:41a686 sp:7ffe09873030 error:0 in edge[403000+80000]
[ 1733.745667] traps: edge[546] trap stack segment ip:41a686 sp:7ffdffec47b0 error:0 in edge[403000+80000]
[ 1884.042965] traps: edge[13350] trap stack segment ip:41a686 sp:7ffc20df6d10 error:0 in edge[403000+80000]
hamishcoleman commented 2 years ago

You can see from the trap errors that the same IP is causing the trap each time. Without a symbol table from your compiled binary, we cannot tell where that IP is in the code. Are you able to run the edge with gdb and capture a backtrace? Is your build process automated and documented anywhere? Do you have the core dump file that was generated?

In case there was any doubt, we do not intend for the code to fail and trap and we want to fix that, but it is going to be practically impossible without some assistance from you.
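
To make the point about the repeated IP concrete, here is a minimal sketch that pulls the fields out of the three trap lines quoted above and computes the offset of the faulting instruction within the mapped region. It assumes the x86 Linux `traps:` line layout exactly as it appears in these logs; the helper name `parse_trap` and the regex group names are hypothetical and not part of n2n or the kernel.

```python
import re

# Pattern for the x86 "traps:" lines quoted in this thread. The group names
# are illustrative only.
TRAP_RE = re.compile(
    r"traps: (?P<comm>\S+)\[(?P<pid>\d+)\] trap (?P<trap>.+?) "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:(?P<error>[0-9a-f]+)"
    r"(?: in (?P<image>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def parse_trap(line):
    """Return the fields of one kernel trap line, or None if it does not match."""
    m = TRAP_RE.search(line)
    if m is None:
        return None
    ip = int(m.group("ip"), 16)
    info = {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "trap": m.group("trap"),
        "ip": ip,
    }
    if m.group("base") is not None:
        base = int(m.group("base"), 16)
        info["image"] = m.group("image")
        info["base"] = base
        # Offset of the faulting instruction inside the mapped region.
        info["offset"] = ip - base
    return info


LINES = [
    "[ 1229.539617] traps: edge[327] trap stack segment ip:41a686 sp:7ffe09873030 error:0 in edge[403000+80000]",
    "[ 1733.745667] traps: edge[546] trap stack segment ip:41a686 sp:7ffdffec47b0 error:0 in edge[403000+80000]",
    "[ 1884.042965] traps: edge[13350] trap stack segment ip:41a686 sp:7ffc20df6d10 error:0 in edge[403000+80000]",
]

for line in LINES:
    info = parse_trap(line)
    print(info["pid"], hex(info["ip"]), hex(info["offset"]))
```

All three processes report the same `ip` value, which is what suggests a single faulting location, but the number only maps to a source line once a binary with debug symbols is available.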

galaxyskyknight commented 2 years ago

Please refer to issue #980, reported 11 days ago; it is the same problem. As I said there, this issue was evidently introduced by the code between https://github.com/ntop/n2n/commit/f3e305b254fc88ce829ddb4a63b11e083a65c3ab and the latest commit (my guess is that it is very likely related to the change in https://github.com/ntop/n2n/commit/009311d016bf27f40259e6bb992ce4a78af24424). I have confirmed that with commit https://github.com/ntop/n2n/commit/f3e305b254fc88ce829ddb4a63b11e083a65c3ab there is no core dump at all, whether or not I run make clean, but as soon as I use the latest commit or any later one, the issue happens. I suspect there is actually nothing more that can be done from my side, but you can help by inspecting the code and, if possible, try more exceptional test cases like this one, for example with this scenario: set up a real edge environment and use an iptables restart/reload to simulate blocking and unblocking the edge's access, and see whether that triggers the issue. I guess this is the tricky part, and you can use gdb and core dumps in your own setup.

As for what you asked, I really cannot help unless you can tell me how to set up gdb in an OpenWrt environment or how to collect a core dump.

I hope this can be addressed and resolved by you guys; otherwise I will be blocked here and cannot upgrade to later feature releases, for the sake of deployment stability.

thanks for your great help.

Logan007 commented 2 years ago

I suspect there is actually nothing more that can be done from my side, but you can help by inspecting the code

I wonder if maybe a ./configure CFLAGS="-fsanitize=address -g", make clean and make would already generate some more helpful output.

Also, what are the details of your build process?

try more exceptional test cases like this one, for example with this scenario

We definitely are not able to test all possible hardware and software scenarios. That's why we need help from you!

otherwise I will be blocked here and cannot upgrade to later feature releases, for the sake of deployment stability

And please keep in mind that "dev" is not "latest stable".

Logan007 commented 2 years ago

OK, I will try this. What does this help with?

It should add debug symbols and produce more meaningful output when crashing.

It is the general OpenWrt x86 build process, fully automatic, nothing special. If you need the Makefile, I can attach it.

Yes, it would be interesting to see.

BTW: I tried again today; once I go back to the previous commit, the coredump is gone.

But I have to ask again, have you ever tried it with make clean before running make again? I have to repeat myself because I only see these strange and unexplainable crashes, seg faults and so on when I forget to run make clean before running make again.

Sorry, I did not intend to offend your professionalism...

🧡

Oh, don't worry, we are way too professional to feel offended by anything 😉

galaxyskyknight commented 2 years ago

OK, I will try this. What does this help with?

It should add debug symbols and produce more meaningful output when crashing.

It is the general OpenWrt x86 build process, fully automatic, nothing special. If you need the Makefile, I can attach it.

Yes, it would be interesting to see.

BTW: I tried again today; once I go back to the previous commit, the coredump is gone.

But I have to ask again, have you ever tried it with make clean before running make again? I have to repeat myself because I only see these strange and unexplainable crashes, seg faults and so on when I forget to run make clean before running make again.

Sure, I wrote it into the Makefile.

Sorry, I did not intend to offend your professionalism...

🧡

Oh, don't worry, we are way too professional to feel offended by anything 😉

:P

galaxyskyknight commented 2 years ago

There are a lot of link errors when I add the CFLAGS. Here are just the last lines, as there are too many:

/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:68: undefined reference to `__asan_report_store4'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:71: undefined reference to `__asan_report_load8'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:71: undefined reference to `__asan_report_load4'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: libn2n.a(management.o): in function `mgmt_verbose':
/home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:76: undefined reference to `__asan_report_load4'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:77: undefined reference to `__asan_report_load8'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: libn2n.a(management.o): in function `mgmt_event_post2':
/home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:90: undefined reference to `__asan_option_detect_stack_use_after_return'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:90: undefined reference to `__asan_stack_malloc_2'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:93: undefined reference to `__asan_report_load4'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:93: undefined reference to `__asan_report_load4'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:101: undefined reference to `__asan_report_store8'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:104: undefined reference to `__asan_report_load4'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:112: undefined reference to `__asan_report_load4'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:115: undefined reference to `__asan_report_load4'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: libn2n.a(management.o): in function `mgmt_help_row':
/home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:129: undefined reference to `__asan_report_load8'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: libn2n.a(management.o): in function `mgmt_help_events_row':
/home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:142: undefined reference to `__asan_option_detect_stack_use_after_return'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:142: undefined reference to `__asan_stack_malloc_2'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:147: undefined reference to `__asan_report_load4'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:152: undefined reference to `__asan_report_store1'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:153: undefined reference to `__asan_report_store1'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:154: undefined reference to `__asan_report_store1'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:155: undefined reference to `__asan_report_store1'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:160: undefined reference to `__asan_report_load8'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: libn2n.a(management.o): in function `mgmt_auth':
/home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:191: undefined reference to `__asan_report_load8'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:197: undefined reference to `__asan_report_load4'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: libn2n.a(management.o): in function `mgmt_req_init2':
/home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:215: undefined reference to `__asan_report_store1'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:216: undefined reference to `__asan_report_store1'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:217: undefined reference to `__asan_report_store1'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:225: undefined reference to `__asan_report_load1'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:226: undefined reference to `__asan_report_store4'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:227: undefined reference to `__asan_report_load1'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:228: undefined reference to `__asan_report_store4'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:229: undefined reference to `__asan_report_load1'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:230: undefined reference to `__asan_report_store4'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:243: undefined reference to `__asan_report_store8'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:253: undefined reference to `__asan_report_store8'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:260: undefined reference to `__asan_report_store1'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: libn2n.a(management.o): in function `_GLOBAL__sub_D_00099_0_send_reply':
/home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:280: undefined reference to `__asan_unregister_globals'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: libn2n.a(management.o): in function `_GLOBAL__sub_I_00099_1_send_reply':
/home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:280: undefined reference to `__asan_init'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:280: undefined reference to `__asan_version_mismatch_check_v8'
/home/builder/lede_x86/staging_dir/toolchain-x86_64_gcc-8.4.0_musl/lib/gcc/x86_64-openwrt-linux-musl/8.4.0/../../../../x86_64-openwrt-linux-musl/bin/ld: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/src/management.c:280: undefined reference to `__asan_register_globals'
collect2: error: ld returned 1 exit status
make[4]: *** [<builtin>: src/edge] Error 1
make[4]: Leaving directory '/home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964'
make[3]: *** [Makefile:91: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/.built] Error 2
make[3]: Leaving directory '/home/builder/lede_x86/package/lean/n2n_v2'
time: package/lean/n2n_v2/compile#17.66#5.38#27.12
    ERROR: package/lean/n2n_v2 failed to build.
make[2]: *** [package/Makefile:116: package/lean/n2n_v2/compile] Error 1
make[2]: Leaving directory '/home/builder/lede_x86'
make[1]: *** [package/Makefile:110: /home/builder/lede_x86/staging_dir/target-x86_64_musl/stamp/.package_compile] Error 2
make[1]: Leaving directory '/home/builder/lede_x86'
make: *** [/home/builder/lede_x86/include/toplevel.mk:230:world] Error 2
hamishcoleman commented 2 years ago

Sorry, we left off a config step - you also need to configure with LDFLAGS="-fsanitize=undefined -static-libubsan"

galaxyskyknight commented 2 years ago

transform_zstd.o  src/transform_aes.o  src/pearson.o  src/supernode.o  src/example_edge_embed_quick_edge_init.o  src/cc20.o  src/tuntap_netbsd.o  src/edge_management.o  src/edge_utils.o  src/n2n.o  src/tuntap_freebsd.o  src/transform_lzo.o  src/n2n_port_mapping.o  src/random_numbers.o  src/speck.o  src/sn_management.o  src/minilzo.o  src/transform_tf.o  src/example_sn_embed.o  src/edge_utils_win32.o  src/transform_cc20.o  src/hexdump.o  src/tuntap_linux.o  src/n2n_regex.o  src/transform_null.o  src/curve25519.o  src/aes.o  src/sn_utils.o  src/header_encryption.o  src/transform_speck.o  src/edge.o
x86_64-openwrt-linux-musl-gcc -fsanitize=undefined -static-libubsan -pthread -L.  src/edge.o libn2n.a   -ln2n -lpcap -lnatpmp -lminiupnpc -lcrypto  -lcap -o src/edge
x86_64-openwrt-linux-musl-gcc -fsanitize=undefined -static-libubsan -pthread -L.  src/supernode.o libn2n.a   -ln2n -lpcap -lnatpmp -lminiupnpc -lcrypto  -lcap -o src/supernode
x86_64-openwrt-linux-musl-gcc: error: libsanitizer.spec: No such file or directory
make[4]: *** [<builtin>: src/edge] Error 1
make[4]: Leaving directory '/home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964'
make[3]: *** [Makefile:91: /home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-23e168b9551258983a4187357a4fcb57d060f964/.built] Error 2
make[3]: Leaving directory '/home/builder/lede_x86/package/lean/n2n_v2'
time: package/lean/n2n_v2/compile#22.03#7.93#23.43
    ERROR: package/lean/n2n_v2 failed to build.
make[2]: *** [package/Makefile:116: package/lean/n2n_v2/compile] Error 1
hamishcoleman commented 2 years ago

Unfortunately, it is clear that the build environment provided by lede does not support the sanitizer options.

In order to narrow down anything about the error you are experiencing, there are a couple of options.

Since you have posted that you have seen the full error message from the sendto log output, it is unlikely that the core of the patch you have pointed at is at fault, as most of the new code is run before it outputs the log line.

galaxyskyknight commented 2 years ago

It still happens with your latest code, https://github.com/ntop/n2n/commit/a274818854c01008a4106a44fd5ecd33d14091a4. I cannot do more debugging, but the following log shows a clear scenario of how it happens. Based on the log, I guess it may be related to the n2n interface flapping between active and inactive in the kernel, which the program does not handle well and which probably causes the core dump.

Wed May 18 10:50:30 2022 daemon.notice netifd: Interface 'n2n' is enabled
Wed May 18 10:50:30 2022 daemon.notice netifd: Network device 'n2n' link is up
Wed May 18 10:50:30 2022 daemon.notice netifd: Interface 'n2n' has link connectivity
Wed May 18 10:50:30 2022 daemon.notice netifd: Interface 'n2n' is setting up now
Wed May 18 10:50:30 2022 daemon.notice procd: /etc/rc.d/S99n2n_v2: 18/May/2022 10:50:30 [edge.c:1261] created local tap device IP: 10.0.0.2, Mask: 255.255.255.0, MAC: 00:90:10:00:00:02
Wed May 18 10:50:30 2022 daemon.info n2n[12803]: parent process is exiting (this is normal)
Wed May 18 10:50:30 2022 daemon.info n2n[12813]: WARNING: running as root is discouraged, check out the -u/-g options
Wed May 18 10:50:30 2022 daemon.info : 12[KNL] interface n2n activated
Wed May 18 10:50:30 2022 daemon.notice netifd: Interface 'n2n' is now up
Wed May 18 10:50:30 2022 daemon.info n2n[12813]: edge started
Wed May 18 10:50:30 2022 daemon.info n2n[12813]: WARNING: failed to bind to local multicast group 224.0.0.68:1968 [errno 19]
Wed May 18 10:50:30 2022 daemon.info n2n[12813]: WARNING: sendto(*.*.*.*:10086) failed (101) Network unreachable
Wed May 18 10:50:30 2022 daemon.info : 09[KNL] 10.0.0.2 appeared on n2n
Wed May 18 10:50:30 2022 daemon.info : 12[KNL] 10.0.0.2 disappeared from n2n
Wed May 18 10:50:30 2022 daemon.info : 09[KNL] 10.0.0.2 appeared on n2n
Wed May 18 10:50:30 2022 kern.info kernel: [   49.145834] traps: edge[12813] trap stack segment ip:466f14 sp:7ffe4a1af390 error:0 in edge[403000+80000]
Wed May 18 10:50:30 2022 daemon.info : 14[KNL] interface n2n deactivated
Wed May 18 10:50:30 2022 daemon.notice netifd: Network device 'n2n' link is down
Wed May 18 10:50:30 2022 daemon.notice netifd: Interface 'n2n' has link connectivity loss
Wed May 18 10:50:30 2022 daemon.notice netifd: Interface 'n2n' is now down
Wed May 18 10:50:30 2022 daemon.info : 10[KNL] 10.0.0.2 disappeared from n2n
Wed May 18 10:50:30 2022 daemon.info : 13[KNL] interface n2n deleted
Wed May 18 10:50:30 2022 daemon.notice netifd: Interface 'n2n' is disabled
Logan007 commented 2 years ago

Could you please run the edge with -vvvvvv to provide more detailed log output, the DEBUG class of messages, as well?

Also, the IP-address you x-ed out, is it a local internal address, or an external public one?

galaxyskyknight commented 2 years ago

External, of course; it is my supernode's IP, and the port is 10086 :) At that time the PPPoE session was not yet complete, so the supernode being unreachable is reasonable.

OK, I will add -vvvvv to the init scripts for debugging next time. Please note: it seems the core dump only happens when the OpenWrt router reboots, during the system init stage. After the reboot I bring the edge up manually, and no core dump is seen for a long time as long as I do not touch any network configuration or change anything.

hamishcoleman commented 2 years ago

I think you said that you compiled your binary with debugging (the -g) option, can you try addr2line -e <your binary> 0x466f14 and send us the result?
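
As a rough sketch of that lookup, the snippet below builds the `addr2line` command line for the address from the trap message and runs it only if the tool is installed. The path `src/edge` is a placeholder for wherever the debug build of the binary lives, the helper names `addr2line_cmd` and `resolve` are hypothetical, and passing the address unchanged assumes the binary is not position-independent.

```python
import shutil
import subprocess


def addr2line_cmd(binary, ip_hex):
    """Build an addr2line command line for one address (hex digits, with or without 0x)."""
    addr = hex(int(ip_hex, 16))
    # -e selects the executable, -f also prints the enclosing function name.
    return ["addr2line", "-f", "-e", binary, addr]


def resolve(binary, ip_hex):
    """Run addr2line if it is installed; return its output text, or None otherwise."""
    if shutil.which("addr2line") is None:
        return None
    result = subprocess.run(
        addr2line_cmd(binary, ip_hex),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


# "src/edge" is a placeholder path, not a statement about the reporter's setup.
print(" ".join(addr2line_cmd("src/edge", "466f14")))
```

For a cross-compiled OpenWrt binary, the `addr2line` shipped with the cross toolchain may be the safer choice; the result is only meaningful if the binary was built with `-g` and is the same build that produced the trap.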

small-5 commented 2 years ago

https://github.com/ntop/n2n/blob/a274818854c01008a4106a44fd5ecd33d14091a4/src/edge_utils.c#L1040 However, when WAN is down or the network is abnormal (Network unreachable), the code does goto err_out and returns -1, and a coredump occurs.

galaxyskyknight commented 2 years ago

I think you said that you compiled your binary with debugging (the -g) option, can you try addr2line -e <your binary> 0x466f14 and send us the result?

I hope this is of great help to you guys. Do you have any assert for the eee pointer? I am just guessing that eee is a NULL pointer at this point. Is there any multi-threaded invocation?

image image

hamishcoleman commented 2 years ago

On Wed, May 18, 2022 at 01:07:06AM -0700, Maha-5 wrote:

https://github.com/ntop/n2n/blob/a274818854c01008a4106a44fd5ecd33d14091a4/src/edge_utils.c#L1040 However, when wan is closed or the network is abnormal, goto err_out and return -1, coredump will occur.

Can you isolate where the coredump happens? I have reviewed the code path that you are describing and can find no unchecked memory accesses that could cause a coredump in the err_out path. It must be happening somewhere further up the call chain, which encompasses quite a large amount of code.

small-5 commented 2 years ago

@hamishcoleman Sorry, I don't have an environment to analyze the core file. I can only be sure that when it returns -1 on OpenWrt, it coredumps. Before 009311d016bf27f40259e6bb992ce4a78af24424, supernode_disconnect(eee) was called only when rc <= 0. In that case, even if WAN is down, it does not coredump.

hamishcoleman commented 2 years ago

On Wed, May 18, 2022 at 02:06:29AM -0700, Maha-5 wrote:

@.*** Sorry, I don't have the environment to analyze the core file. I can only be sure that return -1 on openwrt, it will coredump. Before 009311d, XXX will be called only when rc <= 0. In this case, even if wan is down, coredump will not be called.

The same code can actually be called in both paths, even in the earlier commit you refer to.

Can you compile your binary with debugging? (this is configure CFLAGS=-g) If you can get a coredump from a binary that has debugging enabled then we may be able to take the kernel error message (which has an "ip:" value) and determine where in the source the error is occurring. If you post your debug enabled binary and the resulting coredump then we might be able to extract that information without using your environment.

small-5 commented 2 years ago

https://github.com/ntop/n2n/pull/999 I have tested line 1047 with if(sent != -1) and with if(sent > 0); neither coredumps, but > 0 is two characters shorter. The main reason is that line 1076 needs to return sent.
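
The idea of checking the send result before doing any follow-up work can be sketched as below. This is a Python analogue for illustration only, not the n2n C code: `try_sendto` and `UnreachableSocket` are hypothetical names, and the stand-in socket simply simulates the "Network unreachable" failure seen in the logs.

```python
import errno


def try_sendto(sock, payload, addr):
    """Send one datagram; return the byte count, or -1 on failure (like C sendto())."""
    try:
        return sock.sendto(payload, addr)
    except OSError as exc:
        # Mirrors the WARNING lines in the logs: report the error and carry on.
        print("WARNING: sendto(%s:%d) failed (%d) %s"
              % (addr[0], addr[1], exc.errno, exc.strerror))
        return -1


class UnreachableSocket:
    """Stand-in socket whose sendto always fails with ENETUNREACH."""

    def sendto(self, payload, addr):
        raise OSError(errno.ENETUNREACH, "Network unreachable")


# 192.0.2.1 is a documentation address, used here only as a placeholder.
sent = try_sendto(UnreachableSocket(), b"hello", ("192.0.2.1", 10086))
if sent > 0:
    print("sent", sent, "bytes")
else:
    print("send failed, skipping follow-up work")
```

Both `sent != -1` and `sent > 0` reject the failure case here; they differ only in how a zero-byte send is treated.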

galaxyskyknight commented 2 years ago

@hamishcoleman Have you found any clue for this? Thanks.

hamishcoleman commented 2 years ago

@galaxyskyknight no, I have no clues, I am searching in the dark, I really need help from you to track this down - since you can replicate the issue, I need you to help with some actual debugging. You should try to compile with debug symbols and get a coredump file, a backtrace or an addr2line result. No amount of looking at screenshots of sections of code will find the actual location that is causing the coredump.

hamishcoleman commented 2 years ago

I apologise, I see that one of the tiny screenshots you attached above was an addr2line output. Can you please paste the text so that I can cut and paste the details without any chance of errors. Screenshots are not useful debug tools

galaxyskyknight commented 2 years ago

So does that line of code give you any hint? Is it possible that it is as I guess, that the eee pointer somehow becomes NULL or invalid due to the interface link up/down flapping scenario (probably this pointer is recycled somewhere in another routine because of the interface going down or the sendto failure handler)?

small-5 commented 2 years ago

https://github.com/ntop/n2n/pull/1001 This PR fixed.

Logan007 commented 2 years ago

This PR fixed.

It still is a diagnostic PR to help us understand the issues related to it.

Is it possible that, as I guess, the eee pointer somehow becomes NULL or invalid in the interface link up/down flap scenario (probably because the pointer is recycled in some other routine when the interface goes down, or in the sendto failure handler)?

I have not seen any chance for this to happen so far but I like surprises. Hard to say. @galaxyskyknight could you provide the requested text information to help us fully understand the problem?

galaxyskyknight commented 2 years ago

This PR fixed.

It still is a diagnostic PR to help us understand the issues related to it.

Is it possible that, as I guess, the eee pointer somehow becomes NULL or invalid in the interface link up/down flap scenario (probably because the pointer is recycled in some other routine when the interface goes down, or in the sendto failure handler)?

I have not seen any chance for this to happen so far but I like surprises. Hard to say. @galaxyskyknight could you provide the requested text information to help us fully understand the problem?

I don't understand what text information you require. I have put the addr2line info there; it clearly shows that the coredump happened at edge_utils.c, line 2894, and the code snippet for that line is attached there too. What else are you expecting?

Logan007 commented 2 years ago

What else are you expecting?

Can you please paste the text so that I can cut and paste the details without any chance of errors. Screenshots are not useful debug tools

galaxyskyknight commented 2 years ago

What else are you expecting?

Can you please paste the text so that I can cut and paste the details without any chance of errors. Screenshots are not useful debug tools

What difference does it make to your understanding of the information that I pasted?

small-5 commented 2 years ago

Compiled and tested without https://github.com/ntop/n2n/pull/1001 , at commit a274818854c01008a4106a44fd5ecd33d14091a4

root@Router:~# edge -f -d n2n -l www.test.com:10254 -c test -A4 -k asdfasdfasf -a 10.0.0.100/24 -r -H -vvvvvv
20/May/2022 23:58:05 [n2n.c:288] WARNING: supernode2sock fails to resolve supernode host www.test.com, -3: Try again
20/May/2022 23:58:05 [edge_utils.c:3590] adding supernode = www.test.com:10254
20/May/2022 23:58:05 [edge.c:1112] starting n2n edge 3.1.1 May 19 2022 17:33:00
20/May/2022 23:58:05 [edge.c:1118] using compression: none.
20/May/2022 23:58:05 [edge.c:1119] using ChaCha20 cipher.
20/May/2022 23:58:05 [edge_utils.c:402] number of supernodes in the list: 1
20/May/2022 23:58:05 [edge_utils.c:404] supernode 0 => www.test.com:10254
20/May/2022 23:58:05 [transform_cc20.c:134] setup_cc20_key completed
20/May/2022 23:58:05 [edge_utils.c:437] Header encryption is enabled.
20/May/2022 23:58:05 [edge.c:1143] use manually set IP address
20/May/2022 23:58:05 [edge.c:1161] skip PING to supernode
20/May/2022 23:58:05 [edge_utils.c:314] PMTU discovery disabled
20/May/2022 23:58:05 [edge.c:1225] skip auto IP address asignment
20/May/2022 23:58:05 [tuntap_linux.c:203] Waiting for TAP interface to be up and running...
20/May/2022 23:58:05 [tuntap_linux.c:224] Interface is up and running
20/May/2022 23:58:05 [edge.c:1258] created local tap device IP: 10.0.0.100, Mask: 255.255.255.0, MAC: EA:26:6D:40:E2:70
20/May/2022 23:58:05 [edge.c:1325] WARNING: n2n has not been compiled with libcap-dev; some commands may fail
20/May/2022 23:58:05 [edge.c:1330] dropping privileges to uid=65534, gid=65534
20/May/2022 23:58:05 [edge.c:1356] edge started
20/May/2022 23:58:05 [edge_utils.c:1564] update_supernode_reg: doing fast retry.
20/May/2022 23:58:05 [edge_utils.c:1167] WARNING: failed to bind to local multicast group 224.0.0.68:1968 [errno 19]
20/May/2022 23:58:10 [n2n.c:288] WARNING: supernode2sock fails to resolve supernode host www.test.com, -3: Try again
20/May/2022 23:58:10 [edge_utils.c:2137] Rx TAP packet ( 110) for 33:33:00:00:00:16
20/May/2022 23:58:10 [edge_utils.c:2143] dropping Tx multicast
20/May/2022 23:58:10 [edge_utils.c:1564] update_supernode_reg: doing fast retry.
20/May/2022 23:58:10 [edge_utils.c:1167] WARNING: failed to bind to local multicast group 224.0.0.68:1968 [errno 19]
20/May/2022 23:58:15 [n2n.c:288] WARNING: supernode2sock fails to resolve supernode host www.test.com, -3: Try again
20/May/2022 23:58:15 [n2n.c:604] Purging old registrations
20/May/2022 23:58:15 [n2n.c:609] Remove 0 registrations
20/May/2022 23:58:15 [edge_utils.c:2137] Rx TAP packet ( 110) for 33:33:00:00:00:16
20/May/2022 23:58:15 [edge_utils.c:2143] dropping Tx multicast
20/May/2022 23:58:15 [edge_utils.c:1564] update_supernode_reg: doing fast retry.
20/May/2022 23:58:15 [edge_utils.c:1167] WARNING: failed to bind to local multicast group 224.0.0.68:1968 [errno 19]
20/May/2022 23:58:20 [n2n.c:288] WARNING: supernode2sock fails to resolve supernode host www.test.com, -3: Try again
20/May/2022 23:58:20 [edge_utils.c:1226] send PING to supernodes
20/May/2022 23:58:20 [edge_utils.c:1069] sendto(0.0.0.0:10254) failed (97) Address family not supported by protocol
20/May/2022 23:58:20 [edge_utils.c:352] closed
20/May/2022 23:58:20 [edge_utils.c:1083] error in sendto_fd
Bus error


Logan007 commented 2 years ago

What difference does it make to your understanding of the information that I pasted?

You know, we want to debug the bug you encountered, and maybe even try to reproduce and check it ourselves, because we assume there is some underlying issue here. It would be kind if you could support us with more accessible information (text, not screenshots) so we can debug more easily and perhaps even a bit faster, without the hassle of typing long strings from screenshots, which would add another source of typos and slow things down. Thank you!

small-5 commented 2 years ago

The coredump occurs when binding to the local multicast group fails. In addition, running the process directly does not generate a core file. If it is started through OpenWrt's netifd, a core file is generated.
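
For readers following along, the failing step can be sketched in C as below. This is a minimal sketch under assumptions, not n2n's actual code: `join_multicast_group` is a hypothetical helper, and the only real APIs used are `inet_pton()` and `setsockopt()` with `IP_ADD_MEMBERSHIP`. On Linux, errno 19 is `ENODEV`, which this call can return when no suitable interface is available. The sketch treats the failure as a reportable condition and returns, so the caller can carry on without multicast.

```c
#define _DEFAULT_SOURCE /* for struct ip_mreq under strict -std modes */
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: join an IPv4 multicast group on any interface.
 * Returns 0 on success and -1 on failure; a failure is logged, not fatal. */
int join_multicast_group(int fd, const char *group_ip) {
    struct ip_mreq mreq;

    memset(&mreq, 0, sizeof(mreq));
    if (inet_pton(AF_INET, group_ip, &mreq.imr_multiaddr) != 1) {
        fprintf(stderr, "invalid multicast address %s\n", group_ip);
        return -1;
    }
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);

    if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                   &mreq, sizeof(mreq)) == -1) {
        int err = errno; /* save errno before logging */
        fprintf(stderr, "failed to join multicast group %s [errno %d] %s\n",
                group_ip, err, strerror(err));
        return -1;
    }
    return 0;
}
```

A caller would check the return value and simply skip multicast sends and receives when it is -1.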

galaxyskyknight commented 2 years ago

what's different for you to understand the information that I pasted?

You know, we want to debug the bug you encountered, and maybe even try to reproduce and check it ourselves, because we assume there is some underlying issue here. It would be kind if you could support us with more accessible information (text, not screenshots) so we can debug more easily and perhaps even a bit faster, without the hassle of typing long strings from screenshots, which would add another source of typos and slow things down. Thank you!

That is something for next time! We are not talking about the same thing. Even if I type or copy it again, how much do you think it will really help in digging further into the issue? You have already read and understood it even though it is in screenshot form, right? You can ask me next time to do it the proper way, or the way you like, and I will be glad to do that, but once is enough. I don't think the formality would help you find the root cause much faster even if I pasted it here right now.

galaxyskyknight commented 2 years ago

Every time, it occurs in the same scenario in my OpenWrt log: the supernode (175.*.*.158) changes its hole-punching port, but the local OpenWrt edge somehow treats the sendto() failure as a fatal error, and then the edge coredumps. It does not happen only on interface status flaps. FYI, I don't know what "sendto failed (1) Operation not permitted" means; I googled it and it seems to be related to IPv6? Can you initialize the socket with an IPv4-only bind? I have disabled ipv6/ip6tables on my OpenWrt, and I am not sure whether this causes the "failed (1) Operation not permitted" problem.

@Logan007 @hamishcoleman

Mon May 23 11:08:01 2022 daemon.info n2n[17453]: peer 00:90:10:00:00:01 changed [175.*.*.158:36679] -> [175.*.*.158:59739]
Mon May 23 11:18:11 2022 daemon.info n2n[17453]: WARNING: sendto(175.*.*.158:36679) failed (1) Operation not permitted
Mon May 23 11:18:11 2022 kern.info kernel: [48818.807630] traps: edge[17453] trap stack segment ip:466f14 sp:7fff9bdb6070 error:0 in edge[403000+80000]
Mon May 23 11:18:11 2022 daemon.notice netifd: Network device 'n2n' link is down
Mon May 23 11:18:11 2022 daemon.notice netifd: Interface 'n2n' has link connectivity loss
Mon May 23 11:18:11 2022 daemon.notice netifd: Interface 'n2n' is now down
Mon May 23 11:18:11 2022 daemon.notice netifd: Interface 'n2n' is disabled
Mon May 23 11:18:12 2022 daemon.info vnstatd[11876]: Info: Interface "n2n" disabled.
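
As background to the question about error (1): on Linux, errno 1 is `EPERM` and errno 101 is `ENETUNREACH`. `EPERM` from `sendto()` on a UDP socket is commonly reported when a local firewall rule drops the outgoing packet, which would match the firewall reload mentioned at the top of this issue, though that is an assumption here rather than a confirmed diagnosis. The C sketch below shows one way a sender could separate "log and carry on" errors from ones where the socket itself is unusable. The enum and function names are hypothetical, and the grouping is a design suggestion, not n2n's behaviour.

```c
#include <errno.h>

/* Hypothetical verdicts, for illustration only. */
enum send_verdict {
    SEND_RETRY_LATER,   /* transient: log it and keep running */
    SEND_REOPEN_SOCKET  /* the descriptor itself looks unusable */
};

/* Map an errno value from a failed sendto() to a verdict. */
enum send_verdict classify_send_errno(int err) {
    switch (err) {
    case EPERM:         /* (1) e.g. a local firewall rule dropped the packet */
    case ENETUNREACH:   /* (101) no route yet, e.g. interface still coming up */
    case EHOSTUNREACH:
    case EAFNOSUPPORT:  /* (97) e.g. destination address not filled in yet */
    case EAGAIN:
    case EINTR:
    case ENOBUFS:
        return SEND_RETRY_LATER;
    case EBADF:
    case ENOTSOCK:
    default:
        return SEND_REOPEN_SOCKET;
    }
}
```

Either verdict leaves the process running; neither path needs to be fatal.
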
galaxyskyknight commented 2 years ago

The coredump occurs when binding to the local multicast group fails. In addition, running the process directly does not generate a core file. If it is started through OpenWrt's netifd, a core file is generated.

It seems to be the same for me.

galaxyskyknight commented 2 years ago

Hello, is anyone still working on this issue? Thanks.

Logan007 commented 2 years ago

I think the status is still "open".

galaxyskyknight commented 2 years ago

After many days of observation, I am pretty sure there was no coredump issue before commit f3e305b254fc88ce829ddb4a63b11e083a65c3ab (April 11th).

galaxyskyknight commented 2 years ago

It still coredumps on the latest code in the same place:

Mon Jun 27 18:34:25 2022 daemon.notice netifd: Interface 'n2n' is enabled
Mon Jun 27 18:34:25 2022 daemon.notice netifd: Interface 'n2n' is setting up now
Mon Jun 27 18:34:25 2022 daemon.notice netifd: Interface 'n2n' is now up
Mon Jun 27 18:34:25 2022 daemon.notice netifd: Network device 'n2n' link is up
Mon Jun 27 18:34:25 2022 daemon.notice netifd: Interface 'n2n' has link connectivity
Mon Jun 27 18:34:25 2022 daemon.info n2n[12531]: WARNING: running as root is discouraged, check out the -u/-g options
Mon Jun 27 18:34:25 2022 daemon.info n2n[12531]: edge started
Mon Jun 27 18:34:25 2022 daemon.info n2n[12531]: WARNING: failed to bind to local multicast group 224.0.0.68:1968 [errno 19]
Mon Jun 27 18:34:25 2022 daemon.info n2n[12531]: WARNING: sendto(175...*:10086) failed (101) Network unreachable
Mon Jun 27 18:34:25 2022 kern.info kernel: [ 47.222356] traps: edge[12531] trap stack segment ip:419dea sp:7ffefd75e860 error:0 in edge[403000+7e000]
Mon Jun 27 18:34:25 2022 daemon.notice netifd: Network device 'n2n' link is down
Mon Jun 27 18:34:25 2022 daemon.notice netifd: Interface 'n2n' has link connectivity loss
Mon Jun 27 18:34:25 2022 daemon.notice netifd: Interface 'n2n' is now down
Mon Jun 27 18:34:25 2022 daemon.notice netifd: Interface 'n2n' is disabled
Mon Jun 27 18:34:27 2022 daemon.notice procd: /etc/rc.d/S99n2n_v2: SIOCADDRT: Network unreachable
Mon Jun 27 18:34:27 2022 daemon.notice procd: /etc/rc.d/S99n2n_v2: SIOCADDRT: Network unreachable
Mon Jun 27 18:34:27 2022 daemon.notice procd: /etc/rc.d/S99n2n_v2: SIOCADDRT: Network unreachable

builder@Build-Server:~/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-439dfc68865a286c48c79672a350cd467da38799$ addr2line -e edge 419dea
/home/builder/lede_x86/build_dir/target-x86_64_musl/n2n-3.1.1_dev_git-439dfc68865a286c48c79672a350cd467da38799/src/edge_utils.c:2872
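
Since the same step (kernel trap line, then addr2line on the ip) recurs throughout this thread, here is a small C sketch that pulls the relevant fields out of a "traps:" line as text. The struct and function names are hypothetical; only the line layout is taken from the logs pasted above. It extracts the ip and the `[base+size]` mapping and checks that the ip lies inside that mapping. In this thread the raw ip value was passed directly to addr2line.

```c
#include <stdio.h>
#include <string.h>

/* Fields of a kernel "traps:" line. Names are hypothetical; only the
 * line layout comes from the logs in this thread. */
struct trap_info {
    unsigned long long ip;   /* faulting instruction pointer */
    unsigned long long base; /* start of the mapping named in the line */
    unsigned long long size; /* size of that mapping */
};

/* Returns 0 on success, -1 if the line does not match the expected layout. */
int parse_trap_line(const char *line, struct trap_info *out) {
    unsigned long long sp, error;
    char name[64];
    const char *p = strstr(line, "ip:");

    if (p == NULL) {
        return -1;
    }
    /* e.g. "ip:419dea sp:7ffefd75e860 error:0 in edge[403000+7e000]" */
    if (sscanf(p, "ip:%llx sp:%llx error:%llx in %63[^[][%llx+%llx]",
               &out->ip, &sp, &error, name, &out->base, &out->size) != 6) {
        return -1;
    }
    return 0;
}

/* Returns 1 if the ip lies inside the reported mapping, 0 otherwise. */
int trap_ip_in_mapping(const struct trap_info *t) {
    return t->ip >= t->base && t->ip < t->base + t->size;
}
```

Printing `ip` with `%llx` gives the value to hand to `addr2line -e edge`, as in the command above.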

[Screenshot attached]

galaxyskyknight commented 2 years ago

I am tired of this issue. Is there any workaround? There seems to be an invalid memory access on the 'eee' structure in some corner case within this line of code.

galaxyskyknight commented 2 years ago

It looks like this commit fixes the coredump for now. I will keep an eye on it for a couple of days to see whether it happens again. Thanks for the effort. @hamishcoleman

https://github.com/ntop/n2n/commit/06c489fd8ad42d6c025beaea0fd62d7d4d948c31