utoni / nDPId

Tiny nDPI based deep packet inspection daemons / toolkit.
GNU General Public License v3.0

OpenWRT: Unexpected flood of daemon events ("daemon_event_name":"status") right after start... #11

Closed verzulli closed 2 years ago

verzulli commented 2 years ago

After upgrading nDPId on my OpenWRT box (built on commit 36f1786), I'm observing unexpected behaviour: right after launching the daemon with:

/usr/sbin/nDPId-testing -i br-lan -c 192.168.0.128:9999 -o max-packets-per-flow-to-send=0

where 192.168.0.128 is my local notebook, properly listening on UDP/9999, I receive a STORM of daemon events. In 5 to 10 seconds, I got 16358 JSONs, and only 13 of them were related to "flows". The other 16345 were daemon-related, with ~99% of them "status". Is this expected behaviour?

I've attached the above 16358 JSONs, should you need to check them. Again: they were received in ~10 seconds (maybe less) raw_json.zip

utoni commented 2 years ago

The timestamps delivered by libpcap are wrong. Which libpcap version are you using?

// EDIT: Seems like an uint64 underflow on my side. Will investigate.

utoni commented 2 years ago

Did you build nDPId with the OpenWrt Makefile provided in packages/openwrt?

verzulli commented 2 years ago

Did you build nDPId with the OpenWrt Makefile provided in packages/openwrt?

Yes. I successfully built the OpenWRT package following these instructions.

(Note: I used the build environment built around the openwrt-sdk-21.02.3-ipq40xx-generic_gcc-8.4.0_musl_eabi.Linux-x86_64.tar SDK. But I don't think this matters much...)

utoni commented 2 years ago

Strange. I am trying to reproduce this.

verzulli commented 2 years ago

Strange. I am trying to reproduce this.

I'm on OWRT 21.02.3. I'll try rebuilding the package "from scratch" once more. Should the behaviour persist, I'll try upgrading to OWRT 22.03.0 (released earlier this month).

Will keep you updated.

verzulli commented 2 years ago

I just rebuilt the OpenWRT package FROM SCRATCH:

I got a nDPId-testing_1.0-5_arm_cortex-a7_neon-vfpv4.ipk that was slightly different from the previous one (1056936 bytes vs. 1055494).

I installed it on the OpenWRT box and... behaviour was the same.

BUT....

I also noted that the problem seems to happen when I launch it without -d (running it in the foreground). If I run it in the background (with -d), it starts the flood but then suddenly stops and begins working normally.

I also noted that, when running with -d, the PID assigned to /usr/sbin/nDPId-testing changes: it seems it forks and dies...

root@GW-Big-Brother:~# ps | grep -i ndpid
28538 root     41080 D    /usr/sbin/nDPId-testing -i br-lan -c 192.168.0.128:9999 -d -o max-packets-per-flow-to-send=
28540 root      1100 R    grep -i ndpid
root@GW-Big-Brother:~# ps | grep -i ndpid
28538 root     76744 R    /usr/sbin/nDPId-testing -i br-lan -c 192.168.0.128:9999 -d -o max-packets-per-flow-to-send=
28542 root      1100 S    grep -i ndpid
root@GW-Big-Brother:~# ps | grep -i ndpid
28538 root      110m R    /usr/sbin/nDPId-testing -i br-lan -c 192.168.0.128:9999 -d -o max-packets-per-flow-to-send=
28544 root      1100 R    grep -i ndpid
root@GW-Big-Brother:~# ps | grep -i ndpid
28546 nobody    125m S    /usr/sbin/nDPId-testing -i br-lan -c 192.168.0.128:9999 -d -o max-packets-per-flow-to-send=
28564 root      1100 S    grep -i ndpid
root@GW-Big-Brother:~# ps | grep -i ndpid
28546 nobody    125m S    /usr/sbin/nDPId-testing -i br-lan -c 192.168.0.128:9999 -d -o max-packets-per-flow-to-send=
28566 root      1100 R    grep -i ndpid
root@GW-Big-Brother:~# ps | grep -i ndpid
28546 nobody    125m S    /usr/sbin/nDPId-testing -i br-lan -c 192.168.0.128:9999 -d -o max-packets-per-flow-to-send=
28568 root      1100 R    grep -i ndpid
root@GW-Big-Brother:~# ps | grep -i ndpid
28546 nobody    125m S    /usr/sbin/nDPId-testing -i br-lan -c 192.168.0.128:9999 -d -o max-packets-per-flow-to-send=
28570 root      1100 R    grep -i ndpid

As you can see, it spawned as 28538 and, a few seconds later, became 28546. Currently, after ~10 minutes, it's still 28546 and running fine.

Could the problem be related to the missing -d?

I'll continue observing the whole thing and... keep you updated.

utoni commented 2 years ago

I also noted that the problem seems to happen when I launch it without -d (running it in the foreground). If I run it in the background (with -d), it starts the flood but then suddenly stops and begins working normally.

Can you confirm that you'll get those weird timestamps only while running in daemon mode?

Could the problem be related to the missing of -d?

Unlikely; IMHO the timestamps are not related to daemon mode in any case.

I'll continue observing the whole thing and... keep you updated.

Btw.: Did you test nDPId-test in a QEMU VM or on a real device?

utoni commented 2 years ago

I tried to build & use nDPId-testing with the current OpenWrt master branch. I've built an x86 OpenWrt image with only nDPId-testing and a *.vdi target image enabled. I used that image together with VirtualBox, configured the network in the guest and ran: /usr/sbin/nDPId-testing -d -c 192.168.0.20:7000 -o max-packets-per-flow-to-send=0. It worked; the timestamps are valid and not broken. Can you try this and see if it works for you? That would probably help in finding the error.

utoni commented 2 years ago

Another thing that might help me: could you upload the OpenWrt build logs of libpcap and nDPId-testing?

verzulli commented 2 years ago

Btw.: Did you test nDPId-test in a QEMU VM or on a real device?

I'm running nDPId-test on a REAL device: a GL-iNet GL-B1300 router, with OpenWrt 21.02.3, r16554-1d4dea6d4f.

verzulli commented 2 years ago

Can you confirm that you'll get those weird timestamps only while running in daemon mode?

As far as I can see:

But I cannot be sure (I cannot do enough tests to be 100% sure) :-(

verzulli commented 2 years ago

Another thing that might help me: could you upload the OpenWrt build logs of libpcap and nDPId-testing?

I just went through the WHOLE rebuild-from-scratch process. Below you can find a terminal transcript of the preparation steps (up to make) and of the build process (from make on):

preparation.txt build.txt

This just shed light on something: I ALWAYS uploaded to OWRT ONLY the nDPId-testing package... and NOT the libpcap1 or zlib ones, as they were already available on my box...

...but they are DIFFERENT versions, right?

I'm going to reinstall all three packages and let you know...

verzulli commented 2 years ago

I just reinstalled both libpcap1_1.9.1-3.1_arm_cortex-a7_neon-vfpv4.ipk and nDPId-testing_1.0-5_arm_cortex-a7_neon-vfpv4.ipk, and now everything seems to be working.

I'll keep checking everything for a few days, with several restarts... and let you know the results.

Sorry for being so dumb as to not take care to update libpcap as well... :-(

utoni commented 2 years ago

I just reinstalled both libpcap1_1.9.1-3.1_arm_cortex-a7_neon-vfpv4.ipk and nDPId-testing_1.0-5_arm_cortex-a7_neon-vfpv4.ipk, and now everything seems to be working.

Is that the same libpcap version that gets shipped with openwrt-sdk-21.02.3-ipq40xx-generic_gcc-8.4.0_musl_eabi.Linux-x86_64.tar? If so, it should actually work without reinstalling libpcap. This is strange. I will see if I can add an error or warning if the libpcap version nDPId was compiled/linked against differs from the runtime version.

I'll keep checking everything for a few days, with several restart... and let you know the results.

Sorry for being so dumb to not take care to update libpcap as well.... :-(

You're not dumb. TBH, I don't know what the issue is or was either.

verzulli commented 2 years ago

Is that the same libpcap version that gets shipped with openwrt-sdk-21.02.3-ipq40xx-generic_gcc-8.4.0_musl_eabi.Linux-x86_64.tar? If so, it should actually work without reinstalling libpcap.

Of course, you are right. The libpcap package is built from OpenWRT sources, so it's exactly what's running inside my OpenWRT box. I was wrong: I was thinking of libnDPI, which is included from a specific source/commit during the build. But since it's bundled inside nDPId-testing, libnDPI doesn't produce a separate ad-hoc .ipk OWRT package.

Sorry.

verzulli commented 2 years ago

Unfortunately, the flood-problem persists.

Yesterday I ran the software this way (in the foreground, and then stopped it with ^C):

root@GW-Big-Brother:~# /usr/sbin/nDPId-testing -i br-lan -c 192.168.0.128:9999 -o max-packets-per-flow-to-send=0
nDPId: No instance alias given, using your hostname 'GW-Big-Brother'
----------------------------------
nDPI version: 4.5.0
 API version: 48566
pcap version: 1.9.1 (with TPACKET_V3)
----------------------------------
gcrypt version: 1.8.6internal
----------------------------------
^C------------------------------------ Stopping reader threads
------------------------------------ Processing remaining flows
------------------------------------ Results
Stopping Thread  0, processed 266000 packets, 167033739 bytes
    skipped flows.....:        0, processed flows:   168487, idle flows....:   168487
    not detected flows:   167031, guessed flows..:      818, detected flows:      641
    detection updates.:      238, updated flows..:     1663
Stopping Thread  1, processed 306928 packets, 262903923 bytes
    skipped flows.....:        0, processed flows:   102889, idle flows....:   102889
    not detected flows:   101397, guessed flows..:      718, detected flows:      775
    detection updates.:      144, updated flows..:      979
Stopping Thread  2, processed 18683 packets, 7511917 bytes
    skipped flows.....:        0, processed flows:     2431, idle flows....:     2431
    not detected flows:     1120, guessed flows..:      627, detected flows:      686
    detection updates.:      241, updated flows..:     1541
Stopping Thread  3, processed 23416 packets, 20625367 bytes
    skipped flows.....:        0, processed flows:     2401, idle flows....:     2401
    not detected flows:      254, guessed flows..:     1652, detected flows:      497
    detection updates.:      161, updated flows..:  1229624
Stopping Thread  4, processed 158297 packets, 94580056 bytes
    skipped flows.....:        0, processed flows:    99580, idle flows....:    99580
    not detected flows:    98089, guessed flows..:      770, detected flows:      721
    detection updates.:      275, updated flows..:     1642
Stopping Thread  5, processed 286529 packets, 172160835 bytes
    skipped flows.....:        0, processed flows:   192556, idle flows....:   192556
    not detected flows:   191591, guessed flows..:      572, detected flows:      394
    detection updates.:      136, updated flows..:      899
Stopping Thread  6, processed 148943 packets, 91502090 bytes
    skipped flows.....:        0, processed flows:    98657, idle flows....:    98657
    not detected flows:    97291, guessed flows..:      627, detected flows:      740
    detection updates.:      289, updated flows..:     1809
Stopping Thread  7, processed 156934 packets, 94238252 bytes
    skipped flows.....:        0, processed flows:   109650, idle flows....:   109650
    not detected flows:   107721, guessed flows..:     1527, detected flows:      402
    detection updates.:      140, updated flows..:      732
Stopping Thread  8, processed 240138 packets, 192616304 bytes
    skipped flows.....:        0, processed flows:    97885, idle flows....:    97885
    not detected flows:    96102, guessed flows..:      750, detected flows:     1034
    detection updates.:      252, updated flows..:     1704
Stopping Thread  9, processed 365571 packets, 230981799 bytes
    skipped flows.....:        0, processed flows:   227795, idle flows....:   227795
    not detected flows:   226409, guessed flows..:      816, detected flows:      574
    detection updates.:      160, updated flows..:     1000
Stopping Thread 10, processed 153480 packets, 92991154 bytes
    skipped flows.....:        0, processed flows:   101175, idle flows....:   101175
    not detected flows:    99869, guessed flows..:      628, detected flows:      680
    detection updates.:      284, updated flows..:     4799
Stopping Thread 11, processed 13485 packets, 7893745 bytes
    skipped flows.....:        0, processed flows:     6978, idle flows....:     6978
    not detected flows:     3949, guessed flows..:     2459, detected flows:      574
    detection updates.:      170, updated flows..:      847
Stopping Thread 12, processed 232377 packets, 184852452 bytes
    skipped flows.....:        0, processed flows:    96932, idle flows....:    96932
    not detected flows:    95650, guessed flows..:      583, detected flows:      700
    detection updates.:      253, updated flows..:     1788
Stopping Thread 13, processed 143480 packets, 102131690 bytes
    skipped flows.....:        0, processed flows:    86951, idle flows....:    86951
    not detected flows:    85477, guessed flows..:      728, detected flows:      747
    detection updates.:      142, updated flows..:      732
Stopping Thread 14, processed 265783 packets, 217180620 bytes
    skipped flows.....:        0, processed flows:    95751, idle flows....:    95751
    not detected flows:    94098, guessed flows..:      912, detected flows:      743
    detection updates.:      305, updated flows..:     1769
Stopping Thread 15, processed 8639 packets, 4419421 bytes
    skipped flows.....:        0, processed flows:     3262, idle flows....:     3262
    not detected flows:     1604, guessed flows..:     1063, detected flows:      602
    detection updates.:      142, updated flows..:      878
Total packets captured.......: 2739976
Total packets processed......: 2788683
Total layer4 payload size....: 1943623364
Total flows ignopred.........: 0
Total flows processed........: 1493380
Total flows timed out........: 1493380
Total flows detected.........: 10510
Total flows guessed..........: 15250
Total flows not detected.....: 1467652
Total flow updates...........: 1252406
Total flow detections updates: 3332
root@GW-Big-Brother:~# 

All of the related JSONs received by my analyzer have been dumped (by the analyzer) to a raw-JSON file. This:

[verzulli@XPSGarr tmp]$ ls -l *json*
-rw-r--r-- 1 verzulli verzulli 1297698451 29 set 19.57 raw_json.stream

and inside there are 2,035,351 JSONs in total, 1,721,317 of them status-related:

[verzulli@XPSGarr tmp]$ wc -l raw_json.stream
2035351 raw_json.stream
[verzulli@XPSGarr tmp]$ grep '"daemon_event_name":"status"' raw_json.stream | wc -l
1721317
[verzulli@XPSGarr tmp]$

When I queried my analyzer, it responded this way:

nDPId-rt-analyzer> show counter
=> main `analyzer` counters:
----------
{
  "in": 1968312,
  "ok": 1968264,
  "err": 48,
  "evt": {
    "flows": 272802,
    "pkt": 313,
    "others": 1695197
  }
}
----------
=> specific `parser` counters:
----------
{
  "flows": {
    "new": 69266,
    "detected": 4473,
    "not-detected": 67244,
    "detection-update": 3111,
    "update": 61537,
    "guessed": 1445,
    "end": 1296,
    "idle": 64430
  },
  "packets": {
    "packet": 313
  }
}
----------
=> FLOW TABLE SIZES:
=> Current Global table size...: [ 121626] flows
=> Current Completed table size: [  65527] flows

----------
started on: Thu Sep 29 2022 18:15:29 GMT+0200 (Central European Summer Time) - uptime: 6095 seconds [102 minutes]
nDPId-rt-analyzer> 
nDPId-rt-analyzer> exit

I bet the "others": 1695197 are mostly the status events we're discussing...

Due to size constraints (uploads here are limited to 25 MB), I've created a file with the first 100,000 + last 20,000 JSONs. Here it is:

[verzulli@XPSGarr tmp]$ wc -l raw_json.stream.head_100000_plus_tail_20000.stream
120000 raw_json.stream.head_100000_plus_tail_20000.stream
[verzulli@XPSGarr tmp]$

You can find it, zipped, below: raw_json.stream.head_100000_plus_tail_20000.stream.zip

As a side note: if you could expand a little on the reasoning behind such status daemon events and, also, on the way they should be fired (you mentioned "timestamp", but... sorry!... I haven't understood :-( ), I'd appreciate it :-)

utoni commented 2 years ago

I am highly motivated to find the cause of the issue.

As a side note: if you could expand a little on the reasoning behind such status daemon events and, also, on the way they should be fired (you mentioned "timestamp", but... sorry!... I haven't understood :-( ), I'd appreciate it :-)

Let me try to explain what is going on here. Those daemon status events are sent every interval (default: 600 seconds) to regularly notify applications about nDPId usage, e.g. the number of skipped/active flows and compression statistics; it is also planned to send details about CPU/memory. Since the interval is calculated using timestamps with microsecond resolution, it is important that those are valid. But that is not the case here: we are getting timestamps that look like an uint64 underflow happened. All timestamps come from libpcap and are converted by libndpi. For now, I can not tell if the issue is related to:

  1. libpcap
  2. libndpi
  3. nDPId

So I need your help to figure out what's going wrong here. I will prepare and upload a small patchset ASAP. It would be nice if you could apply it, recompile nDPId, and upload the resulting logfile.

utoni commented 2 years ago

Hi @verzulli, can you please patch nDPId and start it with something like:

sudo env TXTLOG=/tmp/bla.log ./nDPId

and upload the logfile here? I guess you do not have to run nDPId for a long time, since from what I saw, this issue happens right after startup.

Patchfile: nDPId.patch.txt

verzulli commented 2 years ago

Here we are. I just built the patched nDPId-testing package and re-ran it on my OpenWRT box. I needed to apply some tricks (namely: declaring a FIFO file as the local log target + attaching an ncat reading instance to it to transfer the content to my local PC... as I have only a few megabytes of local storage available on my box), but I'm sure that EVERYTHING WAS OK.

I ran the daemon (and the analyzer, dumping the JSONs) and, strangely, everything looked normal, with no flood arising within the first few minutes. I went away, came back after a bit more than two hours and... BINGO, the flood was running.

I stopped the daemon and retrieved results.

In the meantime, I collected what you required in a ndpid.log file, plus the JSON-stream dump (should you need it). As they are both HUGE (1.4G and 4G respectively), I bzip2'ed them and uploaded the results to a public MEGA folder.

Probably what you're searching for is around line 9,596,563 of ndpid.log :-)

utoni commented 2 years ago

Can you please try patch nDPId.patch.txt and give me feedback on whether it fixes your issue?

verzulli commented 2 years ago

I took the chance to slightly edit your last patch to still fetch and print the ndpi_timeval_to_microseconds(header->ts) result into a dedicated variable, time_us_ndpi, so as to clearly check mismatches between your recalculated time_us and the previous time_us_ndpi value. The patch is here: ndpid_dv.patch.txt

With this patch in place, I reran the whole thing and generated a new log.

As you can see from the ndpid.log included in the attached zip, around line ~1,709,087 you have this:

header->ts..: 1664624094,964175
time_us.....: 1664624094964175
time_us_ndpi: 2145216975
globaltime..: 1664624094964175
header->ts..: 1664624094,964175
time_us.....: 1664624094964175
time_us_ndpi: 2145216975
globaltime..: 1664624094964175
header->ts..: 1664624094,964717
time_us.....: 1664624094964717
time_us_ndpi: 2145217517
globaltime..: 1664624094964717
header->ts..: 1664624098,889104
time_us.....: 1664624098889104
time_us_ndpi: 18446744071563726224
globaltime..: 1664624098889104
header->ts..: 1664624098,889243
time_us.....: 1664624098889243
time_us_ndpi: 18446744071563726363
globaltime..: 1664624098889243
header->ts..: 1664624098,352396
time_us.....: 1664624098352396
time_us_ndpi: 18446744071563189516
globaltime..: 1664624098352396
header->ts..: 1664624098,352538
time_us.....: 1664624098352538
time_us_ndpi: 18446744071563189658
globaltime..: 1664624098352538
header->ts..: 1664624098,352731
time_us.....: 1664624098352731
time_us_ndpi: 18446744071563189851
globaltime..: 1664624098352731
header->ts..: 1664624098,352894
time_us.....: 1664624098352894
time_us_ndpi: 18446744071563190014
globaltime..: 1664624098352894

clearly showing the "jump" of time_us_ndpi. So we can safely state that the problem arises inside the function ndpi_timeval_to_microseconds.

But we don't care anymore, as -- thanks to your modification -- calculating time_us directly solves the problem.

Thank you for the support. Now I can finally get back to my "JSON-stream" analyzer :-)

final_data.zip

utoni commented 2 years ago

I will open a PR on https://github.com/ntop/nDPI.git ASAP to fix this issue.

utoni commented 2 years ago

Hi @verzulli, can you try to rebuild nDPId-testing again to verify that the issue was fixed in the nDPI dev branch?

verzulli commented 2 years ago

Hi @verzulli, can you try to rebuild nDPId-testing again to verify that the issue was fixed in the nDPI dev branch?

Done!

I've rebuilt the OpenWRT package, uploaded and installed it on my OpenWRT box, and ran it, sending flows to my local Linux box.

Currently, after ~99 minutes, these are the results:

nDPId-rt-analyzer> show counter
=> main `analyzer` counters:
----------
{
  "in": 46325,
  "ok": 46315,
  "err": 10,
  "evt": {
    "flows": 45497,
    "pkt": 326,
    "others": 502
  }
}
----------
=> specific `parser` counters:
----------
{
  "flows": {
    "new": 7372,
    "detected": 7238,
    "detection-update": 7219,
    "guessed": 154,
    "update": 16427,
    "not-detected": 18,
    "end": 2606,
    "idle": 4463
  },
  "packets": {
    "packet": 326
  }
}
----------
=> FLOW TABLE SIZES:
=> Current Global table size...: [    333] flows
=> Current Completed table size: [   7026] flows

----------
started on: Thu Oct 13 2022 10:51:50 GMT+0200 (Central European Summer Time) - uptime: 5965 seconds [99 minutes]
nDPId-rt-analyzer>

I think you solved it! Thanks!

utoni commented 2 years ago

<3