Closed igorribeiroduarte closed 3 years ago
could you try setting PCAP_NS instead of PCAP as TimestampFormat and see if you are still able to reproduce the issue?
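For reference, on Napatech adapters this is configured in ntservice.ini. A sketch of the relevant line (the section name and exact placement are assumptions and may vary across driver versions; the TimestampFormat key and the PCAP/PCAP_NS values are the ones referenced above):

```ini
[System]
# PCAP_NS uses nanosecond-resolution PCAP timestamps instead of
# microsecond-resolution PCAP timestamps
TimestampFormat = PCAP_NS
```

A driver restart is typically required for ntservice.ini changes to take effect.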
I couldn't say:
Ok got it, let me investigate this more deeply then.
@igorribeiroduarte I was investigating and I realized you are using n2disk1g with Napatech, which is not a common configuration, as the 1g version does not support chunk-mode. I will investigate this.
It happened again, but this time I have some logs for you: logs_n2disk.txt
There are some gaps in the logs. I'm not sure about the reason, but it could be the container orchestrator (we use Docker Swarm) having problems restarting the n2disk service.
As you can see, the first disk (disco07) seems to have been erased after the "Unable to write into file" error, and the same happened to the other disks a few minutes later.
Also, n2disk seems to have erased those invalid index folders:
ls -hal index_folder
total 16K
drwxr-xr-x 4 user root 4.0K May 13 20:23 .
drwxr-xr-x 3 user docker 4.0K Dec 16 16:00 ..
drwxr-x--- 3 user docker 4.0K Dec 16 18:02 2019
drwxr-x--- 4 user docker 4.0K May 13 20:24 2020
I had a look at the log; it seems n2disk is failing to create files due to "ERROR: Unable to write into file /disco06/.. [No such file or directory]", which means the folder does not exist at the time it tries to dump the file. You said "the first disk (disco07) seems to be erased": was it completely empty? Are you sure it is n2disk erasing the whole disk?
A more verbose output could also help, please add -v.
As you can see below, the only folders n2disk didn't erase are 1585290220.750124 and 1586538427.963111, and this happened because both of these folders have a remaining .tmp file and n2disk only erases .pcap files, right?
ls -hal /disco05/
total 1.5M
drwxr-xr-x 4 user root 124K May 14 08:39 .
drwxr-xr-x 31 root root 4.0K Dec 26 15:06 ..
drwxr-x--- 2 user docker 1.2M May 14 08:39 1589412259.992738
drwxr-x--- 2 user docker 268K May 14 11:44 1589456325.900914
ls -hal /disco06/
total 2.4M
drwxr-xr-x 5 user root 124K May 14 08:39 .
drwxr-xr-x 31 root root 4.0K Dec 26 15:06 ..
drwxr-x--- 2 user docker 952K Apr 13 09:37 1585290220.750124
drwxr-x--- 2 user docker 1.1M May 14 08:39 1589412263.883094
drwxr-x--- 2 user docker 256K May 14 11:44 1589456326.330455
ls -hal /disco07/
total 166M
drwxr-xr-x 5 user root 112K May 14 08:39 .
drwxr-xr-x 31 root root 4.0K Dec 26 15:06 ..
drwxr-x--- 2 user docker 552K Apr 13 09:14 1586538427.963111
drwxr-x--- 2 user docker 1.1M May 14 08:39 1589412266.310233
drwxr-x--- 2 user docker 256K May 14 11:44 1589456328.376376
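The suspected behavior described above can be sketched as follows. This is a hypothetical illustration of the hypothesis, not n2disk's actual code: a timestamp-named dump folder is skipped by cleanup only while it still contains an in-progress .tmp file.

```python
import os

def folders_eligible_for_cleanup(storage_root):
    """Hypothetical sketch: return dump folders that contain no
    .tmp files and are therefore candidates for rotation/erasure,
    oldest first (folder names are epoch timestamps)."""
    eligible = []
    for name in sorted(os.listdir(storage_root)):
        folder = os.path.join(storage_root, name)
        if not os.path.isdir(folder):
            continue
        # A remaining .tmp file marks the folder as still in use,
        # so it would be skipped by the cleanup pass.
        if any(f.endswith(".tmp") for f in os.listdir(folder)):
            continue
        eligible.append(folder)
    return eligible
```

Under this model, 1585290220.750124 and 1586538427.963111 would be excluded from cleanup, matching the listings above.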
Also, what I said about disco07 being erased was based on n2disk's storage logs from before and after the "Unable to write into file" error. The logs below were shown after the restart caused by the "Unable to write into file" error on disco07:
13/May/2020 20:01:22 [n2disk.c:2680] Storage /disco07: 0.00 GB in use
13/May/2020 20:01:24 [n2disk.c:2680] Storage /disco06: 11839.78 GB in use
13/May/2020 20:01:25 [n2disk.c:2680] Storage /disco05: 13181.03 GB in use
And these were shown after the restart caused by the same error, but on disco05 and disco06 (at 20:23:39 and 20:23:59):
13/May/2020 20:24:28 [n2disk.c:2680] Storage /disco05: 0.00 GB in use
13/May/2020 20:24:28 [n2disk.c:2680] Storage /disco07: 0.00 GB in use
13/May/2020 20:24:28 [n2disk.c:2680] Storage /disco06: 0.00 GB in use
I can change the log level, but it will take a long time for this problem to happen again, because it only happens after the disks are full.
@cardigliano , It happened again. Now I have more verbose logs:
In this first log, you can see at line 17 that disco07 had 13197.50GB in use at 18:00, and after the cleanup only 7613.68GB remained. Almost half of the pcaps were deleted: disco07_from_13T_to_7T.txt
Now, in this second log, you can see at line 6 that disco05 had 13198GB in use at 22:09, disco06 had 13141GB (line 17), and disco07 had 8018GB (line 466). At the end of the log, disco05 and disco07 were completely erased and disco06 had only 4124GB of pcaps remaining: disco05_from13T_to_0T.txt
In this second log, I can also see that the n2disk container died during the cleanup, possibly due to some instability in our stack, but I don't think that should be making n2disk delete all these pcaps.
I pushed a fix that could address this, and added more info to the logs to check what is causing this in case it happens again. A new build will be available soon.
It happened again even after updating:
Update: a workaround has been provided for this to avoid affecting file rotation, however it seems this was due to bad timestamps and we are still investigating.
I'm facing the following problem with n2disk1g v.3.4.200207 (r5184): for some reason I still don't know, part of the traffic read by n2disk is arriving with invalid timestamps, for example "4102363817.1799190". I confirmed with Wireshark that this is the received timestamp. This leads to some n2disk indexes being created in a wrong way, with a future date:
At first this wouldn't be a problem and I could just ignore those invalid indexes, since they rarely happen. But sometimes all my pcaps (42TB of data) are deleted, and I believe this is done by n2disk during its automatic rotation. I'd like to know whether these deleted pcaps could be related to the timestamp problem. I imagine that, depending on how this rotation is done, if n2disk deletes all indexes previous to 2099, for example, this could cause this loss of data.
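To see why the example timestamp is bogus, note that 4102363817 seconds after the Unix epoch falls in the year 2099. A simple sanity check of the kind that could flag such packets (a hypothetical sketch, not part of n2disk; the two-day skew allowance is an arbitrary assumption):

```python
import datetime

def is_plausible_timestamp(ts_sec, max_skew_days=2):
    """Hypothetical sanity check: reject epoch timestamps that are
    non-positive or lie beyond now plus a small clock-skew margin."""
    now = datetime.datetime.now(datetime.timezone.utc).timestamp()
    return 0 < ts_sec <= now + max_skew_days * 86400

# The bogus timestamp from the report decodes to late 2099:
bad = 4102363817
print(datetime.datetime.fromtimestamp(bad, datetime.timezone.utc))
# → 2099-12-31 01:30:17+00:00
print(is_plausible_timestamp(bad))  # → False
```

If rotation sorts dump folders or index entries purely by timestamp, a single 2099-dated entry would make every legitimately dated entry look "older" and thus deletable first, which is consistent with the mass deletions described.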
I'm using the following command to run n2disk:
n2disk1g -I -P /var/run/n2disk/n2disk.pid -A index_folder -p 1024 -b 1024 -i nt:stream0 -n 5000 -m 5000 --disk-limit 93% -t 15 -o /disco05 -o /disco06 -o /disco07
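The --disk-limit 93% option above means rotation kicks in once a storage volume passes that usage threshold. A minimal sketch of such a check (a hypothetical illustration of the threshold logic, not n2disk's implementation):

```python
import shutil

def over_disk_limit(path, limit_pct=93.0):
    """Hypothetical sketch: return True when used space on the
    volume holding `path` meets or exceeds `limit_pct` percent,
    the point at which old dump files would start being rotated out."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * (usage.total - usage.free) / usage.total
    return used_pct >= limit_pct
```

This matches the reporter's observation that the problem only appears once the disks are full, i.e. once the cleanup path starts running.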
My ntservice configuration file: