phaag / go-nfdump

go-nfdump: A Go module to read and process nfdump files
BSD 2-Clause "Simplified" License

Can't open a given nffile which nfdump does #11

Closed gabrielmocan closed 4 months ago

gabrielmocan commented 5 months ago

Hi Pete,

I have a nffile that is properly decoded using classic nfdump 1.7.4-a16f86f but throws an error when using go-nfdump v0.0.4.

Below are some logs, and I'm attaching the sample.

root# nfdump -V
nfdump: Version: 1.7.4-a16f86f Options: NSEL-NEL ZSTD BZIP2 Date: 2024-04-07 15:15:55 +0200
root# nfdump -r nfcapd.202404090803 | head 
Date first seen            Event  XEvent Proto      Src IP Addr:Port          Dst IP Addr:Port     X-Src IP Addr:Port        X-Src IP Addr:Port   In Byte Out Byte
2024-04-09 08:08:07.000 <no-evt> <no-evt> UDP     170.245.223.82:56023 ->   104.237.172.31:443            0.0.0.0:0     ->          0.0.0.0:0        86000        0
2024-04-09 08:08:07.000 <no-evt> <no-evt> UDP      45.164.10.165:35258 ->   157.240.216.16:443            0.0.0.0:0     ->          0.0.0.0:0        1.3 M        0
2024-04-09 08:08:07.000 <no-evt> <no-evt> UDP      170.79.169.43:50342 ->    128.14.119.70:28571          0.0.0.0:0     ->          0.0.0.0:0        78000        0
2024-04-09 08:07:21.000 <no-evt> <no-evt> TCP     128.201.197.87:48740 ->  168.196.119.196:443            0.0.0.0:0     ->          0.0.0.0:0        1.2 M        0
2024-04-09 08:08:07.000 <no-evt> <no-evt> TCP      13.107.213.33:443   ->   190.89.233.250:53696          0.0.0.0:0     ->          0.0.0.0:0        1.5 M        0
2024-04-09 08:08:07.000 <no-evt> <no-evt> TCP     169.150.250.39:443   ->    170.245.67.26:52192          0.0.0.0:0     ->          0.0.0.0:0        1.5 M        0
2024-04-09 08:08:07.000 <no-evt> <no-evt> TCP     104.237.189.22:443   ->     168.194.15.8:60156          0.0.0.0:0     ->          0.0.0.0:0        1.5 M        0
2024-04-09 08:08:07.000 <no-evt> <no-evt> TCP     157.240.216.60:443   ->     45.182.170.9:7122           0.0.0.0:0     ->          0.0.0.0:0        4.3 M        0
2024-04-09 08:08:07.000 <no-evt> <no-evt> UDP    148.153.194.118:8953  ->   170.79.169.242:27538          0.0.0.0:0     ->          0.0.0.0:0        1.4 M        0

root# nfdumpNative nfcapd.202404090803
panic: runtime error: slice bounds out of range [:290] with capacity 172

goroutine 11 [running]:
github.com/phaag/go-nfdump.NewRecord(...)
    /go/pkg/mod/github.com/phaag/go-nfdump@v0.0.4/record.go:67
github.com/phaag/go-nfdump.(*NfFile).AllRecords.func1()
    /go/pkg/mod/github.com/phaag/go-nfdump@v0.0.4/nffile.go:280 +0x1655
created by github.com/phaag/go-nfdump.(*NfFile).AllRecords in goroutine 1
    /go/pkg/mod/github.com/phaag/go-nfdump@v0.0.4/nffile.go:268 +0x7b

broken.sample.zip

phaag commented 5 months ago

I'll have a look. Thanks for the sample! This always helps!

phaag commented 4 months ago

It was indeed a boundary check error in the Go decoding code. I fixed it in master; an updated version will follow. nfdump already has an integrated boundary check, but I improved that in nfdump master as well. The boundary check skips bad records. For some reason, one single record in your file is corrupt.
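For illustration, the general shape of such a check looks like the sketch below. This is not the actual record.go code; the 2-byte type / 2-byte size record header used here is a hypothetical layout, chosen only to show the pattern of validating a declared record size against the remaining buffer before slicing, and skipping (rather than panicking on) records that overrun the block.

```go
package main

import "fmt"

// Hypothetical fixed-size record header: 2-byte type, 2-byte size
// (little-endian). Not the real nffile record layout.
const recordHeaderSize = 4

// parseRecords walks a data block and refuses to slice past the end of
// the buffer when a record's declared size is corrupt. Blind slicing is
// what produced the original "slice bounds out of range" panic.
func parseRecords(block []byte) (records [][]byte, skipped int) {
	offset := 0
	for offset+recordHeaderSize <= len(block) {
		size := int(block[offset+2]) | int(block[offset+3])<<8
		if size < recordHeaderSize || offset+size > len(block) {
			// Corrupt record: its size runs past the block boundary.
			// Without a trustworthy size we cannot resync, so skip the rest.
			skipped++
			break
		}
		records = append(records, block[offset:offset+size])
		offset += size
	}
	return records, skipped
}

func main() {
	// Two well-formed 8-byte records, then one claiming 290 bytes
	// (like the reported [:290] with capacity 172) with only 6 left.
	block := []byte{
		1, 0, 8, 0, 0xaa, 0xbb, 0xcc, 0xdd,
		1, 0, 8, 0, 0x11, 0x22, 0x33, 0x44,
		1, 0, 34, 1, 0, 0,
	}
	recs, skipped := parseRecords(block)
	fmt.Println(len(recs), skipped) // prints "2 1"
}
```

The key point is that the bounds test happens before the slice expression, so a corrupt size becomes a counted skip instead of a runtime panic.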

gabrielmocan commented 4 months ago

@phaag I have yet another file that fails the boundary check and causes a panic. The sample is attached.

Record 64357: decoding error: Record body boundary check error
Record 64406: decoding error: Record body boundary check error
panic: runtime error: slice bounds out of range [:1635282] with capacity 1635280

goroutine 6 [running]:
github.com/phaag/go-nfdump.(*NfFile).AllRecords.func1()
        /Users/gabemocan-mw/go/pkg/mod/github.com/phaag/go-nfdump@v0.0.5/nffile.go:277 +0x14a0
created by github.com/phaag/go-nfdump.(*NfFile).AllRecords in goroutine 1
        /Users/gabemocan-mw/go/pkg/mod/github.com/phaag/go-nfdump@v0.0.5/nffile.go:270 +0x80
exit status 2

broken2.sample.zip

phaag commented 4 months ago

That sample is really corrupt! However, I need to exit gracefully or skip it.

phaag commented 4 months ago

A datablock is missing records. Do you have multiple processes writing to the same file?

% nfdump -v broken2.sample
File       : broken2.sample
Version    : 2 - not compressed
Created    : 2024-05-05 13:52:00
Created by : nfcapd
nfdump     : f1070400
encryption : no
Appdx blks : 1
Data blks  : 6
Checking data blocks
Block 5 num records 9255 != counted records: 9250

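The consistency check `nfdump -v` performs per block can be mirrored in a few lines of Go: walk the records actually present in a block's payload and compare the count against the count declared in the block header. A minimal sketch with a hypothetical block/record layout (not the real nffile v2 structs):

```go
package main

import "fmt"

// Hypothetical data block: a declared record count plus a payload of
// records, each with a 2-byte type and 2-byte little-endian size.
type dataBlock struct {
	numRecords int
	payload    []byte
}

// verifyBlock counts the records actually present in the payload and
// reports a mismatch against the declared count, like the
// "num records 9255 != counted records: 9250" check above.
func verifyBlock(b dataBlock) error {
	counted, offset := 0, 0
	for offset+4 <= len(b.payload) {
		size := int(b.payload[offset+2]) | int(b.payload[offset+3])<<8
		if size < 4 || offset+size > len(b.payload) {
			break // truncated or corrupt record: stop counting
		}
		counted++
		offset += size
	}
	if counted != b.numRecords {
		return fmt.Errorf("num records %d != counted records: %d",
			b.numRecords, counted)
	}
	return nil
}

func main() {
	// Header declares 2 records, but only 1 complete record fits.
	b := dataBlock{numRecords: 2, payload: []byte{1, 0, 8, 0, 0, 0, 0, 0, 1, 0}}
	fmt.Println(verifyBlock(b))
}
```

A mismatch like this points at a truncated write, which is why the follow-up question about multiple writers (or faulty hardware) matters.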
gabrielmocan commented 4 months ago

A datablock is missing records. Do you have multiple processes writing to the same file?

It's a single nfcapd -n ... -n ... -n ... process with multiple directories, one per exporter, so no, there are no multiple processes writing to the same file.

But I suspect something is wrong with the VM hosting this collector. I'm having segfaults I can't explain in my processing code, though there are no errors from the nfcapd process. Maybe a physical memory fault or faulty storage; I'm still not sure.

That sample is really corrupt! However, I need to exit gracefully or skip it.

For now, that would do the trick, just to avoid the panics.

phaag commented 4 months ago

I added another data block boundary check! It emits an error, but no longer crashes!
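For reference, the usual Go pattern behind this kind of fix: validate the requested range against the buffer length before slicing, and return an error instead of letting the slice expression panic. A generic sketch, not the actual nffile.go code:

```go
package main

import (
	"errors"
	"fmt"
)

var errBoundary = errors.New("data block boundary check error")

// nextSlice returns buf[offset:offset+size] only if the range lies
// fully inside buf; otherwise it returns an error instead of letting
// the slice expression panic ("slice bounds out of range").
func nextSlice(buf []byte, offset, size int) ([]byte, error) {
	if offset < 0 || size < 0 || offset+size > len(buf) {
		return nil, fmt.Errorf("%w: want %d bytes at offset %d, have %d",
			errBoundary, size, offset, len(buf))
	}
	return buf[offset : offset+size], nil
}

func main() {
	buf := make([]byte, 1635280)
	// Mirrors the reported panic: asking for 1635282 bytes from a
	// 1635280-byte buffer now yields an error instead of a crash.
	if _, err := nextSlice(buf, 0, 1635282); err != nil {
		fmt.Println("DataBlock error:", err)
	}
}
```

Callers can then log the error and skip the bad block, which is exactly the "spits an error, keeps going" behaviour described here.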

phaag commented 4 months ago

Have you checked the syslog file? any specific error messages of the collector?

gabrielmocan commented 4 months ago

I added another data block boundary check! It emits an error, but no longer crashes!

Thanks! Will try right away.

Have you checked the syslog file? any specific error messages of the collector?

Apparently there are no errors on the collector side; I run it in a dedicated container and the logs are clean.

gabrielmocan commented 4 months ago

I added another data block boundary check! It emits an error, but no longer crashes!

I guess the output could be less verbose; the Next block... line isn't needed, the error log is what matters.

go run . nfdumpNative ../tests/samples/broken2.sample
Next block - type: 3, records: 11892, size: 2097072  
Next block - type: 3, records: 11883, size: 2097060
Next block - type: 3, records: 11874, size: 2097048
Next block - type: 3, records: 11881, size: 2097100
Next block - type: 3, records: 11857, size: 2097004
Next block - type: 3, records: 9255, size: 1635280
Record 64357: decoding error: Record body boundary check error
Record 64406: decoding error: Record body boundary check error
DataBlock error: count: 9255, size: 1635280. Found: 9250, size: 1635280

phaag commented 4 months ago

Sorry - fixed.

gabrielmocan commented 3 months ago

But I suspect something is wrong with the VM hosting this collector. I'm having segfaults I can't explain in my processing code, though there are no errors from the nfcapd process. Maybe a physical memory fault or faulty storage; I'm still not sure.

@phaag Just to close the loop: this VM had faulty memory. That's why the files were so messed up! Still, we made the code more resilient, which is good anyway.