phaag / nfdump

Netflow processing tools
Other
770 stars 202 forks source link

Feature: add ndjson output #553

Closed sandervandegeijn closed 1 month ago

sandervandegeijn commented 1 month ago

Most log outputs for massive data use new line delimited json to be able to parse large files. I ran the json-log output with the -q parameter but had to drop that because of:

https://github.com/phaag/nfdump/issues/552

Unknowingly I got lucky, that probably drops the first line with the [. I didn't look to closely at the rest, so it looked like ndjson. Without the -q parameter the output starts with:

image

[ {doc1}, {doc2} ]

so that's one huge json array with objects within that for each document instead of

{doc1} {doc2}

Why don't you want to use the first one? Basically you need to deserialize the whole json doc in memory before you are able to split them. That's fine if it's a few hundred K, but for nfdump we're talking hundreds of megabytes. The ndjson format prevents that, a parser just reads until the newline and deserializes that particular document.

phaag commented 1 month ago

That bug is related to #552 and will be fixed with the same commit.

sandervandegeijn commented 1 month ago

Well, to be nit picky: they are somewhat different. Technically the current implementation without the -q statement does produce valid json! It's very difficult to parse with large files, but it's valid. Maybe it's even useful for some use cases. So this might not be a bug :) What could be a bug is that the -q option with json-log produces invalid json (because the opening bracket goes missing).

The ndjson format is different, you also could add another output option to deliver ndjson specifically. Mind the lack of the , at the end of a line with ndjson. Each line is just one json document, no extras :)

Thanks Peter!

phaag commented 1 month ago

There was no intention to drop the array markers [ and ] - its's simply a bug. That bug exists for a long time already, but nobody was obviously using json, json-log with -q. So, the next commit will bring back the array markers.

sandervandegeijn commented 1 month ago

Okay cool, so that fixes the json-output! That's great.

But my question something else: can we add ndjson output? That's new line delimited json and different from the current implementation.

phaag commented 1 month ago

Well - I just ask myself, why do we need so many versions of json? Multiline - arrays with multiple objects - comma separated, new line separated ?? I am sure, there are for sure some others .. It's not, that it could not be done, but does it make sense and is it the task of nfdump to deliver each possible flavour of json?

phaag commented 1 month ago

One option would be to switch the json-log to ndjson. Anyone opposed?

sandervandegeijn commented 1 month ago

If I have to choose with these amounts of data: ndjson all the way.

But you are breaking current functionality if you don't implement it alongside the current output. If you do implement it this way, it would be nice to give ndjson another output name so people will get explicit errors when using the current output which isn't supported anymore.

adulau commented 1 month ago

ndjson everywhere!

phaag commented 1 month ago

The output format json-log is pretty new (3 months) and not yet in any release. If people are in favour of the output format ndjson, I would propose to adapt that accordingly. Although I agree, that we should not break the functionality, I would argue, that this is still in the process finding the most common ground working for most people. I am not a json expert, as I work natively with my nfdump files. :)

@blkmajik, @sandervandegeijn , @adulau - unless anyone opposed, I will add these changes.

sandervandegeijn commented 1 month ago

yeah sure, if it's not in stable, people took a risk by using the main branch. Let's do it :)

The reason we like json is because it's easilly parsable and in our case we are ingesting it in opensearch (json based). But transforming it to XML of CSV is pretty easy because it is well structured. We are slowly(!) moving away from the nfsen frontend and are beginning to use opensearch for everything to have a single pane of glass.

phaag commented 1 month ago

The current master replaces -o json-log by -o ndjson according to the discussion. Please test, if this work for your cases. If you use any json output together with the maxmind DB, please rebuild the maxmind DB to replace the " by ' in org and city names. Otherwise it may break the json object.

sandervandegeijn commented 1 month ago

We got the latest main and the format is working like a charm. Thanks a lot! But has the output changed? In the output I'm now getting "tcp_flags":16 instead of the 6 positions and the letters?

phaag commented 1 month ago

The log format was always with the number. If people think, it's better with the alpha flags, I can change that.

phaag commented 1 month ago

Changed.

sandervandegeijn commented 1 month ago

I don't know which is better, but I'm very sure that it was with the lettered flags.

image

This worked. As to which is preferable? I find the letters easier to read as a human. (but who am I...)

phaag commented 1 month ago

the latest commit is adapted

phaag commented 1 month ago

Seems to work ..