phaag / nfdump

Netflow processing tools
Other
791 stars 206 forks source link

restore -o pipe? #548

Closed mteg closed 4 months ago

mteg commented 4 months ago

Summary: May I kindly ask to reconsider removing the CPU efficient -o pipe output format? If it needs some work in the code, I am happy to volunteer to try it and propose a pull request.

Long story: I work for a small telco company. We have been hit by some performance issues in our traffic accounting system that is built around nfcapd and nfdump. Today, after some very enjoyable hacking involving running different versions of nfdump under strace and ltrace, we discovered one of the bottlenecks is output formatting in nfdump. In our particular setup this means we use a lot of CPU time to convert binary data into text in vain, because our next step actually involves parsing the nfdump output into binary data again.

Anyway, ignoring the particularities of our case, the problem is as follows: With nfdump 1.7.3 using -o pipe we are able to dump ~20M records in some 20 seconds. With -o line it takes two minutes to process the same input data. A sixfold difference (!).

Apparently the issue is a consequence of overall flexibility of the formatting engine. If we try -o 'fmt:%ts %td %pr %sap -> %dap %pkt %byt %fl' we are still in the 2 minute ballpark. However leaving only -o 'fmt:%ts' reduces the dump time to approx. 60 seconds (... but obviously the output is not useful at all). Anything in between results in execution times somewhere in between, depending on the number of data fields used.

The current code from master is not any faster in this regard. However it removes support for -o pipe leaving us without a workaround for the future.

phaag commented 4 months ago

My answer is also a long story :)

The pipe format has been introduced 20 years ago. That's a long time! And it if did not really change a lot since then. However, nfdump as a program changed dramatically. The pipe format did not follow any standards and my decision was to remove the oldtimer. Up until the last code version in 1.7.4 the format was still not very flexible and could display a very small subset of a flow record. The idea was to replace it by a more flexible cvs version, which basically works as the flexible fmt format engine. It looks like - and I am sure, you are not the only one - that you are using text output for millions of records, which I did never expect. In general txt output is very inefficient as lots of numbers need to be converted and styled some way, that it also should look a bit pretty. That txt output is read again into another program and you need again lots of CPU to parse and to make it error prone. That's why txt output is basically always a kind of bottleneck.

I would like to keep the pipe format retired, as it deserves for the long time of duty. Although I see your point and would propose the following:

The fastest solution would be to read binary data and integrate a nfdump reader into your code. That's not that difficult but needs some work. I'd happily could support you here.

Another option would be to speed up the plain -o csv format to print in its default form the basic flow record. You would need to change the '|' for a ',' The extended csv is still available be would be slower, because of its flexibility similar to -o fmt The default csv could be integrated again statically as pipe was and you should get about the same speed.

In general I will take a look to speed up the txt output formats, so that large flow sets should print faster.

mteg commented 4 months ago

Thanks for the detailed answer, much appreciated!

The fastest solution would be to read binary data

I have seen the example (nfreader.c) and yes, it definitely would yield the lowest footprint, in terms of energy, CO2 produced, money spent on processing power and probably a few other metrics :-).

However using plain text as a glue between different system components has also some advantages. Much easier debugging, not requiring a C wizard ... and, most of the time, ease of adjusting for breaking changes in the individual system components. In the telco business, IT solutions live much longer than, say, in the smartphone world....

In general I will take a look to speed up the txt output formats

That would be awesome and I also thought that if we end up writing some C, then it might be better to contribute it on the nfdump side!

I was looking into output_fmt.c and all these String_....(FILE *stream, ...) functions mostly ending in a fprintf() call. I thought maybe restructure that ... and instead use these functions to build a big format string and a big va_list for vfprintf() and then just call it once. But then I fiddled with this "simulation" of output_fmt.c:

    for(i = 0; i<10000000; i += 4) {
        printf("%d", i);
        printf("%d", i+1); 
        printf("%d", i+2);
        printf("%d", i+3); 
        printf("\n");
    }

vs a "simulation" of output_pipe.c that actually does one big fprintf() https://github.com/phaag/nfdump/blob/4f409e984ad0c464b899f13cb58ded7d3cf7f71b/src/output/output_pipe.c#L96

    for(i = 0; i<10000000; i += 4) {
        printf("%d %d %d %d\n", i, i+1, i+2, i+3);
    }

... and there is no appreciable difference ie. my "simulations" are not really relevant to the real problem in nfdump.

So maybe the performance difference comes from the sole fact of making so many function calls out of the token_list[] array? Just a wild guess, but honestly a third quick "simulation" I tried didn't really prove it... Anyway an interesting experiment would be turn all the static String_SrcPort() { ... } functions into case STRING_SRCPORT: { ... } blocks and turn the call https://github.com/phaag/nfdump/blob/b6179a6f5e319363204d934fec292d468706a8fd/src/output/output_fmt.c#L656 into a big switch(). It's a dirty trick and there is limited hope for improvement. But who knows, maybe then it all optimizes better within the compiler ... and it might be cheap to test if only I manage to write a script to automate this code change :)

mteg commented 4 months ago

fprintf(stream, "%s", token_list[i].string_buffer);

Oh, one more thing. I reckon this could easily be changed to a fputs() call and already somewhat improve things. As far as I understand, in real life, with the default formats, this code mostly prints single spaces with all the overhead of a %s substitution. But apparently it won't save the world in the end, I just run 'fmt:%ts%td%pr%sap%dap%pkt%byt%fl' vs fmt:%ts %td %pr %sap -> %dap %pkt %byt %fl on 1 000 000 records and it's 5,7s vs 6,0s.

mteg commented 4 months ago

Anyway an interesting experiment would be turn all the static String_SrcPort() { ... } functions into case STRING_SRCPORT: { ... } blocks

I tested this and it actually made things slightly worse (by ~20%).

However I also discovered a 2x speed difference (!) between these two formats: -o 'fmt:%td %pr %sap -> %dap %pkt %byt %fl' -o 'fmt:%ts %td %pr %sap -> %dap %pkt %byt %fl'

If I change localtime() into gmtime() here: https://github.com/phaag/nfdump/blob/b6179a6f5e319363204d934fec292d468706a8fd/src/output/output_fmt.c#L881

Then I also encounter a 2x speedup even with the default format (%ts in use)!

Probably related to this: https://stackoverflow.com/questions/41544774/localtime-takes-24-times-more-than-gmtime-performance-issue-on-linux

phaag commented 4 months ago

Thanks for this hint! I will check that.

phaag commented 4 months ago

It also looks like, if it's also platform/OS dependant, how much time you gain. To replace localtime with gmtime is a bit tricky, as the timezone calculation has to be done at the time, the flow existed (start, tend). Besides the local time/gmtime issue, it looks like, that optimising the countless s/f/printf calls with a local buffer speeds up the json output between 40% up to 300% depending of the OS. I selected the json output, as it was the slowest one. The next iterations follows ...

mteg commented 4 months ago

To replace localtime with gmtime is a bit tricky, as the timezone calculation has to be done at the time, the flow existed (start, tend).

Agree 100%. But I guess for many users (not only us) it does not matter if it's UTC or local time being dumped ... as long as we know which one :). We will use %tsr for the time being and we're OK with that. But perhaps implementing new formats, for example %tsu and %teu, using gmtime() rather than localtime() would not hurt much. Just a suggestion.

phaag commented 4 months ago

I added a csv-fast output mode in the master repo, which replaces the pipe mode. It prints the msec for start and stop, and therefore does not need any calls to localtime and gmtime. The mode csv-fast is a fixed format and cannot be changed. Further optimisations to come ..

phaag commented 4 months ago

The implementation of csv-fast now is faster than old pipe.

mteg commented 4 months ago

Perfect.

After these few days of discussion and insights I sure already have enough data and options to optimize our use case.

I also added a PR #550 containing some of my fiddling with the code. Feel free to reject it outright - or include some of the ideas in the main code repo.

Thank you very much for your help!

phaag commented 4 months ago

Thanks for pointing out enhancements! I will close this issue.