timbray / topfew

Finds the field values (or combinations of values) which appear most often in a stream of records.
GNU General Public License v3.0

A list of fields doesn't do what I expect if a number is lower than a previous number #8

Closed rjw1 closed 3 years ago

rjw1 commented 3 years ago

If you don't list your fields in ascending numerical order, it displays the next field instead of the one you asked for.

 $  ./bin/tf -f 9,7 test/data/small
102 200 2177
99 200 781
96 200 72594
94 200 621
60 200 10660
56 200 3848
50 200 6958
47 200 10465
41 200 2477
31 200 7744

 $  ./bin/tf -f 9,1 test/data/small
102 200 2177
99 200 781
96 200 72594
94 200 621
60 200 10660
56 200 3848
50 200 6958
47 200 10465
41 200 2477
31 200 7744

 $  ./bin/tf -f 9,7,1 test/data/small
96 200 2177 "https://www.tbray.org/ongoing/When/202x/2020/04/29/Leaving-Amazon"
85 200 621 "https://www.tbray.org/ongoing/When/202x/2020/04/29/Leaving-Amazon"
81 200 781 "https://www.tbray.org/ongoing/When/202x/2020/04/29/Leaving-Amazon"
53 200 3848 "https://www.tbray.org/ongoing/When/202x/2020/04/29/Leaving-Amazon"
50 200 72594 "https://www.tbray.org/ongoing/serif.css"
41 200 72594 "https://www.tbray.org/ongoing/When/202x/2020/04/29/Leaving-Amazon"
37 200 2477 "https://www.tbray.org/ongoing/When/202x/2020/04/29/Leaving-Amazon"
30 200 7744 "https://www.tbray.org/ongoing/When/202x/2020/04/29/Leaving-Amazon"
28 "-" "-"
27 200 10660 "https://news.ycombinator.com/"

When in fact I would expect ./bin/tf -f 9,7 test/data/small to behave like awk '{print $9 " " $7}' test/data/small | sort | uniq -c | sort -rn | head

 $  awk '{print $9 " " $7}' test/data/small | sort | uniq -c | sort -rn | head
 136 200 /ongoing/When/202x/2020/04/29/Leaving-Amazon
 119 200 /ongoing/in-feed.xml
 112 200 /ongoing/serif.css
 112 200 /ongoing/ongoing.js
 109 200 /ongoing/Feed.png
 104 200 /ongoing/darkwater60.jpg
  95 200 /ongoing/picInfo.xml?o=https://www.tbray.org/ongoing/When/202x/2020/04/29/Leaving-Amazon
  89 200 /favicon.ico
  28 "-" 408
  12 200 /ongoing/picInfo.xml?o=https://old.tbray.org/ongoing/When/202x/2020/04/29/Leaving-Amazon

timbray commented 3 years ago

Having thought about this, I lean toward declaring this a non-issue and adjusting the documentation to say the field list has to be in increasing order. Two reasons:

First, for any permutation of the same field numbers, tf should generate the same occurrence counts and result list (if it doesn't, that'd be a bug for sure), so it's not obvious what the benefit of reordering is.

Second, the most important feature of tf is that it's fast, and since field extraction has to be done on every record, it's on the performance-critical path. At the moment, field extraction is highly optimized: it works through the record, picking up the fields listed in -f as it reaches them, and stops when it gets to a field number big enough that there are no more to come. Adding a step to shuffle the field strings around might be cheap, but it would add up since you'd have to do it for every line.

On the other hand, if there's an interesting use case that would be enabled by permuting the field list, I'd be happy to hear about it.
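
For readers following along, the single-pass extraction described above might look roughly like the Go sketch below. This is not topfew's actual code: extractAscending is a hypothetical helper, and splitting on spaces and tabs is an assumption. It only illustrates why a field list that is not ascending fails to pick up its later entries, since the scan never goes backwards.

```go
package main

import "fmt"

// extractAscending is a hypothetical helper, not topfew's real code.
// It walks the record once, keeps the 1-based fields listed in want
// (assumed to be strictly ascending), and stops after the last one.
func extractAscending(record string, want []int) []string {
	out := make([]string, 0, len(want))
	field, next := 0, 0 // current field number, index into want
	i := 0
	for i < len(record) && next < len(want) {
		// Skip separators (assumed here to be spaces and tabs).
		for i < len(record) && (record[i] == ' ' || record[i] == '\t') {
			i++
		}
		if i >= len(record) {
			break
		}
		// Scan to the end of the current field.
		start := i
		for i < len(record) && record[i] != ' ' && record[i] != '\t' {
			i++
		}
		field++
		if field == want[next] {
			out = append(out, record[start:i])
			next++ // once every wanted field is found, the loop ends early
		}
	}
	return out
}

func main() {
	rec := "a b c d e f g h i j"
	fmt.Println(extractAscending(rec, []int{7, 9})) // [g i]
	fmt.Println(extractAscending(rec, []int{9, 7})) // [i]: field 7 was already passed
}
```

In this sketch the out-of-order request silently loses a field. The real tool showed a different symptom (the following field appeared instead), so treat the sketch as an illustration of the ordering assumption, not a reproduction of the bug.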

rjw1 commented 3 years ago

If the ordering of the field list matters, then maybe topfew could sort it itself before executing. This is seemingly what topfew-rs is doing. (This is like when ecommerce sites get upset if you add spaces to a credit card number. They should just strip the spaces out and carry on with taking payment.)

Once the extraction of the fields and any computation is done, could topfew then display the fields in the order the user asked for, or is that also tied into the optimized extraction?

I could of course just pipe the results into awk to get them displayed in the order that I want.
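
A rough Go sketch of the idea in this comment: sort the field list once at startup, extract in ascending order as usual, and map the values back to the requested order only when printing. The names plan and reorder are hypothetical, and whether this fits topfew's internals is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// plan returns the requested fields sorted ascending, plus, for each
// requested position, the index of that field in the sorted list.
// Hypothetical helper; it runs once, not per record.
func plan(requested []int) (sorted []int, pos []int) {
	sorted = append([]int(nil), requested...)
	sort.Ints(sorted)
	pos = make([]int, len(requested))
	for i, f := range requested {
		pos[i] = sort.SearchInts(sorted, f)
	}
	return sorted, pos
}

// reorder maps values extracted in ascending field order back to the
// order the user asked for.
func reorder(extracted []string, pos []int) []string {
	out := make([]string, len(pos))
	for i, p := range pos {
		out[i] = extracted[p]
	}
	return out
}

func main() {
	sorted, pos := plan([]int{9, 7, 1})
	fmt.Println(sorted, pos) // [1 7 9] [2 1 0]
	// Values as they would come out of an ascending extraction of fields 1, 7, 9.
	fmt.Println(reorder([]string{"host", "/path", "200"}, pos)) // [200 /path host]
}
```

Because reorder would only be applied to the handful of result rows that get printed, the per-record extraction path stays untouched in this sketch.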

timbray commented 3 years ago

Ah, OK, so you could say -f 5,3 but you'd still get the third then fifth fields in the output. You're right, that wouldn't hurt performance, but it feels like sort of surprising/counterintuitive behavior.

rjw1 commented 3 years ago

Yeah, it would certainly be surprising, but at least it returns the data I asked for. Erroring and saying that the fields should be in ascending order would be okay.
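
For illustration, the "error out" option could be a check done once while parsing the -f argument. The sketch below is an assumption about how that might look in Go; parseFields and its error messages are hypothetical and not taken from topfew.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFields is a hypothetical parser for a "-f"-style argument such as
// "7,9". It rejects lists that are not strictly ascending.
func parseFields(spec string) ([]int, error) {
	parts := strings.Split(spec, ",")
	fields := make([]int, 0, len(parts))
	for _, p := range parts {
		n, err := strconv.Atoi(strings.TrimSpace(p))
		if err != nil || n < 1 {
			return nil, fmt.Errorf("invalid field number %q", p)
		}
		// Each field must be larger than the one before it.
		if len(fields) > 0 && n <= fields[len(fields)-1] {
			return nil, fmt.Errorf("field list %q must be in ascending order", spec)
		}
		fields = append(fields, n)
	}
	return fields, nil
}

func main() {
	for _, spec := range []string{"7,9", "9,7"} {
		fields, err := parseFields(spec)
		fmt.Println(fields, err)
	}
}
```

The check costs nothing per record, since it runs before any input is read.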

timbray commented 3 years ago

OK, will do that. BTW, what's your use case?

rjw1 commented 3 years ago

I normally want to see the consequences of an incident, so the data I want to see first is the HTTP response code and then the other info afterwards. Most log files don't put the response code earlier.