timbray / topfew

Finds the field values (or combinations of values) which appear most often in a stream of records.
GNU General Public License v3.0

Quoted fields #28

Closed timbray closed 6 months ago

timbray commented 6 months ago

Personally, I mostly use tf on Apache https access_log files and rely on the default space separation. This mostly works, but not always; see the following two lines:

i577a483c.versanet.de - - [12/Mar/2007:08:03:37 -0800] "GET /ongoing/ongoing.atom HTTP/1.1" 304 - "-" "NetNewsWire/2.1 (Mac OS X; http://ranchero.com/netnewswire/)"
105.66.1.178 - - [19/Apr/2020:06:38:44 -0700] "-" 408 156 "-" "-"

In the first one, the target URL is (space-separated) field number 7. In the second one, which I believe represents someone connecting to the server and not doing anything till it times out, there is no HTTP verb and field 7 is the HTTP status signaling timeout.
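To make the mismatch concrete, here is a minimal Go sketch of plain space splitting applied to the two lines above. The helper name `spaceField` is made up for illustration and is not topfew's actual code; it only assumes that fields are 1-based and separated by single spaces.

```go
package main

import (
	"fmt"
	"strings"
)

// spaceField returns the n-th (1-based) space-separated field of line,
// or "" if the line has fewer than n fields.
// Hypothetical helper for illustration; not taken from topfew.
func spaceField(line string, n int) string {
	fields := strings.Split(line, " ")
	if n < 1 || n > len(fields) {
		return ""
	}
	return fields[n-1]
}

func main() {
	full := `i577a483c.versanet.de - - [12/Mar/2007:08:03:37 -0800] "GET /ongoing/ongoing.atom HTTP/1.1" 304 - "-" "NetNewsWire/2.1 (Mac OS X; http://ranchero.com/netnewswire/)"`
	timeout := `105.66.1.178 - - [19/Apr/2020:06:38:44 -0700] "-" 408 156 "-" "-"`

	fmt.Println(spaceField(full, 7))    // prints /ongoing/ongoing.atom
	fmt.Println(spaceField(timeout, 7)) // prints 408
}
```

The same field number yields a URL for one record and a status code for the other, because the quoted request occupies three space-separated fields in the first line and only one in the second.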

If topfew could recognize quoted fields, then field 6 in the first example would be GET /ongoing/ongoing.atom HTTP/1.1 and in the second would be -, which would be more correct from the point of view of topfew processing. So I think there needs to be a -q option, or some such, to ask topfew to process quote-delimited, space-separated fields properly.
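One way such quote-aware splitting could work is sketched below in Go. This is an assumption about the general technique, not the implementation that landed in topfew: the function name `splitQuoted` is hypothetical, the surrounding quotes are dropped from the field value, and backslash-escaped quotes inside a quoted field are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// splitQuoted splits line on single spaces, but treats everything between
// a pair of double quotes as part of one field. The quote characters
// themselves are dropped. Escaped quotes (\") are NOT handled; this is a
// simplifying assumption of the sketch.
func splitQuoted(line string) []string {
	var fields []string
	var cur strings.Builder
	inQuotes := false
	for _, r := range line {
		switch {
		case r == '"':
			inQuotes = !inQuotes // entering or leaving a quoted field
		case r == ' ' && !inQuotes:
			fields = append(fields, cur.String())
			cur.Reset()
		default:
			cur.WriteRune(r)
		}
	}
	return append(fields, cur.String())
}

func main() {
	full := `i577a483c.versanet.de - - [12/Mar/2007:08:03:37 -0800] "GET /ongoing/ongoing.atom HTTP/1.1" 304 - "-" "NetNewsWire/2.1 (Mac OS X; http://ranchero.com/netnewswire/)"`
	timeout := `105.66.1.178 - - [19/Apr/2020:06:38:44 -0700] "-" 408 156 "-" "-"`

	// Field 6 is index 5 in the zero-based slice.
	fmt.Println(splitQuoted(full)[5])    // prints GET /ongoing/ongoing.atom HTTP/1.1
	fmt.Println(splitQuoted(timeout)[5]) // prints -
}
```

With this splitting, field 6 is the whole request in both records and field 7 is the status code in both. The bracketed timestamp still spans fields 4 and 5, since only double quotes are treated as delimiters here.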

timbray commented 6 months ago

Fixed in #29