kassner / log-parser

PHP Web Server Log Parser Library
Apache License 2.0

Throws Exceptions when Encountering Real World Logs #50

Closed: astorm closed this issue 4 years ago

astorm commented 4 years ago

Hello there -- first off, thank you for building this and saving us all the trouble of building our own regular expressions to parse Apache's log files.

When I tried using this package on my real-world Apache logs, it mostly worked. However, a number of lines failed to parse and threw an exception in my program. Here's one example.

My log format looks like this:

$parser->setFormat('%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"');

Here's one line that failed to parse:

199.195.254.38 - - [27/Sep/2020:19:27:26 +0000] "GET ../../proc/ HTTP" 400 506 "-" "-"

And here are a few others:

240e:d9:d800:200::d4 - - [29/Sep/2020:19:52:18 +0000] "\x16\x03\x01" 501 290 "-" "-"

172.105.43.21 - - [30/Sep/2020:01:05:53 +0000] "\x16\x03\x01" 501 290 "-" "-"

Is there a way to configure this library to be less strict when trying to parse these log lines?

If not, do you have any time/interest in enhancing the functionality of this library so it can handle cases like these?

kassner commented 4 years ago

Hi. This case is similar to #49, which involves badly formed HTTP requests. Given they're not technically valid, I don't know how much value you get from parsing them, but the $parser->addPattern('%r', '(?P<request>.+)'); trick mentioned there is a good workaround when the main parsing fails.
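
To illustrate why that override helps, here is a minimal sketch using plain preg_match, with no dependency on the library. The $strict pattern is an assumption made up for this example (it is not the library's actual %r pattern); the $lax pattern is the one from the addPattern call above, anchored so it can be tested on its own.

```php
<?php
// Hypothetical strict pattern: METHOD, path, and a versioned protocol.
// Illustration only; this is NOT the pattern the library ships with.
$strict = '#^(?P<request>[A-Z]+ \S+ HTTP/\d+(?:\.\d+)?)$#';

// The lax pattern from the workaround: accept any non-empty request string.
$lax = '#^(?P<request>.+)$#';

// One well-formed request (invented for comparison), followed by the
// request fields from the failing log lines quoted in this issue.
$requests = [
    'GET /index.html HTTP/1.1',
    'GET ../../proc/ HTTP',
    '\x16\x03\x01',
];

$results = [];
foreach ($requests as $request) {
    $results[$request] = [
        'strict' => preg_match($strict, $request) === 1,
        'lax'    => preg_match($lax, $request) === 1,
    ];
}

// Print one row per request so the difference is visible at a glance.
foreach ($results as $request => $matched) {
    printf(
        "%-26s strict=%s lax=%s\n",
        $request,
        var_export($matched['strict'], true),
        var_export($matched['lax'], true)
    );
}
```

Under these assumptions, only the well-formed request passes the strict pattern, while all three pass the lax one. That is the trade-off: the lax pattern no longer tells you whether the request was valid.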

I'd keep parsing logs with the format you have and have a second instance of LogParser configured with that addPattern call. When the first parser fails, parse the line again with the second instance to extract things like the IP address and User-Agent.

Something like:

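use Kassner\LogParser\FormatException;
use Kassner\LogParser\LogParser;

$format = '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"';

// Strict parser: the format as-is.
$parser = new LogParser();
$parser->setFormat($format);

// Lax parser: same format, but %r accepts any request string.
// Register the override before setFormat() so it is in place when the format is set.
$laxParser = new LogParser();
$laxParser->addPattern('%r', '(?P<request>.+)');
$laxParser->setFormat($format);

foreach ($lines as $line) {
    try {
        try {
            $entry = $parser->parse($line);
        } catch (FormatException $e) {
            // Malformed request: retry with the lax parser.
            $entry = $laxParser->parse($line);
        }
    } catch (FormatException $e) {
        // Neither parser matched; skip the line.
        continue;
    }

    // process $entry
}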
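
If the fallback is needed in more than one place, the nested try/catch above can be folded into a small helper. The sketch below is one possible shape, not part of the library: parseWithFallback and the two stub parsers are hypothetical names introduced here, and the helper catches any Throwable so that it runs without the library installed. With the real parsers you would pass [$parser, 'parse'] and [$laxParser, 'parse'], and might narrow the catch to FormatException.

```php
<?php
/**
 * Try each parser callable in order and return the first successful result.
 * Returns [index of the parser that succeeded, entry], or null if all failed.
 * (Hypothetical helper, not part of kassner/log-parser.)
 */
function parseWithFallback(array $parsers, string $line): ?array
{
    foreach ($parsers as $index => $parse) {
        try {
            return [$index, $parse($line)];
        } catch (\Throwable $e) {
            // This parser rejected the line; fall through to the next one.
        }
    }
    return null;
}

// Stub parsers standing in for $parser->parse() and $laxParser->parse().
$strict = function (string $line): array {
    if (!preg_match('#"[A-Z]+ \S+ HTTP/[\d.]+"#', $line)) {
        throw new \RuntimeException('strict parser rejected the line');
    }
    return ['line' => $line];
};
$lax = function (string $line): array {
    if (!preg_match('#"(?P<request>.+?)" \d{3} #', $line, $m)) {
        throw new \RuntimeException('lax parser rejected the line');
    }
    return ['request' => $m['request']];
};

// The first line is invented for comparison; the second is from this issue.
$good = parseWithFallback([$strict, $lax], '1.2.3.4 - - [27/Sep/2020:19:27:26 +0000] "GET / HTTP/1.1" 200 5 "-" "-"');
$odd  = parseWithFallback([$strict, $lax], '199.195.254.38 - - [27/Sep/2020:19:27:26 +0000] "GET ../../proc/ HTTP" 400 506 "-" "-"');
$junk = parseWithFallback([$strict, $lax], 'not a log line');
```

Returning the index alongside the entry lets the caller count how many lines needed the lax parser, which can be a useful signal about how much malformed traffic the server is seeing.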

astorm commented 4 years ago

"This case is similar to #49, which involves badly formed HTTP requests."

"I'd keep parsing logs with the format you have and have a second instance ..."

While it's not what I wanted to hear -- that's a fair philosophy. Closing out.