kassner / log-parser

PHP Web Server Log Parser Library
Apache License 2.0

Help with Pattern Matching? #59

Closed SAH62 closed 2 weeks ago

SAH62 commented 1 year ago

This is related to issues #58 and #50.

As described in #58, I'm getting some malformed HTTP requests in my nginx server access log, like these:

162.243.128.19 - - [18/Feb/2023:06:38:30 -0500] "MGLNDD_70.110.25.35_80" 400 150 "-" "-"
159.65.204.184 - - [19/Feb/2023:02:54:04 -0500] "\x16\x03\x01\x00{\x01\x00\x00w\x03\x03\x03VJT\xE3REk\xFE\x89\x5C\xCE\xFF\xBBh\xAF\xA5}@t6\x9D\xBA\xAA3\x22rWR\xAC\xB8\x90\x00\x00\x1A\xC0/\xC0+\xC0\x11\xC0\x07\xC0\x13\xC0\x09\xC0\x14\xC0" 400 150 "-" "-"
192.241.225.22 - - [19/Feb/2023:05:16:37 -0500] "SSH-2.0-Go" 400 150 "-" "-"

I've modified my code using the fix suggested in #58 and described in #50:

$laxParser = new \Kassner\LogParser\LogParser();
$laxParser->setFormat('%h %l %u %t "%r" %>s %O "%{Referer}i" \"%{User-Agent}i"');
$laxParser->addPattern('%r', '(?P<request>.+)');
...
$entry = $laxParser->parse($line);

However, the lines above still cause a FormatException to be thrown. Did I miss something, or is another pattern match failing here? If addPattern worked, the request pattern should match one or more of any character, right?
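One way to narrow this down is to test the request pattern on its own, outside the library. The following is a minimal sketch, assuming the default %r pattern is the one visible in the compiled expression quoted later in this thread; the variable names are illustrative and no library code is involved.

```php
<?php
// Default %r pattern (assumption: copied from the compiled expression
// quoted in this thread, not from the library source).
$defaultRequest = '(?P<request>(?:(?:[A-Z]+) .+? HTTP/(1\.0|1\.1|2\.0))|-|)';
// Lax replacement passed to addPattern() above.
$laxRequest = '(?P<request>.+)';

$requests = [
    'GET /index.html HTTP/1.1', // well-formed request line
    'MGLNDD_70.110.25.35_80',   // malformed
    'SSH-2.0-Go',               // malformed
];

$results = [];
foreach ($requests as $request) {
    // Anchor on both ends so the whole request field has to match.
    $results[$request] = [
        'default' => preg_match('#^' . $defaultRequest . '$#', $request) === 1,
        'lax'     => preg_match('#^' . $laxRequest . '$#', $request) === 1,
    ];
}

foreach ($results as $request => $r) {
    printf(
        "%-26s default=%s lax=%s\n",
        $request,
        var_export($r['default'], true),
        var_export($r['lax'], true)
    );
}
```

If the lax pattern accepts every sample here but the parser still throws, the pattern itself is probably fine and the problem lies in whether the parser is actually using it.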

SAH62 commented 1 year ago

I see what's happening here. Calling $laxParser->addPattern doesn't update $pcreFormat, so when $laxParser->parse is called it still uses the value of $pcreFormat that was built when $laxParser->setFormat was first called. The addPattern function needs to update $pcreFormat as well. As a workaround, calling setFormat a second time seems to work:

$laxParser = new \Kassner\LogParser\LogParser();
$laxParser->setFormat('%h %l %u %t "%r" %>s %O "%{Referer}i" \"%{User-Agent}i"');
$laxParser->addPattern('%r', '(?P<request>(?:(?:[A-Z]+) .+? HTTP\/(1\.0|1\.1|2\.0))|-|.+)');
$laxParser->setFormat('%h %l %u %t "%r" %>s %O "%{Referer}i" \"%{User-Agent}i"');
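The stale-expression behaviour can be reproduced in isolation. This is a rough sketch only: TinyParser is a hypothetical stand-in written for this illustration, not the library's LogParser class, and it assumes the compiled expression is built in setFormat() and nowhere else.

```php
<?php
// Hypothetical stand-in for a parser that compiles its regex only in setFormat().
final class TinyParser
{
    /** @var array<string, string> placeholder => sub-pattern */
    private array $patterns = ['%r' => '(?P<request>[A-Z]+ \S+ HTTP/\d\.\d)'];
    private string $pcreFormat = '';

    public function setFormat(string $format): void
    {
        // The compiled expression is built here and nowhere else.
        $this->pcreFormat = '#^' . strtr($format, $this->patterns) . '$#';
    }

    public function addPattern(string $placeholder, string $pattern): void
    {
        // Only the lookup table changes; $pcreFormat keeps its old value.
        $this->patterns[$placeholder] = $pattern;
    }

    public function matches(string $line): bool
    {
        return preg_match($this->pcreFormat, $line) === 1;
    }
}

$parser = new TinyParser();
$parser->setFormat('"%r"');
$parser->addPattern('%r', '(?P<request>.+)');
$staleResult = $parser->matches('"SSH-2.0-Go"'); // old expression still in use

$parser->setFormat('"%r"');                      // recompile with the new pattern
$freshResult = $parser->matches('"SSH-2.0-Go"');

var_dump($staleResult, $freshResult);
```

The second setFormat() call is what picks up the new pattern, which matches the workaround above.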

SAH62 commented 1 year ago

Something like this might work, but it changes how a new pattern is added: instead of replacing the existing pattern, the new one is inserted into it. In this example, I want to add '.+' to the end of the request pattern:

    public function addPattern(string $placeholder, string $pattern): void
    {
        // Update the pattern.
        $oldPattern = $this->patterns[$placeholder];
        // Insert the new pattern string just before the closing parenthesis of the existing pattern.
        $newPattern = substr_replace($oldPattern, $pattern, -1, 0);
        $this->patterns[$placeholder] = $newPattern;

        // Update the regular expression to include the new pattern.
        // strpos() is case-sensitive (unlike stripos()) and returns false when nothing is found.
        $start = strpos($this->pcreFormat, $oldPattern);
        if ($start !== false) {
            $this->pcreFormat = substr_replace($this->pcreFormat, $newPattern, $start, strlen($oldPattern));
        }
        $this->updateIpPatterns();
    }

Use:

$laxParser = new \Kassner\LogParser\LogParser();
$laxParser->setFormat('%h %l %u %t "%r" %>s %O "%{Referer}i" \"%{User-Agent}i"');
$laxParser->addPattern('%r', '.+|');

This will change the value of $this->pcreFormat from this:

"#^(?P<host>[a-zA-Z0-9\-\._:]+) (?P<logname>(?:-|[\w-]+)) (?P<user>(?:-|[\w\-\.]+)) \[(?P<time>\d{2}/(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)/\d{4}:\d{2}:\d{2}:\d{2} (?:-|\+)\d{4})\] "(?P<request>(?:(?:[A-Z]+) .+? HTTP/(1\.0|1\.1|2\.0))|-|)" (?P<status>\d{3}|-) (?P<sentBytes>[0-9]+) "(?P<HeaderReferer>.*?)" \"(?P<HeaderUserAgent>.*?)"$#"

to this:

"#^(?P<host>[a-zA-Z0-9\-\._:]+) (?P<logname>(?:-|[\w-]+)) (?P<user>(?:-|[\w\-\.]+)) \[(?P<time>\d{2}/(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)/\d{4}:\d{2}:\d{2}:\d{2} (?:-|\+)\d{4})\] "(?P<request>(?:(?:[A-Z]+) .+? HTTP/(1\.0|1\.1|2\.0))|-|.+|)" (?P<status>\d{3}|-) (?P<sentBytes>[0-9]+) "(?P<HeaderReferer>.*?)" \"(?P<HeaderUserAgent>.*?)"$#"

Note the additional pattern ('.+') that's been appended to the end of the old request pattern. Will this work?
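The string manipulation and the resulting match can be checked outside the library. A minimal sketch, assuming the default %r pattern shown in the "before" expression above; the line regex is deliberately reduced to the quoted request plus the status code, so it is not the full expression.

```php
<?php
// Default %r pattern, taken from the "before" expression above.
$oldPattern = '(?P<request>(?:(?:[A-Z]+) .+? HTTP/(1\.0|1\.1|2\.0))|-|)';

// Offset -1 with length 0 inserts just before the final character,
// which here is the closing ")" of the named group.
$newPattern = substr_replace($oldPattern, '.+|', -1, 0);

// Reduced line regex for illustration: quoted request, then status code.
$lineRegex = '#^"' . $newPattern . '" (?P<status>\d{3}|-)$#';

$samples = [
    '"GET / HTTP/1.1" 200',
    '"SSH-2.0-Go" 400',
    '"MGLNDD_70.110.25.35_80" 400',
];

$captured = [];
foreach ($samples as $sample) {
    // Store the captured request, or null when the line does not match.
    $captured[$sample] = preg_match($lineRegex, $sample, $m) === 1 ? $m['request'] : null;
}

print_r($captured);
```

Under these assumptions the extended pattern accepts both the well-formed and the malformed samples. Note that '.+' is greedy and is not limited to non-quote characters, so on a full log line it relies on the fields that follow to stop it at the right place.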

kassner commented 1 year ago

Hi @SAH62.

Thank you for the report. I've just merged #60 into master, so the PCRE pattern is updated any time setFormat or addPattern is called, regardless of the order.

As for your last question, on how to extract the extra text after the request line, I'd suggest creating a new pattern for the extra text that matches any character except a double quote ("):

<?php

namespace Kassner\Teste\LogParser\Issue;

use Kassner\LogParser\LogParser;

class Issue59Test extends \PHPUnit\Framework\TestCase
{
    public function testRequestGarbage()
    {
        $parser = new LogParser('%h %l %u %t "%r%x" %>s %O "%{Referer}i" "%{User-Agent}i"');
        $parser->addPattern('%x', '(?<garbage>[^\"]*)');

        $entry = $parser->parse('66.249.74.132 - - [10/Sep/2013:15:50:06 +0000] "GET /electronics/cameras/accessories/universal-camera-charger HTTP/1.1" 200 12347 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"');
        $this->assertEquals('', $entry->garbage);
        $this->assertEquals('GET /electronics/cameras/accessories/universal-camera-charger HTTP/1.1', $entry->request);

        $entry = $parser->parse('66.249.74.132 - - [10/Sep/2013:15:50:06 +0000] "GET /electronics/cameras/accessories/universal-camera-charger HTTP/1.1some random garbage with \'single\' quotes" 200 12347 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"');
        $this->assertEquals('some random garbage with \'single\' quotes', $entry->garbage);
        $this->assertEquals('GET /electronics/cameras/accessories/universal-camera-charger HTTP/1.1', $entry->request);
    }
}

I hope that works.

Thank you.