spatie / robots-txt

Determine if a page may be crawled from robots.txt, robots meta tags and robot headers
https://spatie.be/en/opensource/php
MIT License
219 stars 36 forks

Prevent parsing of large HTML blobs from causing out-of-memory exceptions #20

Closed mattiasgeniar closed 4 years ago

mattiasgeniar commented 4 years ago

This is a pragmatic approach to issue #19

The RobotsMeta class scans the HTML source code for robots meta tags. This PR assumes that the <head> is contained within the very first 1,048,576 characters (1024 * 1024) of $html, which seems like a reasonable bet.

A typical homepage has its head section within the first 10,000 characters, so the 1024 * 1024 limit gives plenty of headroom without causing out-of-memory exceptions.
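In rough terms, the change boils down to truncating the input before parsing (a sketch of the idea, not the exact diff):

// Only scan the first 1 MiB (1,048,576 characters) for meta tags.
$html = substr($html, 0, 1024 * 1024);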

brendt commented 4 years ago

Maybe we could add a fallback if <head> is not found within the first 1024 * 1024 bytes?

brendt commented 4 years ago

Something like this perhaps?

public static function create(string $source): self
{
    // Only consider the first 1 MiB; parsing huge blobs can exhaust memory.
    $sourceFirstPart = substr($source, 0, 1 * 1024 * 1024);

    // If the closing </head> tag sits inside that part, the whole <head>
    // block is there too, so the rest of the document can be discarded.
    if (strpos($sourceFirstPart, '</head>') !== false) {
        $source = $sourceFirstPart;
    }

    return new self($source);
}

Edit: I think checking for </head> is the correct thing to do, since we want to make sure the whole <head> block is contained in our subpart.
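For context, here's how that factory would be used (a hedged sketch; the mayIndex() call is assumed from the package's public API and may differ):

use Spatie\Robots\RobotsMeta;

// Potentially many megabytes of HTML.
$html = file_get_contents('https://example.com');

// With the snippet above, only the first 1 MiB is kept when </head> falls inside it.
$meta = RobotsMeta::create($html);

$meta->mayIndex(); // assumed accessor; check the readme for the exact API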

freekmurze commented 4 years ago

@mattiasgeniar change it like @brendt suggested, and we're good to go!

mattiasgeniar commented 4 years ago

I like that approach too, but I see a couple of drawbacks. Let me withdraw this PR and give it some more thought; I think a better fix is possible.

willemwollebrants commented 4 years ago

Just a note: the <head> tag is optional in some cases:

<!DOCTYPE html>
<meta name="robots" content="noindex">
<title>Test</title>
<h1>Hello</h1>
This is a valid HTML document.
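Against a document like that, the strpos check from the snippet above never finds a </head>, so nothing gets truncated and the full source is parsed. A quick sketch:

$doc = <<<HTML
<!DOCTYPE html>
<meta name="robots" content="noindex">
<title>Test</title>
<h1>Hello</h1>
HTML;

// No closing head tag anywhere, so the guard never fires:
var_dump(strpos($doc, '</head>') !== false); // bool(false)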