Closed mattiasgeniar closed 4 years ago
Maybe we could add a fallback if <head> is not found within the first 1024 * 1024 bytes?
Something like this perhaps?
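public static function create(string $source): self
{
    // Only look at the first 1 MiB of the source.
    $sourceFirstPart = substr($source, 0, 1 * 1024 * 1024);

    // Use the shortened source only if it contains the complete <head> block;
    // otherwise keep the full source as a fallback.
    if (strpos($sourceFirstPart, '</head>') !== false) {
        $source = $sourceFirstPart;
    }

    return new self($source);
}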
Edit: I think checking for </head> is the correct thing to do: we want to make sure the whole <head> block is contained in our subpart.
@mattiasgeniar change it like @brendt suggested, and we're good to go!
I like that approach too, but I see a couple of drawbacks. Let me revoke this PR and give it some more thought; I think there's a better fix possible.
Just a note: the head tag is optional in some cases:
<!DOCTYPE html>
<meta name="robots" content="noindex">
<title>Test</title>
<h1>Hello</h1>
This is a valid HTML document.
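A minimal sketch of how the </head> check suggested earlier in this thread would behave on such a document. The helper name headPortion() and its $limit parameter are hypothetical and not part of the package; the use of stripos() instead of strpos() is also an assumption, made so that an upper-case closing tag matches too. When no closing head tag is found in the first part, the full source is kept, so the meta tag is still visible to the parser.

```php
<?php

// Hypothetical helper mirroring the suggested create() logic; not part of the package.
function headPortion(string $source, int $limit = 1024 * 1024): string
{
    $firstPart = substr($source, 0, $limit);

    // Case-insensitive search (an assumption beyond the original snippet),
    // so that </HEAD> is matched as well.
    if (stripos($firstPart, '</head>') !== false) {
        return $firstPart;
    }

    // No closing head tag in the first part: fall back to the full source.
    return $source;
}

$headless = "<!DOCTYPE html>\n"
    . "<meta name=\"robots\" content=\"noindex\">\n"
    . "<title>Test</title>\n"
    . "<h1>Hello</h1>";

// A tiny limit is used here only to make the fallback visible.
// The document has no </head>, so nothing is cut off.
var_dump(headPortion($headless, 32) === $headless); // bool(true)
```

The fallback costs nothing for head-less documents that are small, but a very large document without a closing head tag would still be processed in full.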
This is a pragmatic approach to issue #19.
The RobotsMeta class looks at the HTML source code to find HTML meta tags. This PR assumes that the <head> is contained within the very first 1,048,576 characters (1024 * 1024) of $html, which seems like a reasonable bet.
A typical homepage will have the head section within the first 10,000 characters; the 1024 * 1024 rule gives plenty of headroom without causing out-of-memory exceptions.
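For illustration, a rough sketch of the truncation this description refers to. The function name firstMegabyte() is hypothetical and this is not the actual diff of the PR. Note that PHP's substr() counts bytes rather than characters, so the limit is 1,048,576 bytes; for locating ASCII meta tags that difference should not matter.

```php
<?php

// Hypothetical helper illustrating the truncation described above;
// the name and signature are assumptions, not the package's actual code.
function firstMegabyte(string $html): string
{
    // substr() works on bytes, so this keeps at most 1,048,576 bytes of $html.
    return substr($html, 0, 1024 * 1024);
}

// A page with a small head section followed by a body of more than 2 MiB.
$html = '<html><head><meta name="robots" content="noindex"></head><body>'
    . str_repeat('x', 2 * 1024 * 1024)
    . '</body></html>';

$truncated = firstMegabyte($html);

var_dump(strlen($truncated));                      // int(1048576)
var_dump(strpos($truncated, 'noindex') !== false); // bool(true)
```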