spatie / robots-txt

Determine if a page may be crawled from robots.txt, robots meta tags and robot headers
https://spatie.be/en/opensource/php
MIT License

Out of Memory exception on large pieces of HTML #19

Closed by mattiasgeniar 4 years ago

mattiasgeniar commented 4 years ago

An interesting thing happens when you read a very large HTML document and pass it to the findRobotsMetaTagLine($html) method: it runs out of memory.

Allowed memory size of 268435456 bytes exhausted

The problem occurs here:

$lines = explode(PHP_EOL, $html);

But the real issue is the $html variable, which might be several hundred thousand lines long.

My initial reaction was: I'll just read that string in chunks. There's fread for files, but not for strings.

What's the safest way to read a string in chunks to avoid out of memory errors?

(I really want to avoid looping a string with $html[0], $html[1], ... )

pieterbeulque commented 4 years ago

My first attempt would be to loop over the string with strpos:

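$lines = [];
// Parenthesize the assignment; otherwise $newline receives the boolean comparison result.
while (($newline = strpos($html, PHP_EOL)) !== false) {
  $lines[] = substr($html, 0, $newline);
  // Skip past the line ending so the loop makes progress.
  $html = substr($html, $newline + strlen(PHP_EOL));
}
$lines[] = $html; // whatever remains after the last line ending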

But I think that could be significantly slower.
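
A possible variant of the strpos idea, sketched here under assumptions rather than taken from the library: pass an offset to strpos so the string is never rewritten, and yield each line from a generator so no $lines array is built. The function name iterateLines and the "\n" default are hypothetical; HTML fetched from the web often uses "\n" regardless of the platform's PHP_EOL.

```php
<?php

/**
 * Hypothetical helper (not part of spatie/robots-txt): yields each line of
 * $text by scanning forward with an offset instead of rewriting the string.
 */
function iterateLines(string $text, string $eol = "\n"): Generator
{
    $offset = 0;
    $eolLength = strlen($eol);

    while (($newline = strpos($text, $eol, $offset)) !== false) {
        // Copy only the current line, never the rest of the document.
        yield substr($text, $offset, $newline - $offset);
        $offset = $newline + $eolLength;
    }

    // Yield the remainder after the last line ending, if there is one.
    if ($offset < strlen($text)) {
        yield substr($text, $offset);
    }
}

$html = "<html>\n<head>\n<meta name=\"robots\" content=\"noindex\">\n</head>";
$lines = iterator_to_array(iterateLines($html), false);
```

Only one line is held as a copy at a time, so peak memory stays close to the size of $html itself. With "\r\n" input the yielded lines keep a trailing "\r", which a trim() on the caller's side removes.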

fntneves commented 4 years ago

What about using stream resources? Write it to a temp file and then use stream functions (e.g., fgets).

Your bottleneck here is memory. Working with long strings in memory will eventually hit the limit, and raising PHP's memory limit only moves that limit; it will not let you handle strings of arbitrary size. Stream resources will.
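
As a rough sketch of the stream idea, assuming PHP's built-in php://temp wrapper is acceptable in place of a hand-managed temp file: php://temp keeps data in memory up to a threshold and spills to a temporary file beyond it. The function name findRobotsMetaLineViaStream and the $maxMemory parameter are hypothetical, not part of the library.

```php
<?php

/**
 * Hypothetical helper: copies $html into a php://temp stream and returns the
 * first line that starts with a robots meta tag, or null if there is none.
 */
function findRobotsMetaLineViaStream(string $html, int $maxMemory = 2097152): ?string
{
    // Data beyond $maxMemory bytes is written to a temporary file by PHP.
    $stream = fopen("php://temp/maxmemory:{$maxMemory}", 'r+');
    fwrite($stream, $html);
    rewind($stream);

    try {
        // fgets reads one line at a time, so no array of lines is ever built.
        while (($line = fgets($stream)) !== false) {
            if (stripos(trim($line), '<meta name="robots"') === 0) {
                return trim($line);
            }
        }

        return null;
    } finally {
        fclose($stream);
    }
}

$found = findRobotsMetaLineViaStream(
    "<html>\n  <META NAME=\"robots\" content=\"noindex\">\n</html>"
);
```

Note that $html is still fully in memory while this runs; what the stream avoids is the extra array of several hundred thousand line strings that explode() creates.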

gopalkumar315 commented 4 years ago

You could increase the memory limit.

jerodev commented 4 years ago

Would it be possible to rewrite the class using resources instead of reading the file as a string?

willemwollebrants commented 4 years ago

I would use a stream in a generator:

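    protected function findRobotsMetaTagLine(string $html): ?string
    {
        // A closure rather than a nested named function: a named function
        // declared here would be redeclared (a fatal error) the second time
        // this method is called.
        $readLineFromStream = function (string $str) {
            $stream = fopen('php://memory', 'r+');
            fwrite($stream, $str);
            rewind($stream);

            // Yield one line at a time instead of building an array of lines.
            while (($line = fgets($stream)) !== false) {
                yield $line;
            }
        };

        foreach ($readLineFromStream($html) as $line) {
            if (strpos(strtolower(trim($line)), '<meta name="robots"') === 0) {
                return $line;
            }
        }

        return null;
    }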

mattiasgeniar commented 4 years ago

@willemwollebrants damn that's clever code, hadn't thought of that yet!

mattiasgeniar commented 4 years ago

Decided to fix this in the implementation of RobotsMeta instead of the actual class itself. Will close for now.

spatie-bot commented 4 years ago

Dear contributor,

because this issue seems to be inactive for quite some time now, I've automatically closed it. If you feel this issue deserves some attention from my human colleagues feel free to reopen it.