spatie / crawler

An easy to use, powerful crawler implemented in PHP. Can execute Javascript.
https://freek.dev/308-building-a-crawler-in-php
MIT License

Binary files are parsed, ignoring setParseableMimeTypes #369

Closed: jespejoh closed this issue 2 years ago

jespejoh commented 3 years ago

When setting both attributes, the crawler sometimes fails to ignore binary files (e.g. ZIP, video). I've pinned the error down to the __invoke() method of CrawlRequestFulfilled:

    // At this point $body is an empty string, so the script continues but fails at a later stage.
    $body = $this->getBody($response);

    $robots = new CrawlerRobots(
        $response->getHeaders(),
        $body,
        $this->crawler->mustRespectRobots()
    );

    $crawlUrl = $this->crawler->getCrawlQueue()->getUrlById($index);

    if ($this->crawler->mayExecuteJavaScript()) {
        $body = $this->getBodyAfterExecutingJavaScript($crawlUrl->url);

        $response = $response->withBody(stream_for($body));
    }

I've managed to solve this issue locally by just adding this at the beginning of the function, so the crawling for this URL stops if the mime type is not supported:

    $contentType = $response->getHeaderLine('Content-Type');  

    if (! $this->isMimetypeAllowedToParse($contentType)) {
        return '';
    }

Not sure if it has any unexpected consequences but so far it works as expected. What do you think?
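To make the idea behind the guard concrete, here is a minimal standalone sketch of a MIME type allow-list check. The function name `isAllowedMimeType`, its parameters, and the rule that an empty allow-list means "no restriction" are assumptions made for illustration; this is not the library's actual `isMimetypeAllowedToParse()` implementation.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (illustration only, not part of spatie/crawler).
 * Decides whether a response should be parsed, based on its Content-Type header.
 *
 * @param string   $contentTypeHeader Raw header value, e.g. "text/html; charset=UTF-8"
 * @param string[] $allowedMimeTypes  Allow-list, e.g. ['text/html']
 */
function isAllowedMimeType(string $contentTypeHeader, array $allowedMimeTypes): bool
{
    // Assumption: an empty allow-list means every type may be parsed.
    if ($allowedMimeTypes === []) {
        return true;
    }

    // Drop parameters such as "; charset=UTF-8" and normalise case and whitespace.
    $mimeType = strtolower(trim(explode(';', $contentTypeHeader, 2)[0]));

    // Normalise the allow-list the same way before comparing.
    $allowed = array_map(
        static fn (string $type): string => strtolower(trim($type)),
        $allowedMimeTypes
    );

    return in_array($mimeType, $allowed, true);
}

var_dump(isAllowedMimeType('text/html; charset=UTF-8', ['text/html'])); // bool(true)
var_dump(isAllowedMimeType('application/zip', ['text/html']));          // bool(false)
```

Running a check like this before the body is read means a ZIP or video response is rejected on its header alone, so the later parsing steps never see it.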

Happy to open a PR if you prefer it that way.

spatie-bot commented 2 years ago

Dear contributor,

because this issue seems to be inactive for quite some time now, I've automatically closed it. If you feel this issue deserves some attention from my human colleagues feel free to reopen it.

dominikfomicz commented 1 year ago

I have the same error.