spatie / crawler

An easy to use, powerful crawler implemented in PHP. Can execute Javascript.
https://freek.dev/308-building-a-crawler-in-php
MIT License

Make setParseableMimeTypes prevent URLs from being crawled #432

Closed arturzealves closed 1 year ago

arturzealves commented 1 year ago

Some context

I started using this package yesterday and ran into the same problem that was reported in issue https://github.com/spatie/crawler/issues/352.

This is probably not the best way to fix the issue, but I don't have enough knowledge or time to do it properly right now. If this pull request is not good enough, point me in the right direction or feel free to continue the work, so that we can get to a mergeable pull request sooner.

I didn't add new tests because I wasn't able to get the test server running.

Description

I check whether getBody returns an empty string and, if it does, return early. The main purpose is to avoid executing line#41 and line#50.

I was a bit afraid that the ArrayCrawlQueue might end up with pending URLs, because line#38 is no longer executed. However, I did some debugging while testing manually, and with this change there are no pending URLs at the end.
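To illustrate the idea outside the package, here is a minimal, self-contained sketch of the early return. It is not the package's actual code: getBodyIfParseable and handleFulfilled are hypothetical stand-ins, and it assumes that the body lookup yields an empty string when the response's MIME type is not in the parseable list (and that an empty list means every type is allowed).

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for the body lookup (assumption, not package code):
 * returns the raw body only when the MIME type is parseable, otherwise ''.
 */
function getBodyIfParseable(string $contentType, string $rawBody, array $parseableMimeTypes): string
{
    // Drop parameters such as "; charset=utf-8" before comparing.
    $mimeType = strtolower(trim(explode(';', $contentType)[0]));

    // Assumption: an empty list means every MIME type may be parsed.
    if ($parseableMimeTypes !== [] && ! in_array($mimeType, $parseableMimeTypes, true)) {
        return '';
    }

    return $rawBody;
}

/**
 * Hypothetical fulfilled-request handler. Returns true when the observer
 * callback was notified, false when the handler returned early.
 */
function handleFulfilled(
    string $url,
    string $contentType,
    string $rawBody,
    array $parseableMimeTypes,
    callable $onCrawled
): bool {
    $body = getBodyIfParseable($contentType, $rawBody, $parseableMimeTypes);

    // The early return proposed here: an empty body means there is nothing
    // to report as crawled and no links to extract.
    if ($body === '') {
        return false;
    }

    $onCrawled($url, $body);

    return true;
}

// Collect the URLs that reach the observer callback.
$crawled = [];
$log = function (string $url, string $body) use (&$crawled): void {
    $crawled[] = $url;
};

handleFulfilled('https://example.com/', 'text/html; charset=utf-8', '<html></html>', ['text/html'], $log);
handleFulfilled('https://example.com/file.pdf', 'application/pdf', '%PDF-1.7', ['text/html'], $log);

print_r($crawled);
```

In this sketch only the HTML page reaches the callback; the PDF response is skipped before anything is reported or parsed.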

How to reproduce

The code I was running on my computer when I found this issue comes from this page. You will have to uncomment the line that calls setParseableMimeTypes, of course. That example crawls https://www.lipsum.com and, with this fix, it no longer crawls URLs of .pdf or .gz files.

Screenshot 2022-10-19 at 23 20 30

As seen above, in my example there are no "crawled" log messages for URLs that point to files.

freekmurze commented 1 year ago

Thanks for your work on this. I'm considering merging this in, but the tests need to pass.

Could you rebase with the main branch to allow those composer plugins to run?

spatie-bot commented 1 year ago

Dear contributor,

because this pull request seems to be inactive for quite some time now, I've automatically closed it. If you feel this pull request deserves some attention from my human colleagues feel free to reopen it.

XAKEPEHOK commented 7 months ago

Why hasn't this been merged yet?