spatie / robots-txt

Determine if a page may be crawled from robots.txt, robots meta tags and robot headers
https://spatie.be/en/opensource/php
MIT License

Discussion: Would it make sense to replace file_get_contents with guzzle #40

Closed ivangrozni closed 1 year ago

ivangrozni commented 1 year ago

Hello,

we've encountered an issue with our robots.txt checks. For sites that are not live yet, we expect the checks to fail, but there is no way to set a timeout, and these checks take a really long time to finish (in the meantime our monitoring system OhDear fails to fetch health check results because the response time is around 30s). There is no nice way to lower the response time for only these checks (at least not one that I'm aware of).

I know there is a way to set a timeout on Guzzle, and it may be possible with other PHP HTTP clients as well, but Guzzle is the only one I'm familiar with. curl would be another option, since it allows setting a timeout too.
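For reference, this is roughly the kind of timeout control I mean, as a minimal sketch assuming Guzzle were used to fetch robots.txt instead of file_get_contents (the URL and timeout values are just placeholders):

```php
use GuzzleHttp\Client;
use GuzzleHttp\Exception\ConnectException;

// Sketch: fetch robots.txt with explicit timeouts instead of
// relying on file_get_contents' defaults.
$client = new Client([
    'connect_timeout' => 2, // seconds allowed for establishing the connection
    'timeout'         => 2, // seconds allowed for the whole request
]);

try {
    $robotsTxt = (string) $client
        ->get('https://example.com/robots.txt')
        ->getBody();
} catch (ConnectException $e) {
    // Site not reachable (yet): treat it as having no robots.txt.
    $robotsTxt = '';
}
```

A dependency-free alternative would be to keep file_get_contents but pass it a stream context created with stream_context_create(['http' => ['timeout' => 2]]), which also bounds how long the request may take.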

Regards Lio

freekmurze commented 1 year ago

I agree that this could/should be more flexible. Currently I have no time to handle this.

I'd be open to a PR that solves this.

ivangrozni commented 1 year ago

@freekmurze thank you for your quick response. I've played around with replacing file_get_contents with curl, but I was not able to make any progress due to https://github.com/curl/curl/issues/9272. No matter what timeout I set, all failed requests for robots.txt were taking nearly 5 seconds.
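For context, the curl-based replacement I tried looked roughly like the sketch below (illustrative millisecond timeouts; because of the linked curl issue, the name-resolution/connect phase can still block for longer than requested):

```php
// Sketch: fetch robots.txt with the PHP curl extension and explicit
// millisecond timeouts.
$handle = curl_init('https://example.com/robots.txt');

curl_setopt_array($handle, [
    CURLOPT_RETURNTRANSFER    => true,
    CURLOPT_FOLLOWLOCATION    => true,
    CURLOPT_CONNECTTIMEOUT_MS => 1000, // limit for the connect phase
    CURLOPT_TIMEOUT_MS        => 2000, // limit for the whole request
]);

$robotsTxt = curl_exec($handle);

if ($robotsTxt === false) {
    // Timed out or failed: fall back to an empty robots.txt.
    $robotsTxt = '';
}

curl_close($handle);
```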

spatie-bot commented 1 year ago

Dear contributor,

because this issue seems to be inactive for quite some time now, I've automatically closed it. If you feel this issue deserves some attention from my human colleagues feel free to reopen it.