spekulatius / PHPScraper

A universal web-util for PHP.
https://phpscraper.de
GNU General Public License v3.0
543 stars 74 forks source link

get http status code #161

Open eposjk opened 1 year ago

eposjk commented 1 year ago

How do I check if the scraped page is an error page? It would be helpful to have the http status for that - something like

$web->go('https://httpstat.us/404');
echo $web->status;     // prints 404

Just before posting, I found the solution myself:

$web->client->getResponse()->getStatusCode()

It would be great to have this added to the documentation. Or maybe add $web->status as shortcut?

eposjk commented 1 year ago

Thanks!

Concerning your idea: doesn't $web->is404 and others lead to bad (unsafe) code which just checks specific codes? better would be

if($web->isSuccess) {     // or $web->isOk ? or $web->is2xx ?
     // process data (was 200 OK or other 2xx)
} elseif($web->isServerError) {     // or $web->is5xx ?
     // repeat request later (was 500 Internal Server Error, 503 Service Unavailable, 504 Gateway Timeout or other 5xx)
} else {     // we might offer something like $web->isClientError or $web->is4xx - but the user should not forget to handle all error codes!
     // bad url (was 404 Not Found, 410 Gone, 451 Unavailable For Legal Reasons or other 4xx or unknown status codes)
}
spekulatius commented 1 year ago

Oh, sorry, I've removed it as I realized I need to think it through more and play with some code. Your idea with grouping it as is2xx, is4xx, is5xx is solving a question I've had: how to make an organized decision based on it. I'll open a PR to have something to talk about it shortly

spekulatius commented 1 year ago

I've add some ideas for status code related methods: https://github.com/spekulatius/PHPScraper/pull/162

Let me know what you think @eposjk :)

eposjk commented 1 year ago

Well - technically it works, but does it really make sense to add different functions which do the same? In addition, there isn't an is(NumericStatusCode) function for every common status code. So it gets confusing - the developer has to know that he could use isNotFound/is404 but not isGone/is410.

My personal opinion would be to only keep the is2xx/is4xx/is5xx (and maybe is3xx - if there is a way not to auto-follow redirects) - maybe renaming them to isSuccess(2xx)/isServerError(5xx)/isClientError(4xx) - who really needs to check for specific codes and knows what he's doing should do that with e.g. $web->statusCode===404

By the way: The way to check if any error occured would be !$web->isSuccess - just $web->isServerError || $web->isClientError is NOT enought! Web crawlers should be able to deal even with exotic HTTP status codes and treat them as errors (e.g. 9xx).

spekulatius commented 1 year ago

Well - technically it works, but does it really make sense to add different functions which do the same? In addition, there isn't an is(NumericStatusCode) function for every common status code. So it gets confusing - the developer has to know that he could use isNotFound/is404 but not isGone/is410.

Yeah, my first thought was to allow flexibility. But for keeping the interface simple falling back on a simple === check sounds good. I'll adjust it shortly.

My personal opinion would be to only keep the is2xx/is4xx/is5xx (and maybe is3xx - if there is a way not to auto-follow redirects) - maybe renaming them to isSuccess(2xx)/isServerError(5xx)/isClientError(4xx) - who really needs to check for specific codes and knows what he's doing should do that with e.g. $web->statusCode===404

Yeah, this would allow to break it down into categories for internal handling. As you've mentioned, detailed handling can be done on the statusCode itself.

By the way: The way to check if any error occured would be !$web->isSuccess - just $web->isServerError || $web->isClientError is NOT enought! Web crawlers should be able to deal even with exotic HTTP status codes and treat them as errors (e.g. 9xx).

The 600+ ranges seem to be used for various purposes atm. Some are caching-related and others are errors. What are you thinking how these should be handled?

eposjk commented 1 year ago

Its more complicated than I thought - I gave it a try: https://github.com/eposjk/PHPScraper/tree/status-codes

New insights:

see some demo code at https://github.com/eposjk/PHPScraper/blob/status-codes/demo.php

In addition, I would remove isClientError(), isServerError(), isNotFound() and isForbidden(). (At least isNotFound() I consider as harmful - people probably will use it to check if a page is intentional unavailable. They probably will forget to check for other status codes which also might be used for intentional unavailable content - 410 Gone, 401 Unauthorized and 403 Forbidden)

What is still missing:

spekulatius commented 1 year ago

Hey @eposjk

I've open a PR for this: #164. Let's work out some practical details with this.

Cheers, Peter