nazuke / SEOMacroscope

SEO Macroscope is a website scanning tool that checks your website for broken links, with technical SEO functionality, site scraping, Excel reporting, and more.
https://nazuke.github.io/SEOMacroscope/
GNU General Public License v3.0

Support for Cookies #31

Open stevenwaterman opened 5 years ago

stevenwaterman commented 5 years ago

It would be good if there were a way to set cookies on requests, to allow crawling of sites that require authentication.

Is there currently a way to do this, or is this feature planned?

nazuke commented 5 years ago

Many thanks for the suggestion @motherlymuppet,

So far, I have not planned to support crawling sites that require form-based logins. However, adding an option for this would very likely be reasonably straightforward.

One thing to bear in mind here is that crawling a site with this type of login may have unintended side-effects.

For example, if there are links that perform actions like "delete this page", or similar, then SEO Macroscope will merrily follow these links too.

This is also one of the reasons why Googlebot et al. will not crawl sites as a particular user.

nazuke commented 5 years ago

Hi @motherlymuppet, following up, I took a look at how Screaming Frog handles this situation.

They too include a dire warning about data loss when using forms-based logins.

Cookie support itself may be fine though.

Do you happen to have an example site that absolutely requires cookies to be set in order to crawl it properly, please?

many thanks

stevenwaterman commented 5 years ago

The program should only be sending GET requests, surely? In that case, there shouldn't be any effect on the site, provided it's configured properly and doesn't change state based on GET requests. I can see how that would be an issue for misconfigured sites though.

It'd be fine for it to be a very hidden option; it just seemed crazy that it wasn't there, when it seems like a fairly fundamental part of accessing/navigating a website.

The site I wanted to use it on was my own, and authentication was enabled due to the large amount of sensitive information on the site, which was like a knowledge base. I was attempting to crawl the site to reduce the amount of duplicated information and reorganize the site to be more natural to navigate. I don't have an example to hand that you could use for testing, sorry.

nazuke commented 5 years ago

Thanks @motherlymuppet, that feedback helps a lot.

This is one of those cases where things in the real world don't always match the specs: some websites will have regular links with side-effects that are potentially damaging to the user when clicked, generally because these sites always expect a human to be logged in, and not a robot that'll "click" everything it can get to on the page.

For example, SEO Macroscope would not know not to click this link:

<a href="/very/important/docs/delete/123">Delete this doc</a>

Under the hood, things are a little convoluted. The only HTTP methods used by the application are HEAD and GET.

In as many cases as possible, HEAD is used to probe a URL, with a subsequent GET where necessary.

You can see the rough flow that occurs for each fetched document here:

https://github.com/nazuke/SEOMacroscope/blob/master/SEOMacroscopeSeriesOne/src/MacroscopeDocument/MacroscopeDocument.cs

...in the public async Task<bool> Execute () method.

In fact, I just recently added an option to force GETs on web servers that don't service HEAD requests properly. The whole web is hack piled upon hack ;-)
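For illustration only, a rough sketch of that probe-then-fetch flow, with a flag standing in for the new force-GET option (this is not the actual application code; the names here are placeholders):

```csharp
using System.Net.Http;
using System.Threading.Tasks;

// Rough sketch only: probe with HEAD first, then fall back to GET when the
// server does not service HEAD properly, or when force-GET is switched on.
class ProbeSketch
{
  static readonly HttpClient Client = new HttpClient();

  public static async Task<HttpResponseMessage> Probe ( string Url, bool ForceGet )
  {
    if( !ForceGet )
    {
      var HeadRequest = new HttpRequestMessage( HttpMethod.Head, Url );
      HttpResponseMessage HeadResponse = await Client.SendAsync( HeadRequest );
      if( HeadResponse.IsSuccessStatusCode )
      {
        return ( HeadResponse );
      }
    }
    // Subsequent GET, where the HEAD was refused or skipped:
    return ( await Client.GetAsync( Url ) );
  }
}
```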

HTTP Basic Authentication should already work in most cases; but as I don't get as much time as I'd like to work on this, forms-based authentication has so far not been on my TODO list. Hm, I don't actually have a forms-based authentication website to test with at the moment either...
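For reference, the Basic Authentication case in generic .NET terms looks something like this (again, just an illustrative sketch, not the application's own code; the credentials are placeholders):

```csharp
using System.Net;
using System.Net.Http;

// Generic illustration of HTTP Basic Authentication with .NET's HttpClient;
// "username" and "password" are placeholder values.
var handler = new HttpClientHandler()
{
  Credentials = new NetworkCredential( "username", "password" )
};
var client = new HttpClient( handler );
// This client will answer 401 Basic Authentication challenges automatically.
```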

You make some great points though, and this will be something that I'll be taking a look at soon.

many thanks!

nazuke commented 5 years ago

Hi again @motherlymuppet,

At a quick glance, it appears that cookie support itself is reasonably trivial.

So, the next detail would be the login process itself.

Does your login form use a GET like this:

https://www.company.com/login?username=bob&password=secret

or a POST to an endpoint somewhat like this:

https://www.company.com/login

with the credentials in the body?

Whichever it is, this type of process would normally require the login page's URL and the credentials to be entered before the crawl takes place. Alternatively, a form field pattern would be required, with the credentials being prompted for during the crawl.

Either way, the login page would be requested first, in order for the resultant session cookie to be captured.
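As a rough sketch of what that could look like for the POST case (the endpoint and the form field names are assumptions for illustration, not anything SEO Macroscope currently does):

```csharp
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Sketch of a POST login that captures the resultant session cookie.
class LoginSketch
{
  public static async Task<HttpClient> LogIn ( string LoginUrl, string Username, string Password )
  {
    CookieContainer Cookies = new CookieContainer();
    var Handler = new HttpClientHandler() { CookieContainer = Cookies };
    var Client = new HttpClient( Handler );

    // The form field names are assumptions; a real form may differ.
    var Form = new FormUrlEncodedContent(
      new Dictionary<string, string>()
      {
        { "username", Username },
        { "password", Password }
      }
    );

    // Request the login page first; any Set-Cookie in the response lands in
    // the CookieContainer and rides along on subsequent crawl requests.
    await Client.PostAsync( LoginUrl, Form );

    return ( Client );
  }
}
```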

thanks!

stevenwaterman commented 5 years ago

It's a POST endpoint, but that shouldn't matter. What I had in mind was a simple text field in the options where you could paste the cookie. I don't expect SEO Macroscope to navigate me to the login page or guide me through it or anything like that, and I'd prefer it didn't, for security reasons.

I can use the login form myself in a web browser, then take the cookie from the developer menu. All you need to do then is provide the box to put the cookie into, and attach that cookie to all outgoing requests.

That would provide complete flexibility across all login methods, and anyone trying to solve this problem is probably advanced enough to go to the developer menu and grab a cookie.

I don't mean to be patronising if this is already obvious to you, but thought I'd give an example of what I mean:
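Something like this, roughly (a sketch only; the cookie value is a placeholder for whatever you copy from the browser's developer tools):

```csharp
using System.Net.Http;

// Sketch: attach a user-pasted Cookie header to every outgoing request.
// "session=abc123" stands in for the value copied from the browser.
var handler = new HttpClientHandler()
{
  // Disable automatic cookie handling so the pasted header is sent as-is.
  UseCookies = false
};
var client = new HttpClient( handler );
client.DefaultRequestHeaders.Add( "Cookie", "session=abc123" );
// Every request made with this client now carries the pasted cookie.
```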

benhadad commented 1 year ago

I have several websites that I own that require acceptance of cookie use; this agreement is the "form", but it gives the user no rights beyond access to the website. This is now a very common use case in the EU, and now in the US. I have just noticed that SEOMacroscope fails on these websites.