vezaynk / Sitemap-Generator-Crawler

PHP script to recursively crawl websites and generate a sitemap. Zero dependencies.
https://www.bbss.dev
MIT License
243 stars 93 forks source link

Ampersand encoding fails with multiple URL parameters #23

Closed Scott-G closed 7 years ago

Scott-G commented 7 years ago

When crawling a site with multiple URL parameters, the ampersand is incorectly recorded in the url leading to the page being recorded with ever deepening levels of 'amp;' in the url. Example output: Added: http://www.example.co.uk/pricelist.php?tid=132&tgid=24

vezaynk commented 7 years ago

Could you make my life slightly easier and give me your production url so I can test directly against it?

Scott-G commented 7 years ago

Sure: http://www.poshnailandbeauty.co.uk

Note: Currently using PHPSitemapNG without any problems but looking for something a little more lightweight and more recently updated.

vezaynk commented 7 years ago

I would be surprised if it's the issue but some of your a[href]s contain spaces. That is not valid html.

Scott-G commented 7 years ago

Yes, I am aware of those (they are SEO only and do not affect functionality but I find they rank better with spaces) but the problem occurs on links without them also.

vezaynk commented 7 years ago

I have found the crux of the issue. I am not quite understanding how hrefs are processed by browsers,

You have links like this: <a href="/pricelist.php?tid=132&amp;tgid=24" ....

The script doesn't beat around the bush and sends a request to http://www.poshnailandbeauty.co.uk/pricelist.php?tid=132&amp;tgid=24. Browsers do not do that for some reason. Instead they parse the &amp; and replace it with a &. The request they send is then to http://www.poshnailandbeauty.co.uk/pricelist.php?tid=132&tgid=24.

I need to dig up some documentation as to what is the correct behavior.

vezaynk commented 7 years ago

This guy seems to know what he is talking about: https://stackoverflow.com/a/3705601/4088472

Browsers parse it correctly, I do not. Patch coming in a few minutes!

Scott-G commented 7 years ago

As far as I am aware, ampersand should be encoded to &amp; when used in urls?

Edit: Ignore this, your above reply must have crossed this one.

vezaynk commented 7 years ago

So.. I think I got it right. Re-open if it is still an issue. Open another issue if it broke something else.

Scott-G commented 7 years ago

Fantastic, works great. I had to remove the space from the regex to allow for my technically non compliant urls but that isn't your issue. Thanks for the quick resolution!

vezaynk commented 7 years ago

Ah! I forgot about the space thing. Great to hear that you figured it out. Parsing html with regex has a tendency to summon Cthulhu so I avoid touching it.

I promise you though that that space was there for a reason and can potential break stuff when removed. I recommend that you compare your old sitemaps to mine to be safe.