Closed Scott-G closed 7 years ago
Could you make my life slightly easier and give me your production url so I can test directly against it?
Sure: http://www.poshnailandbeauty.co.uk
Note: Currently using PHPSitemapNG without any problems but looking for something a little more lightweight and more recently updated.
I would be surprised if it's the issue but some of your a[href]
s contain spaces. That is not valid html.
Yes, I am aware of those (they are SEO only and do not affect functionality but I find they rank better with spaces) but the problem occurs on links without them also.
I have found the crux of the issue. I am not quite understanding how hrefs are processed by browsers,
You have links like this: <a href="/pricelist.php?tid=132&tgid=24" ....
The script doesn't beat around the bush and sends a request to http://www.poshnailandbeauty.co.uk/pricelist.php?tid=132&tgid=24
. Browsers do not do that for some reason. Instead they parse the &
and replace it with a &
. The request they send is then to http://www.poshnailandbeauty.co.uk/pricelist.php?tid=132&tgid=24
.
I need to dig up some documentation as to what is the correct behavior.
This guy seems to know what he is talking about: https://stackoverflow.com/a/3705601/4088472
Browsers parse it correctly, I do not. Patch coming in a few minutes!
As far as I am aware, ampersand should be encoded to &
when used in urls?
Edit: Ignore this, your above reply must have crossed this one.
So.. I think I got it right. Re-open if it is still an issue. Open another issue if it broke something else.
Fantastic, works great. I had to remove the space from the regex to allow for my technically non compliant urls but that isn't your issue. Thanks for the quick resolution!
Ah! I forgot about the space thing. Great to hear that you figured it out. Parsing html with regex has a tendency to summon Cthulhu so I avoid touching it.
I promise you though that that space was there for a reason and can potential break stuff when removed. I recommend that you compare your old sitemaps to mine to be safe.
When crawling a site with multiple URL parameters, the ampersand is incorectly recorded in the url leading to the page being recorded with ever deepening levels of 'amp;' in the url. Example output:
Added: http://www.example.co.uk/pricelist.php?tid=132&amp;amp;amp;amp;amp;amp;tgid=24