wtf php
strtok() is weird.
strtok("foo#bar", "#"); //foo
strtok("#bar", "#"); //bar
strtok() saves the string, and when it's run twice (which it is), the second call will return the rest of the string after the token. So the real issue there is that strtok() is run twice. You will notice this if you replace the strtok() call with
preg_replace('/\#.*/', '', $href);
After that you can see that the crawler is converting an empty string to an absolute URL:
[!] Checking #generateimagesitemap
[!] Dropping pound.
[!]  is relative, convert to absolute
[!] Result: https://www.2globalnomads.info/web-design-websites/
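For illustration, the suggested preg_replace() handles both cases consistently, returning an empty string for a fragment-only href. A quick sketch, not the crawler's actual code:

// The regex strips the "#" and everything after it in both cases.
preg_replace('/\#.*/', '', 'foo#bar'); // "foo"
preg_replace('/\#.*/', '', '#bar');    // "" (empty string)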
Sounds good.
Would you care to make a PR with that?
That does not fix the issue; it just reveals it. I don't know why strtok() is run twice. I can change the function if you want, but it will not fix the underlying issue.
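A minimal sketch of the kind of guard that would address the underlying issue, namely the empty string being converted to an absolute URL; the surrounding loop is an assumption, not the crawler's actual code:

// Inside the crawl loop, after stripping the fragment:
$href = preg_replace('/\#.*/', '', $href);
if ($href === '') {
    continue; // fragment-only link such as "#anchor": nothing left to crawl
}
// ...otherwise carry on with the relative-to-absolute conversion.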
Why do you think it is run twice? I do not see that.
It messes up because strtok() isn't doing what I expected it to do, and that caused a bug.
It must be; that is the only way strtok() could act like it does.
strtok() acts like that because potato. Trying to come up with any other justification would come off as rude toward the PHP devs, something I'd rather avoid while writing under my own name.
This is how I intended things to happen:
$href = "https://example.com/thispage#anchor";
$href = strtok($href, "#"); //https://example.com/thispage
And that worked fine. Until internal URLs came along.
// What I expected
$href = "#anchor";
$href = strtok($href, "#"); // <emptystring>
// What I got
$href = "#anchor";
$href = strtok($href, "#"); // anchor
This is because strtok() seems to throw the first character in the garbage if it matches the delimiter: it skips leading delimiter characters instead of returning an empty first token. I did not expect this to happen, hence this bug.
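If an empty string is the result you actually want for a fragment-only URL, explode() preserves the leading empty token that strtok() discards. A sketch of that alternative, not the project's actual fix:

// explode() keeps empty tokens, so the part before a leading "#" is "".
$href = explode("#", "#anchor", 2)[0];    // ""
$href = explode("#", "foo#bar", 2)[0];    // "foo"
$href = explode("#", "nofragment", 2)[0]; // "nofragment" (no delimiter: whole string)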
Nothing here is running twice. However, your logic is perfectly sound, because if it were, hypothetically, run twice, we would have the exact same issue.
Your analysis of the symptom is correct, but the diagnosis is wrong. Thank you for your PR and the regex, which I will trust blindly because I do little more than pretend to understand them.
Thanks for figuring that out. To me strtok() seems to work just like it should; there was simply an issue in the program logic that did not take into account that there could be a URL with only a fragment and nothing else. Anyway, it seems that my regexp fix sorted that out.
The pound sign (or hash mark) introduces a fragment identifier, used in URLs to point to a specific location within an HTML page marked with an ID attribute. If you drop only the hash and try to crawl the URL like you do now, it will result in a broken URL, wasted execution time, and an entry in the web server's error log. The correct way to handle the hash is to ignore it: not just the hash sign, but also the rest of the URL following it.
https://www.2globalnomads.info/web-design-websites/#generateimagesitemap -> https://www.2globalnomads.info/web-design-websites/
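A self-contained sketch of that transformation; the helper name strip_fragment() is hypothetical, not part of the crawler:

// Return the URL with the fragment identifier ("#" and everything
// after it) removed; URLs without a fragment pass through unchanged.
function strip_fragment(string $url): string {
    $pos = strpos($url, '#');
    return $pos === false ? $url : substr($url, 0, $pos);
}

echo strip_fragment('https://www.2globalnomads.info/web-design-websites/#generateimagesitemap');
// https://www.2globalnomads.info/web-design-websites/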