vezaynk / Sitemap-Generator-Crawler

PHP script to recursively crawl websites and generate a sitemap. Zero dependencies.
https://www.bbss.dev
MIT License
243 stars 93 forks source link

Dropping pound #48

Closed ghost closed 7 years ago

ghost commented 7 years ago

[!] Checking #generateimagesitemap [!] Dropping pound. [!] generateimagesitemap is relative, convert to absolute [!] Result: https://www.2globalnomads.info/web-design-websites/generateimagesitemap

Pound (or hash mark) is a fragment identifier used in URLs to specify a specific location within an html page marked with an ID tag. If you drop out hash and try to crawl the URL like you do now, it will result a broken URL, wasted execution time and an entry in web servers error log. The correct way to handle hash is to ignore it, not just the hash sign but also the rest of the URL following hash.

https://www.2globalnomads.info/web-design-websites/#generateimagesitemap -> https://www.2globalnomads.info/web-design-websites/

vezaynk commented 7 years ago

wtf php

strtok() is weird.

strtok("foo#bar", "#"); //foo
strtok("#bar", "#"); //bar
ghost commented 7 years ago

strok() saves the string and when it's run twice (which it is), it will return the rest of the string after the token. So the real issue there is running the strtok() twice. You will notice that if you replace strtok() with preg_replace('/\#.*/', '', $href);

After that you see that the crawler is converting an empty string to absolute:

[!] Checking #generateimagesitemap [!] Dropping pound. [!] is relative, convert to absolute [!] Result: https://www.2globalnomads.info/web-design-websites/

vezaynk commented 7 years ago

Sounds good.

Would you care to make a PR with that?

ghost commented 7 years ago

That does not fix the issue, it just revealed it. I don't know why strtok() is run twice. I can change the function if you want, but it will not fix the underlying issue.

vezaynk commented 7 years ago

why do you think it is ran twice? I do not see that.

vezaynk commented 7 years ago

It messes up because strtok isn't doing what I expected to do and caused a bug.

ghost commented 7 years ago

It must, that is the only way why strtok() can act like it does.

vezaynk commented 7 years ago

strtok() acts like that because potato. Trying to come up with any other justification will come off as vulgar against the PHP devs, something that I rather avoid while writing under my own name.

The is how I intended for things to happen:

$href = "https://example.com/thispage#anchor";
$href = strtok($href, "#"); //https://example.com/thispage

And that worked fine. Until internal urls came along.

// What I expected
$href = "#anchor";
$href = strtok($href, "#"); // <emptystring>

// What I got
$href = "#anchor";
$href = strtok($href, "#"); // anchor

This is because strtok() seems to ignore the throw the first character in the garbage if it matches the delimiter. I did not expect this to happen, hence this bug.

Nothing here is running twice. However you have perfectly good login because if it were, hypothetically to run twice, we would have the exact same issue.

Your analyses of the symptom is correct, but the diagnostic is wrong. Thank you for your PR and regex which I will trust blindly because I do little more than pretend to understand them.

ghost commented 7 years ago

Thanks for figuring that out. To me it seems to work just like it should and there was just an issue in the the program logic that did not take in account that there could be an URL with only a fragment and nothing else. Anyway, seems that my regexp fix sorted that out.