vezaynk / Sitemap-Generator-Crawler

PHP script to recursively crawl websites and generate a sitemap. Zero dependencies.
https://www.bbss.dev
MIT License
243 stars 93 forks

Fixed issue 82 - added noindex functionality #83

Open wcmohler opened 5 years ago

wcmohler commented 5 years ago

I also updated the priority functionality to tie the priority level to the depth, like other sitemap generators.
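As a rough illustration of tying priority to depth, here is a minimal sketch. The helper name `priority_for_depth`, the 0.2 step per level, and the 0.1 floor are assumptions made for this example and are not taken from the PR. The sitemap protocol only requires that `<priority>` fall between 0.0 and 1.0.

```php
<?php
// Hypothetical helper: maps crawl depth to a sitemap <priority> value.
// The step (0.2) and floor (0.1) are illustrative assumptions, not the PR's values.
function priority_for_depth(int $depth, float $step = 0.2, float $floor = 0.1): string
{
    // Depth 0 (the start page) gets 1.0; each level deeper lowers the priority.
    $priority = max($floor, 1.0 - $step * $depth);

    // Format with one decimal place, independent of locale.
    return number_format($priority, 1, '.', '');
}

foreach ([0, 1, 2, 5] as $depth) {
    echo $depth, ' => ', priority_for_depth($depth), PHP_EOL;
}
```

With these assumed defaults, depths 0, 1, 2 and 5 map to 1.0, 0.8, 0.6 and 0.1 respectively.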

Added a space in the "Sitemap has been generated in" logger part of sitemap.php

wcmohler commented 5 years ago

I think I got this stuff resolved. Tested a couple of use cases and it seemed to work.

wcmohler commented 5 years ago

These are all great ideas, and thanks for the "welcome to contributing" note. :) It'll take a little bit of time for me to think about them and get them into place. Job priorities interfere with this fun stuff sometimes.

mylselgan commented 4 years ago

@wcmohler

It works as expected, but it should still follow the links on noindex pages and add their targets to the sitemap.xml file.

Example: Page A has a "noindex" meta tag and links to Page B and Page C. Pages B and C do not have a "noindex" meta tag.

Result: Page A should be omitted, but Pages B and C should be added to the sitemap.xml file.

sidcha commented 4 years ago

@wcmohler @knyzorg what is the status of this PR? Do let me know if any work is needed to close this. I have some bandwidth to spare.

@mylselgan, what you want to achieve is fairly simple: instead of `return $depth--;`, set a flag `$is_noindex_url`, skip the `$map_row` build/write sequence, and let the rest of the method execute. I created a patch to demonstrate this; see https://gist.github.com/cbsiddharth/27902a169a0f72a27d549653d1a3c47b. Beware that I haven't tested this change. Let me know if you run into any issues.
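The flag idea can be sketched in isolation as follows. This is not the linked patch and not the crawler's real code: `scan_url`, the in-memory `$site` array, and the page paths are hypothetical stand-ins, with only the `$is_noindex_url` name borrowed from the comment above.

```php
<?php
// Hypothetical in-memory "site": URL => noindex flag and outgoing links.
$site = [
    '/a' => ['noindex' => true,  'links' => ['/b', '/c']],
    '/b' => ['noindex' => false, 'links' => []],
    '/c' => ['noindex' => false, 'links' => ['/a']],
];

// Simplified stand-in for the crawler's scan routine (names are illustrative).
function scan_url(string $url, array $site, array &$scanned, array &$sitemap): void
{
    if (isset($scanned[$url]) || !isset($site[$url])) {
        return; // already visited, or not a known page
    }
    $scanned[$url] = true;

    // Set a flag instead of returning early.
    $is_noindex_url = $site[$url]['noindex'];
    if (!$is_noindex_url) {
        $sitemap[] = $url; // only indexable pages get a sitemap row
    }

    // Links are followed regardless of the flag.
    foreach ($site[$url]['links'] as $link) {
        scan_url($link, $site, $scanned, $sitemap);
    }
}

$scanned = [];
$sitemap = [];
scan_url('/a', $site, $scanned, $sitemap);
print_r($sitemap);
```

Starting from `/a`, the resulting `$sitemap` contains `/b` and `/c` but not `/a`, which matches the behaviour requested earlier in the thread.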

mylselgan commented 4 years ago

@cbsiddharth your patch works well. This PR now skips "noindex" pages and follows links from "noindex" pages. Thank you all.

vezaynk commented 3 years ago

Hi @sidcha,

Sorry, I've been really busy with life, work and whatever else lately. GitHub sends notifications to the wrong e-mail for this repository, so I end up missing them. It's been nearly a year, so I doubt you're still looking for an answer, but for anyone who wanders here: the status of the PR is incomplete. It does what it sets out to do but still has unaddressed comments.

I would love for someone to pick up this PR (fork it), finish it up and send it in for merging. There is also a lot of room for improvement in how the noindex pages are both detected and processed. I would like to see something on that front before merging it in.
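One possible direction for the detection side, offered only as a sketch: check both the `X-Robots-Tag` response header and the robots meta tag, parsing the HTML with a DOM parser. The function name `is_noindex` and its signature are hypothetical and do not reflect how the PR currently detects noindex pages. It also assumes PHP's bundled DOM extension is enabled.

```php
<?php
// Hypothetical helper, not the PR's code: reports whether a response asks
// crawlers not to index the page. Assumes the bundled DOM extension is enabled.
function is_noindex(string $html, array $headers = []): bool
{
    // 1. X-Robots-Tag response header, e.g. "X-Robots-Tag: noindex, nofollow".
    foreach ($headers as $header) {
        if (stripos($header, 'x-robots-tag:') === 0
            && stripos($header, 'noindex') !== false) {
            return true;
        }
    }

    if (trim($html) === '') {
        return false; // nothing to parse
    }

    // 2. <meta name="robots" content="..."> in the document.
    $previous = libxml_use_internal_errors(true); // tolerate real-world markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower(trim($meta->getAttribute('name')));
        $directives = array_map('trim', explode(',', strtolower($meta->getAttribute('content'))));
        // "none" is shorthand for "noindex, nofollow".
        if ($name === 'robots'
            && (in_array('noindex', $directives, true) || in_array('none', $directives, true))) {
            return true;
        }
    }

    return false;
}

var_dump(is_noindex('<html><head><meta name="robots" content="NOINDEX, follow"></head></html>'));
var_dump(is_noindex('<html><head><title>ok</title></head></html>'));
var_dump(is_noindex('<html></html>', ['X-Robots-Tag: noindex']));
```

Splitting the `content` attribute into individual directives avoids false positives from substring matches, and the header check covers non-HTML responses that carry no meta tag at all.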