vezaynk / Sitemap-Generator-Crawler

PHP script to recursively crawl websites and generate a sitemap. Zero dependencies.
https://www.bbss.dev
MIT License

"noindex" URLs are listed in sitemap #82

Open stephanros opened 5 years ago

stephanros commented 5 years ago

My sitemap contains a lot of URLs that have a "noindex" meta tag, so Webmaster Tools sends me an alert.

stephanros commented 5 years ago

Hi,

Here is a sample to explain my issue. My sitemap, https://buen-polvo.es/sitemap.xml, lists URLs such as "https://buen-polvo.es/miembro_motera1_2408242.html". But if you open the source code of that URL, you can see a meta tag for bots with "noindex".

This is inconsistent for Google: the sitemap asks Googlebot to index this URL, but when Googlebot analyses the URL, it sees a noindex, so it can't index it.

I hope it's more clear now. Regards

vezaynk commented 5 years ago

Thanks, this is very helpful. I'm really short on time lately and I'm unlikely to be able to address this until late April.

Hopefully you'll manage until then.

stephanros commented 5 years ago

Thanks. I'll try to manage while waiting for your update.

Regards.

stephanros commented 5 years ago

Hi,

On my side, I created a script that cleans out the bad URLs, but it's very slow, so I never update my sitemap. Have you had time to look at this noindex problem?

Best regards.

vezaynk commented 5 years ago

I'll get on it in a few days.

vezaynk commented 5 years ago

Aaaand I'm done with final exams. Expect the patch this week 😎

stephanros commented 5 years ago

Greaaaat 👍 I look forward to testing it :)

stephanros commented 5 years ago

I'm sorry, but I can't find your patch.

vezaynk commented 5 years ago

It's a work in progress. I thought this would be easier. A major problem is that with links I only need to match a single attribute (href), whereas with meta tags I need to match both the name and the content. It's tricky to get right.
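To illustrate the two-attribute match described above, here is a minimal sketch using an HTML parser instead of pattern matching. It is written in Python purely for illustration and is not the project's PHP code; the names `RobotsMetaParser` and `has_noindex` are hypothetical. It only looks at `name="robots"` and ignores crawler-specific names and HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        # Both attributes must match: name="robots" and a content list.
        if attributes.get("name", "").strip().lower() != "robots":
            return
        for directive in attributes.get("content", "").split(","):
            self.directives.add(directive.strip().lower())


def has_noindex(html):
    """Return True if the page carries a robots meta tag containing noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

Because the parser hands over the attributes as a set of pairs, attribute order, quoting style, and letter case stop mattering, which is exactly the part that is awkward to express as a single pattern.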

vezaynk commented 5 years ago

A cheap fix that you can apply yourself is to simply check whether the meta tag string is present in the HTML by hard-coding the check here: https://github.com/knyzorg/Sitemap-Generator-Crawler/blob/0b89cd5f53b02472d33131a2ebb62396003bf8df/sitemap.functions.php#L367

But my regular expression skills are somewhat rusty and regular expressions were never meant to parse html.
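A hard-coded check of this kind could look roughly like the sketch below. It is Python for illustration only, not the code at the linked line, and both `NOINDEX_TAG` and `contains_noindex_tag` are hypothetical names. The literal tag string is an assumption and has to be adjusted to the exact markup the site emits.

```python
# Assumed markup; the closing quote is left off so "noindex, follow" also matches.
NOINDEX_TAG = '<meta name="robots" content="noindex'


def contains_noindex_tag(html):
    """Plain substring test: True if the hard-coded tag prefix occurs in the page."""
    return NOINDEX_TAG in html.lower()
```

The check is fragile by design: a page that writes the attributes in the other order, or uses single quotes, slips through, so it only works when the site's templates always emit the tag the same way.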

The entire project was written back when PHP installations had finicky support for parsing HTML natively, and it should have become unnecessary with the release of PHP 7... yet here we are. I will eventually rewrite it as a binary with a proper HTML parser and deprecate the project.

vezaynk commented 5 years ago

@wcmohler is working on it in #83.

mylselgan commented 4 years ago

@knyzorg Pull request #83 works as expected, but it should still follow the links on noindex pages and add them to the sitemap.xml file.

Example:
Page A has a "noindex" meta tag.
Page A links to Page B and Page C.
Page B and Page C don't have a "noindex" meta tag.

Result: Page A should be omitted but Page B and C should be added to the sitemap.xml file.
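The behaviour requested here can be sketched as a crawl loop that separates "list this page" from "follow its links". This is a Python illustration under stated assumptions, not the code from #83: `build_sitemap` is a hypothetical name, and `fetch_page` is an assumed callback that returns whether a URL is noindex together with the links found on it.

```python
from collections import deque


def build_sitemap(start_url, fetch_page):
    """Breadth-first crawl; fetch_page(url) must return (is_noindex, links)."""
    seen = {start_url}
    queue = deque([start_url])
    sitemap = []
    while queue:
        url = queue.popleft()
        is_noindex, links = fetch_page(url)
        # A noindex page is left out of the sitemap...
        if not is_noindex:
            sitemap.append(url)
        # ...but its links are still queued, so pages behind it are reached.
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return sitemap


# In-memory stand-in for the three pages in the example above.
site = {
    "A": (True, ["B", "C"]),
    "B": (False, ["A"]),
    "C": (False, []),
}
print(build_sitemap("A", site.__getitem__))
```

The sketch treats every noindex page as followable; a page that also declares nofollow would need a separate flag so its links are skipped.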