Open stephanros opened 5 years ago
Hi,
Here is an example to explain my issue. My sitemap, https://buen-polvo.es/sitemap.xml, lists URLs such as "https://buen-polvo.es/miembro_motera1_2408242.html". But if you open the source code of that page, you can see a robots meta tag with "noindex".
This is inconsistent for Google: the sitemap asks Googlebot to index this URL, but when Googlebot analyses the page, it sees a noindex and so can't index it.
I hope it's clearer now. Regards
Thanks, this is very helpful. I'm really short on time lately and I'm unlikely to be able to address this until late April.
Hopefully you'll manage until then.
Thanks. I'll try to manage while waiting for your update.
Regards.
Hi,
On my side, I created a script that cleans out the bad URLs, but it's very slow, so I never update my sitemap. Did you have time to look at this noindex problem?
Best regards.
I'll get on it in a few days.
Aaaand I'm done with final exams. Expect the patch this week :sunglasses:
Greaaaat 👍 I look forward to testing it :)
I'm sorry, but I can't find your patch.
It's a work in progress. I thought this would be easier. A major problem is that with links I only need to match a single attribute (href), whereas with meta tags I need to match both the name and the content. It's tricky to get right.
A cheap fix that you can apply yourself is to simply check whether the meta tag string is present in the HTML by hard-coding the check here: https://github.com/knyzorg/Sitemap-Generator-Crawler/blob/0b89cd5f53b02472d33131a2ebb62396003bf8df/sitemap.functions.php#L367
But my regular expression skills are somewhat rusty, and regular expressions were never meant to parse HTML.
The entire project was written back when PHP installations had finicky support for parsing HTML natively, and it should have become unnecessary with the release of PHP7... yet here we are. I will eventually rewrite it as a binary with a proper HTML parser and deprecate the project.
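To illustrate why a real HTML parser makes the name-plus-content match easier than a regular expression, here is a minimal sketch in Python using only the standard library. It is not the project's code: the class `RobotsMetaParser` and the function `has_noindex` are hypothetical names, and treating `googlebot` and the `none` directive as equivalent to a robots `noindex` is an assumption of this sketch.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Hypothetical helper: records whether a robots meta tag asks for noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        # Attribute order no longer matters once the tag is parsed into pairs.
        attr = {name: (value or "") for name, value in attrs}
        # Assumption: "googlebot" is treated the same as "robots".
        if attr.get("name", "").strip().lower() not in ("robots", "googlebot"):
            return
        directives = {d.strip().lower() for d in attr.get("content", "").split(",")}
        # Assumption: "none" is shorthand for "noindex, nofollow".
        if "noindex" in directives or "none" in directives:
            self.noindex = True


def has_noindex(html):
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex
```

Because the parser hands over the attributes as pairs, the check is the same whether `name` comes before or after `content`, and the word "noindex" appearing in body text or in an unrelated meta tag is ignored.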
@wcmohler is working on it in #83.
@knyzorg pull request #83 works as expected, but it should still follow the links on noindex pages and add them to the sitemap.xml file.
Example: page A has a "noindex" meta tag and links to page B and page C. Pages B and C don't have a "noindex" meta tag.
Result: page A should be omitted, but pages B and C should be added to the sitemap.xml file.
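The requested behaviour can be sketched as a breadth-first crawl that always follows links but only records pages that are not marked noindex. This is a rough illustration in Python, not the crawler's actual code: `build_sitemap` is a hypothetical function, the in-memory `pages` dictionary stands in for fetching and parsing real pages, and `nofollow` is deliberately not handled.

```python
from collections import deque


def build_sitemap(start, pages):
    """Crawl from `start`; return indexable URLs in the order they are visited.

    `pages` maps each URL to a (noindex, links) tuple and stands in for
    fetching and parsing real pages (an assumption of this sketch).
    """
    seen = {start}
    queue = deque([start])
    sitemap = []
    while queue:
        url = queue.popleft()
        if url not in pages:
            continue  # unknown or unreachable page: skip it
        noindex, links = pages[url]
        # A noindex page is left out of the sitemap...
        if not noindex:
            sitemap.append(url)
        # ...but its links are still followed.
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return sitemap


# The example from the comment above: A is noindex and links to B and C.
pages = {
    "A": (True, ["B", "C"]),
    "B": (False, []),
    "C": (False, ["A"]),
}
result = build_sitemap("A", pages)
```

With this input, `result` contains B and C but not A, which matches the expected outcome described above.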
My sitemap contains a lot of URLs that have a "noindex" meta tag, so Webmaster Tools sends me an alert.