johnbe4 / getSeoSitemap

PHP library to generate a sitemap. It crawls a whole website, checking all internal and external links, and performs Search Engine Optimization checks.

getSeoSitemap v5.0.0 | 2023-02-27

PHP library to generate a sitemap.
It crawls a whole domain, checking all URLs.
It performs Search Engine Optimization checks only on the URLs included in the sitemap.

Please support this project by making a donation via PayPal or via bitcoin (BTC) to the address 19928gKpqdyN6CHUh4Tae1GW9NAMT6SfQH

Warning

Before moving from releases lower than 4.1.1 to 4.1.1 or higher, you must drop the getSeoSitemap and getSeoSitemapExec tables from your database.

Overview
This script creates a full gzip sitemap, or multiple gzip sitemaps plus a gzip sitemap index.
It includes change frequency, last modification date and priority, set according to your own rules.
Change frequency is selected automatically among daily, weekly, monthly and yearly.
The maximum URL length is 767 characters; otherwise the script will fail.
The maximum page size is 16777215 bytes; otherwise the script will fail.
URLs with an HTTP response code other than 200 or with size = 0 will not be included in the sitemap.
It checks all internal and external links inside HTML pages and JS sources (href URLs in 'a' tags, plus form action URLs if the method is get).
It checks all internal and external sources.
Mailto URLs will not be included in the sitemap.
URLs inside PDF files will not be scanned and will not be included in the sitemap.
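
The inclusion rules listed above can be summarized in a small predicate. This is only a minimal sketch under assumptions: the function name isSitemapCandidate, the constants, and the choice of exception are hypothetical and are not taken from the getSeoSitemap source. Only the numeric limits and the rules themselves come from the overview.

```php
<?php
// Hypothetical helper, NOT part of getSeoSitemap: it only mirrors the rules in the overview.
const MAX_URL_LENGTH = 767;       // maximum URL length stated in the overview (characters)
const MAX_PAGE_SIZE  = 16777215;  // maximum page size stated in the overview (bytes)

function isSitemapCandidate(string $url, int $httpCode, int $size): bool
{
    // The overview says the script fails when a limit is exceeded;
    // this sketch models that failure with an exception (an assumption).
    if (strlen($url) > MAX_URL_LENGTH || $size > MAX_PAGE_SIZE) {
        throw new LengthException("Limit exceeded for $url");
    }
    // Mailto URLs are never included in the sitemap.
    if (stripos($url, 'mailto:') === 0) {
        return false;
    }
    // Only URLs answering 200 with a non-empty body are included.
    return $httpCode === 200 && $size > 0;
}
```

For example, isSitemapCandidate('https://example.com/missing', 404, 512) returns false, because the response code is not 200.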

getSeoSitemapBot is a crawler like Googlebot and it does not execute JavaScript.
That means it does not follow URLs created by JavaScript.
On https://support.google.com/webmasters/answer/2409684?hl=en Google says:
".....
Some features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash can make it difficult for search engines to crawl your site.
Check the following:
Use a text browser such as Lynx to examine your site, since many search engines see your site much as Lynx would.
If features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site.
....."

To improve SEO following robots.txt rules of "User-agent: *", it checks:

You can use absolute or relative URLs inside the site.
This script automatically determines which URLs to skip and which to allow in the sitemap, following the robots.txt rules of "User-agent: *" and the robots tag in the page head.
There is no automatic function to submit the updated sitemap to search engines.
The sitemap will be saved in the main directory of the domain.
It rewrites robots.txt, adding the updated sitemap information.
The maximum number of URLs that can be inserted into the sitemap is 2.5T.
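
To illustrate how a large URL list can end up as multiple gzip sitemaps plus an index, here is a minimal sketch. It assumes PHP's zlib extension and uses the 50,000 URLs per file limit of the sitemaps.org protocol as the chunk size. The function names and the file naming scheme (sitemap1.xml.gz, sitemap2.xml.gz, ...) are assumptions for illustration, not the names getSeoSitemap actually produces.

```php
<?php
// Illustrative sketch only, NOT the getSeoSitemap implementation.

/** Build one urlset document from a list of absolute URLs. */
function buildUrlset(array $urls): string
{
    $xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
         . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($urls as $url) {
        $loc = htmlspecialchars($url, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        $xml .= '  <url><loc>' . $loc . '</loc></url>' . "\n";
    }
    return $xml . '</urlset>' . "\n";
}

/** Split URLs into gzip-compressed sitemaps, keyed by an assumed file name. */
function buildGzipSitemaps(array $urls, int $perFile = 50000): array
{
    $files = [];
    foreach (array_chunk($urls, $perFile) as $i => $chunk) {
        $files['sitemap' . ($i + 1) . '.xml.gz'] = gzencode(buildUrlset($chunk));
    }
    return $files;
}

/** Build the sitemap index that points at every generated file. */
function buildSitemapIndex(array $fileNames, string $baseUrl): string
{
    $xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
         . '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($fileNames as $name) {
        $loc = htmlspecialchars(rtrim($baseUrl, '/') . '/' . $name, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        $xml .= '  <sitemap><loc>' . $loc . '</loc></sitemap>' . "\n";
    }
    return $xml . '</sitemapindex>' . "\n";
}
```

A caller would write each returned string to the main directory with file_put_contents and gzip the index with gzencode in the same way.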

Other main features:

Using getSeoSitemap, you will be able to give your clients a better browsing experience.

Requirements

Instructions
1 - copy the getSeoSitemap folder to a protected area of your server.
2 - set all user parameters in config.php.
3 - in your server crontab, schedule the script once a day, preferably when your server is not too busy.
A command line example to schedule the script every day at 7:45:00 AM is:
45 7 * * * php /example/example/example/example/example/getSeoSitemap/getSeoSitemap.php
Once you know how long the whole script takes to execute, you could add a crontab timeout.

Warning
Do not save any file whose name starts with sitemap in the main directory, otherwise the getSeoSitemap script could delete it.
The robots.txt file must be present in the main directory of the site, otherwise getSeoSitemap will fail.
In case of FPM timeout errors, you should fix them by setting pm.process_idle_timeout to 30s or higher.
To run getSeoSitemap faster, if you use a script like Geoplugin, you should exclude the getSeoSitemapBot user-agent from it.