vezaynk / Sitemap-Generator-Crawler

PHP script to recursively crawl websites and generate a sitemap. Zero dependencies.
https://www.bbss.dev
MIT License

Blacklist not working #66

Closed · Kristiansky closed 6 years ago

Kristiansky commented 6 years ago

I have added a few pages from my site to the blacklist array, but despite that, they appear in the sitemap every time.

Kristiansky commented 6 years ago

```php
$blacklist = array(
    "/de/",
    "/de/*",
    "/private/",
    "/private/*",
    "*.jpg",
    "*.png",
);
```

This is my blacklist array. When I open the XML file, the blacklisted pages are still in it:

```xml
<url>
  <loc>https://www.mywebsite.com/de</loc>
  <changefreq>daily</changefreq>
  <priority>1</priority>
</url>
<url>
  <loc>https://www.mywebsite.com/de/sonnenschirme</loc>
  <changefreq>daily</changefreq>
  <priority>1</priority>
</url>
```

vezaynk commented 6 years ago

`$blacklist` needs absolute URLs. For `/de/` you would want either `https://website.com/de/` or `*/de/`.
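For illustration, wildcard matching like this is commonly done with PHP's `fnmatch()`. A minimal sketch of the idea; the helper name is hypothetical, not necessarily how the script implements it:

```php
<?php
// Hypothetical helper, not the script's actual code: each pattern is
// matched against the full absolute URL, so a bare "/de/" never matches.
function is_blacklisted(array $blacklist, string $url): bool {
    foreach ($blacklist as $pattern) {
        // Without FNM_PATHNAME, fnmatch()'s "*" also matches slashes,
        // so "*/de/*" matches any URL containing "/de/".
        if (fnmatch($pattern, $url)) {
            return true;
        }
    }
    return false;
}

var_dump(is_blacklisted(["/de/"],   "https://website.com/de/sonnenschirme")); // bool(false)
var_dump(is_blacklisted(["*/de/*"], "https://website.com/de/sonnenschirme")); // bool(true)
```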

Kristiansky commented 6 years ago

I have changed the array as you told me:

```php
$blacklist = array(
    "https://www.mywebsite.com/de/",
    "https://www.mywebsite.com/de/*",
    "https://www.mywebsite.com/private/",
    "https://www.mywebsite.com/private/*",
    "*.jpg",
    "*.png",
);
```

But the /de links still appear in the sitemap. 😞

vezaynk commented 6 years ago

Post the full config, I'll take a look.

Kristiansky commented 6 years ago

```php
<?php
/*
Sitemap Generator by Slava Knyazev. Further acknowledgements in the README.md file.

Website: https://www.knyz.org/
I also live on GitHub: https://github.com/knyzorg
Contact me: Slava@KNYZ.org
*/

// Make sure to use the latest revision by downloading from GitHub: https://github.com/knyzorg/Sitemap-Generator-Crawler

/* Usage
Usage is pretty straightforward:
- Configure the crawler by editing this file.
- Select the file to which the sitemap will be saved
- Select URL to crawl
- Configure blacklists; wildcards are accepted (example: http://example.com/private/* and *.jpg)
- Generate sitemap
- Either send a GET request to this script or run it from the command line (refer to the README file)
- Submit to Google
- Set up a cron job to execute this script every so often

It is recommended you don't remove the above for future reference.
*/

// Default site to crawl
$site = "https://www.may-online.com/en";

// Default sitemap filename
$file = "../sitemap-generated.xml";
$permissions = 0644;

// Depth of the crawl, 0 is unlimited
$max_depth = 0;

// Show changefreq
$enable_frequency = true;

// Show priority
$enable_priority = true;

// Default values for changefreq and priority
$freq = "daily";
$priority = "1";

// Add lastmod based on server response. Unreliable and disabled by default.
$enable_modified = false;

// Disable this for a misconfigured but tolerable SSL server.
$curl_validate_certificate = true;

// These pages will be excluded from the crawl and sitemap.
// Use for excluding non-HTML files to increase performance and save bandwidth.
$blacklist = array(
    "https://www.may-online.com/de/",
    "https://www.may-online.com/de/*",
    "https://www.may-online.com/private/",
    "https://www.may-online.com/private/*",
    "*.jpg",
    "*.png",
);

// Enable this if your site requires GET arguments to function
$ignore_arguments = false;

// Not yet implemented. See issue #19 for more information.
$index_img = false;

// Index PDFs
$index_pdf = true;

// Set the user agent for crawler
$crawler_user_agent = "Mozilla/5.0 (compatible; Sitemap Generator Crawler; +https://github.com/knyzorg/Sitemap-Generator-Crawler)";

// Header of the sitemap.xml
$xmlheader ='<?xml version="1.0" encoding="UTF-8"?>
<urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">';

// Optionally configure debug options
$debug = array(
    "add" => true,
    "reject" => true,
    "warn" => true
);

// Modify only if the configuration version is broken
$version_config = 2;
```

Kristiansky commented 6 years ago

Here's what's generated:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>https://www.may-online.com/de</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/sonnenschirme</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/impressum</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/agb</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/datenschutz</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/sonnenschirme/restaurant-cafe</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/sonnenschirme/ampelschirme</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/sonnenschirme/ampelschirme/mezzo</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
  <url>
    <loc>https://www.may-online.com/de/unternehmen/referenzen</loc>
    <changefreq>daily</changefreq>
    <priority>1</priority>
  </url>
</urlset>
```

vezaynk commented 6 years ago

Found the issue. The blacklist seems to be ignored when the page is reached via a redirect. Oh, the pleasures of parsing the web!

By the way, to format code blocks, use 3 backticks. Also: the initial `$site` is trusted and is never checked against the blacklist.

The crawler is also refusing to go to the /en site; the URL reformatter chokes on it, probably related to the redirection.

FYI, redirecting from the root is bad practice.
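A minimal sketch of the kind of check involved, assuming the crawler fetches pages with cURL and follows redirects via `CURLOPT_FOLLOWLOCATION`; the helper names are illustrative, not the script's actual code:

```php
<?php
// Hypothetical sketch: after following redirects, re-check the final
// (effective) URL against the blacklist before adding it to the sitemap.
function is_blacklisted(array $blacklist, string $url): bool {
    foreach ($blacklist as $pattern) {
        if (fnmatch($pattern, $url)) {
            return true;
        }
    }
    return false;
}

$blacklist = ["*/de/*"];
$url = "https://www.may-online.com/"; // not blacklisted itself

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);

// The URL cURL actually landed on after any redirects:
$effective_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);

// Checking only the original $url lets a redirect smuggle a blacklisted
// page into the sitemap; the effective URL must be checked as well.
if (is_blacklisted($blacklist, $url) || is_blacklisted($blacklist, $effective_url)) {
    echo "Skipping $effective_url (blacklisted)\n";
} else {
    echo "Adding $effective_url to the sitemap\n";
}
```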

vezaynk commented 6 years ago

I was wrong. It is not related to the redirect. Your link looks like this: `<a href=" https://www.may-online.com/en">en</a>`. That is not okay: the space before the https:// makes the URL invalid. Web browsers are smart enough to remove it; my script is not.
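A minimal sketch of the browser-like tolerance he describes, assuming the href value has already been extracted as a string (extraction code omitted):

```php
<?php
// Browsers strip leading/trailing whitespace from href values before
// resolving them; trimming the extracted string gives a crawler the
// same tolerance. The $href below reproduces the broken link above.
$href = " https://www.may-online.com/en";
$href = trim($href);

// The URL now parses cleanly:
var_dump(parse_url($href, PHP_URL_HOST)); // string(18) "www.may-online.com"
```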

vezaynk commented 6 years ago

I was supposed to close this issue via commit. The redirection bug was fixed in b8943622bb004d90c617fbac91d73d84cbdfdc68.