vezaynk / Sitemap-Generator-Crawler

PHP script to recursively crawl websites and generate a sitemap. Zero dependencies.
https://www.bbss.dev
MIT License

Add optional support for image indexing #19

Open vezaynk opened 7 years ago

vezaynk commented 7 years ago

Spec: https://support.google.com/webmasters/answer/178636

Example:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>http://example.com/sample.html</loc>
    <image:image>
      <image:loc>http://example.com/image.jpg</image:loc>
    </image:image>
    <image:image>
      <image:loc>http://example.com/photo.jpg</image:loc>
    </image:image>
  </url>
</urlset>
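
For anyone who wants to experiment with the output format before touching the crawler, here is a minimal sketch that produces the structure shown in the example. It is written in Python rather than the project's PHP, and the function name `build_image_sitemap` and its input shape (a dict mapping each page URL to a list of image URLs) are assumptions made for illustration, not part of this project.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"


def build_image_sitemap(pages):
    """Build an image sitemap string.

    `pages` is a hypothetical input shape: {page_url: [image_url, ...]}.
    """
    # Use the sitemap namespace as the default and "image" as a prefix,
    # matching the example in the issue.
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)

    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url

    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


sitemap = build_image_sitemap({
    "http://example.com/sample.html": [
        "http://example.com/image.jpg",
        "http://example.com/photo.jpg",
    ],
})
```

ElementTree escapes special characters in the URLs (for example `&`), which a string-concatenation approach would have to handle by hand.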

Support for Google's image sitemap spec needs to be added.


vezaynk commented 7 years ago

For lack of an easier implementation, if the option is enabled, an img_index function will be called from within the scan_url function whenever a URL fails the header check.
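
The control flow described in this comment could look roughly like the sketch below. It is Python, not the project's PHP, and it assumes that the "header check" means "the Content-Type is HTML"; the function bodies, parameters, and the `images_found` list are hypothetical and only borrow the names scan_url and img_index from the comment.

```python
def img_index(url, media_type, images_found):
    """Record the URL if its media type says it is an image (hypothetical body)."""
    if media_type.startswith("image/"):
        images_found.append(url)


def scan_url(url, content_type, index_images, images_found):
    """Return True if the URL should be crawled as an HTML page.

    Assumption: the header check passes only for text/html responses.
    """
    # Drop parameters such as "; charset=UTF-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        return True
    # Header check failed: hand the URL to the image indexer if enabled.
    if index_images:
        img_index(url, media_type, images_found)
    return False
```

The option only affects the failure path, so pages that pass the header check are crawled exactly as before.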

vezaynk commented 7 years ago

Personal objective: I am going to try to do it over the weekend.

vezaynk commented 7 years ago

I have misjudged the extent of the effort. This opens its own can of worms:

  1. Scanning hrefs and imgs
  2. Identifying images
  3. Keeping track of context
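
To make the three points concrete, here is a rough sketch of one way they might fit together, assuming "context" means "which page each image was found on". It is Python rather than the project's PHP, uses only the standard library, and the names `LinkAndImageParser` and `scan_page` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndImageParser(HTMLParser):
    """Collect absolute link and image URLs from one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url  # context: the page being scanned
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative hrefs against the current page.
            self.links.append(urljoin(self.page_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.page_url, attrs["src"]))


def scan_page(page_url, html):
    """Return (links, images) found in `html`, resolved against `page_url`."""
    parser = LinkAndImageParser(page_url)
    parser.feed(html)
    parser.close()
    return parser.links, parser.images


# Keep track of context by storing images under the page they appeared on.
pages = {}
page = "http://example.com/blog/post.html"
links, images = scan_page(
    page,
    '<a href="/about">About</a><img src="img/a.jpg">'
    '<img src="http://cdn.example.com/b.png"/>',
)
pages[page] = images
```

This only catches images referenced by img tags. Images that are linked with a plain href would still need a separate check (for example on the response headers) to be identified.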

This is not something I can do in a weekend. I am tempted to mark this as out-of-scope, but it looks like a fun feature to try to implement. While I will never officially support it, I will probably do it anyway.

With that said, PRs are welcome if anybody wants to do this themselves in the meantime!

ghost commented 7 years ago

Licenses are important for image sitemaps, as there is no other feasible method for communicating image licenses to search engines. If the license is site-wide, there could be a command-line option for supplying it (a URL to the actual license), for example: --license http://creativecommons.org/publicdomain/zero/1.0/

Which would output to sitemap.xml inside <image:image>:

<image:license>http://creativecommons.org/publicdomain/zero/1.0/</image:license>
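
A site-wide license option along these lines could be wired up roughly as follows. This is a Python sketch, not the project's PHP, and the --license flag is only the suggestion from this comment, not an existing option of the script; `parse_args` and `image_entry` are hypothetical helper names.

```python
import argparse
from xml.sax.saxutils import escape


def parse_args(argv):
    """Parse the proposed (not yet existing) --license option."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--license",
        default=None,
        help="URL of a license that applies to every image on the site",
    )
    return parser.parse_args(argv)


def image_entry(image_url, license_url=None):
    """Render one image entry, adding the license tag only when one is given."""
    lines = [
        "<image:image>",
        f"  <image:loc>{escape(image_url)}</image:loc>",
    ]
    if license_url:
        lines.append(f"  <image:license>{escape(license_url)}</image:license>")
    lines.append("</image:image>")
    return "\n".join(lines)


args = parse_args(["--license", "http://creativecommons.org/publicdomain/zero/1.0/"])
entry = image_entry("http://example.com/image.jpg", args.license)
```

Leaving the option unset produces the same output as before, so the feature stays opt-in.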

There are also some other tags in image sitemaps that could be read from the HTML if they are present, including:

Video sitemaps are not very different from image sitemaps either, but here are a few more obligatory tags:

These could be crawled from the HTML or, if not present, populated with placeholders.

After the image crawling is working, I am happy to offer the project an online environment I have already coded where people can generate image sitemaps.

I will take a look at your code later and see if I can throw in something more tangible than just ideas and testing.