rockdaboot / mget

Multithreaded metalink/file/website downloader (like Wget) and C library
GNU Lesser General Public License v3.0

support sitemap files #15

Closed rockdaboot closed 10 years ago

rockdaboot commented 10 years ago

Download the sitemap URLs listed in robots.txt (gzip-compressed and uncompressed). Parse these files with Mget's XML parser to fetch all URLs. Respect additional information/schemas from 'urlset', e.g. http://www.google.com/schemas/sitemap-image/1.1.

See http://www.sitemaps.org/protocol.html for more information.
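
For illustration only, the first step (collecting sitemap URLs from robots.txt) could be sketched roughly as below. This is not Mget's code: it is a small Python sketch, the function name `sitemap_urls_from_robots` is hypothetical, and it assumes the `Sitemap:` field name is matched case-insensitively and that `#` starts a comment.

```python
def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Collect the values of all 'Sitemap:' lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Assumption: '#' starts a comment, so a sitemap URL containing
        # a fragment would be truncated here.
        line = line.split("#", 1)[0].strip()
        # Split only at the first ':' so the URL's own "http:" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            value = value.strip()
            if value:
                urls.append(value)
    return urls
```

Each returned URL would then be downloaded and handed to the sitemap parser.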

rockdaboot commented 10 years ago

Mget now supports sitemap index files and sitemap files in 'sitemap' format (gzip-compressed and uncompressed) and in plain text format. Scanning of RSS and Atom feed formats, both as sitemap files and within HTML, will be supported soon.
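
As a rough sketch of how the formats mentioned above can be told apart (again in Python, not Mget's actual implementation): gzip data is recognized by its two magic bytes, plain text by the absence of a leading `<`, and index files versus URL sets by the XML root element. The name `parse_sitemap` and the returned `(kind, urls)` tuple are assumptions made for this example.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def parse_sitemap(data: bytes):
    """Return (kind, urls); kind is 'index', 'urlset' or 'text'."""
    # gzip streams always start with the bytes 0x1f 0x8b.
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)

    text = data.decode("utf-8-sig", errors="replace")
    if not text.lstrip().startswith("<"):
        # Plain text format: one URL per line.
        return "text", [ln.strip() for ln in text.splitlines() if ln.strip()]

    root = ET.fromstring(data)
    if root.tag == f"{{{SITEMAP_NS}}}sitemapindex":
        kind, path = "index", f"{{{SITEMAP_NS}}}sitemap/{{{SITEMAP_NS}}}loc"
    elif root.tag == f"{{{SITEMAP_NS}}}urlset":
        kind, path = "urlset", f"{{{SITEMAP_NS}}}url/{{{SITEMAP_NS}}}loc"
    else:
        raise ValueError(f"unexpected root element: {root.tag}")

    urls = [el.text.strip() for el in root.findall(path)
            if el.text and el.text.strip()]
    return kind, urls
```

For an `'index'` result, each URL points at another sitemap file, which would be fetched and passed through the same function.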

rockdaboot commented 10 years ago

Added parsing of RSS 2.0 and Atom 1.0 feeds.
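
A minimal Python sketch of what extracting URLs from such feeds can look like, assuming (as the sitemaps.org protocol describes) that the URL comes from each item's `<link>` element in RSS 2.0 and from the `href` attribute of each entry's `<link>` in Atom 1.0. The function name `feed_urls` is hypothetical and this is not Mget's parser.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def feed_urls(data):
    """Return the entry URLs of an RSS 2.0 or Atom 1.0 feed."""
    root = ET.fromstring(data)
    urls = []
    if root.tag == "rss":
        # RSS 2.0 elements are not namespaced; the URL is the text of <link>.
        for el in root.findall("channel/item/link"):
            if el.text and el.text.strip():
                urls.append(el.text.strip())
    elif root.tag == f"{{{ATOM_NS}}}feed":
        # Atom 1.0 keeps the URL in the href attribute of <link>.
        for el in root.findall(f"{{{ATOM_NS}}}entry/{{{ATOM_NS}}}link"):
            href = el.get("href")
            if href:
                urls.append(href)
    return urls
```

The sketch takes every entry `<link>` in an Atom feed regardless of its `rel` attribute; a stricter version might keep only `rel="alternate"` links.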