sethia4u / abot

Automatically exported from code.google.com/p/abot
Apache License 2.0

Suggestion: Allow crawling based on an XML sitemap #108

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
My own reason for wanting to build a crawler is to test site links before I 
post major changes to a live site. As such, I want to test the sitemap, which 
is in XML format (this format specifically: http://www.sitemaps.org/). 

The current version of abot (1.1) returns "Page has no content" when it 
encounters XML files. It would be nice if it would parse each link in the XML 
sitemap file.

There are also many XSL stylesheets for XML sitemaps (like this one: 
http://yoast.com/xsl-stylesheet-xml-sitemap/) that render the sitemap in a 
more "human friendly", easy-to-consume form for admins.

Original issue reported on code.google.com by scott.wh...@gmail.com on 30 May 2013 at 7:44

GoogleCodeExporter commented 9 years ago
By default Abot is configured (in the config file) to download only certain 
types of content...

downloadableContentTypes="text/html, text/plain"

Just add "text/xml" to that list...

downloadableContentTypes="text/html, text/plain, text/xml"
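For context, here is roughly where that attribute lives in the abot section of app.config (the other attributes shown are illustrative, so check your own config file rather than copying this verbatim):

```xml
<!-- Sketch of the relevant config fragment; only downloadableContentTypes
     is the setting being discussed here -->
<abot>
  <crawlBehavior
      maxConcurrentThreads="10"
      maxPagesToCrawl="1000"
      downloadableContentTypes="text/html, text/plain, text/xml" />
</abot>
```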

You should be good to go. Marking as fixed since this feature is already 
present.

Original comment by sjdir...@gmail.com on 30 May 2013 at 9:32

GoogleCodeExporter commented 9 years ago
Thanks for the super fast reply. I'm not at my dev box now, so could you tell 
me how abot will handle an XML file that does not have an href preceding the 
URL? In other words, an XML sitemap just wraps each URL like this:
<loc>http://mysite.com/myfile.aspx</loc>
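For reference, a minimal sitemap in the sitemaps.org format looks like this (the URL is element text, with no href attribute anywhere):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://mysite.com/myfile.aspx</loc>
  </url>
</urlset>
```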

Original comment by scott.wh...@gmail.com on 30 May 2013 at 9:49

GoogleCodeExporter commented 9 years ago
Good point, the hyperlink parser would need to be modified to pick up <loc>
tags. You could pull it off by doing something like this:

1) Open Abot.Core.HapHyperlinkParser.cs or extend it. This class uses the
popular DOM library Html Agility Pack to parse links.

2) On lines 23 and 24 you will see how abot uses Html Agility Pack to get
all anchor tags and area tags with an href:

            HtmlNodeCollection aTags = crawledPage.HtmlDocument.DocumentNode.SelectNodes("//a[@href]");
            HtmlNodeCollection areaTags = crawledPage.HtmlDocument.DocumentNode.SelectNodes("//area[@href]");

3) Try adding something like the following to also pick up the loc tag:

            HtmlNodeCollection locTags = crawledPage.HtmlDocument.DocumentNode.SelectNodes("//loc");

4) Modify the GetLinks() method to read the text inside the <loc> tag
(node.InnerText) instead of the href attribute's ".Value"
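Putting the steps above together, a parser method might look roughly like this. This is an untested sketch, not Abot's actual implementation: the class name, the AddIfValid helper, and the method signature are all made up for illustration, and it assumes Html Agility Pack is referenced:

```csharp
// Untested sketch: harvest <a>/<area> href links plus sitemap <loc> URLs
// from an HtmlAgilityPack HtmlDocument.
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public class SitemapAwareHyperlinkParser
{
    public IEnumerable<Uri> GetLinks(Uri pageUri, HtmlDocument htmlDocument)
    {
        List<Uri> links = new List<Uri>();

        // Existing behavior: anchor and area tags that carry an href attribute
        HtmlNodeCollection aTags = htmlDocument.DocumentNode.SelectNodes("//a[@href]");
        HtmlNodeCollection areaTags = htmlDocument.DocumentNode.SelectNodes("//area[@href]");

        // New behavior: sitemap <loc> tags, where the URL is the element's
        // text content rather than an attribute value
        HtmlNodeCollection locTags = htmlDocument.DocumentNode.SelectNodes("//loc");

        if (aTags != null)
            foreach (HtmlNode node in aTags)
                AddIfValid(links, pageUri, node.Attributes["href"].Value);

        if (areaTags != null)
            foreach (HtmlNode node in areaTags)
                AddIfValid(links, pageUri, node.Attributes["href"].Value);

        if (locTags != null)
            foreach (HtmlNode node in locTags)
                AddIfValid(links, pageUri, node.InnerText);

        return links;
    }

    // Hypothetical helper: resolve relative candidates against the page URI
    // and silently skip anything that is not a valid URI
    private void AddIfValid(List<Uri> links, Uri baseUri, string candidate)
    {
        Uri uri;
        if (Uri.TryCreate(baseUri, candidate.Trim(), out uri))
            links.Add(uri);
    }
}
```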

I may add this capability if you would be willing to share your solution.

Original comment by sjdir...@gmail.com on 31 May 2013 at 2:07

GoogleCodeExporter commented 9 years ago
Oops - I started working on my problem before I saw your reply. I ended up 
just creating a stylesheet that runs "over" the XML sitemap to convert it to 
HTML, and I point abot at the HTML file. Your way is much more elegant, but my 
way "works" and I'm kind of happy with it at this point.

Here's the code to convert an XML sitemap to HTML using C# + ASP.NET, if 
anyone wants it. I don't claim it to be the fastest/best/optimized/etc - I'm 
on a dev box and just needed "good enough"! I did create an effectively blank 
master page to go with it - it was easier to do that than to figure out how to 
make a page that's assigned a master not use that master haha. 

<%@ Page Language="C#" Title="Sitemap in HTML format" 
MasterPageFile="~/Templates/Web/blank.master" EnableViewState="false" %>

<%@ Import Namespace="System.IO" %>
<%@ Import Namespace="System.Net" %>
<%@ Import Namespace="System.Xml" %>
<asp:Content ID="b1" ContentPlaceHolderID="BodyContainer" runat="server">
    <div ID="xmlData" runat="server"></div>
</asp:Content>
<script runat="server">
    private void Page_Load(object sender, EventArgs e)
    {
        string postUrl = "http://localhost:1000/SiteMap.xml";

        // Download the raw sitemap XML, disposing the response and reader
        string result;
        using (WebResponse webResponse = WebRequest.Create(postUrl).GetResponse())
        using (StreamReader sr = new StreamReader(webResponse.GetResponseStream()))
        {
            result = sr.ReadToEnd().Trim();
        }

        XmlDocument xdoc = new XmlDocument();
        xdoc.LoadXml(result);
        XmlNode docElement = xdoc.DocumentElement as XmlNode;

        // Have to give SelectNodes a namespace or it can't find anything:
        // http://stackoverflow.com/questions/699184/basics-of-xmlnode-selectnodes
        XmlNamespaceManager nsman = new XmlNamespaceManager(xdoc.NameTable);
        nsman.AddNamespace("a", docElement.NamespaceURI);

        XmlNodeList nodes = xdoc.DocumentElement.SelectNodes("/a:urlset/a:url/a:loc", nsman);

        // Emit each <loc> URL as a plain anchor so abot can follow it
        StringBuilder sb = new StringBuilder();
        foreach (XmlNode node in nodes)
        {
            sb.AppendLine();
            sb.AppendFormat("<p><a href='{0}'>{0}</a></p>", node.InnerText);
        }

        xmlData.InnerHtml = sb.ToString();
    }
</script>

Original comment by scott.wh...@gmail.com on 31 May 2013 at 12:22