GoogleCodeExporter closed this issue 9 years ago
By default Abot is configured (in the config file) to download only certain
types of content...
downloadableContentTypes="text/html, text/plain"
Just add "text/xml" to that list...
downloadableContentTypes="text/html, text/plain, text/xml"
You should be good to go. Marking as fixed since this feature is already
present.
Original comment by sjdir...@gmail.com
on 30 May 2013 at 9:32
Thanks for the super fast reply. Given that I'm not at my dev box now, could
you tell me how abot will handle an xml file that does not have an href
preceding the url? In other words, the format of an xml sitemap just has
<loc>http://mysite.com/myfile.aspx</loc>
Original comment by scott.wh...@gmail.com
on 30 May 2013 at 9:49
Good point, the hyperlink parser would need to be modified to pick up <loc>
tags. You could pull it off by doing something like this:
1) Open Abot.Core.HapHyperlinkParser.cs, or extend the class. It uses the
popular DOM library Html Agility Pack to parse the links.
2) On lines 23 and 24 you will see how Abot uses Html Agility Pack to get
all anchor tags with an href and all area tags with an href:
HtmlNodeCollection aTags = crawledPage.HtmlDocument.DocumentNode
    .SelectNodes("//a[@href]");
HtmlNodeCollection areaTags = crawledPage.HtmlDocument.DocumentNode
    .SelectNodes("//area[@href]");
3) Try adding something like the following to also pick up the loc tags:
HtmlNodeCollection locTags = crawledPage.HtmlDocument.DocumentNode
    .SelectNodes("//loc");
4) Modify the GetLinks() method so that, for <loc> tags, it extracts the
text inside the tag instead of the href attribute's ".Value".
I may add this capability if you would be willing to share your solution.
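For anyone who wants to try the <loc> idea outside of Abot first, here is a minimal sketch of steps 3 and 4 using only System.Xml.Linq instead of Html Agility Pack. It assumes the sitemap XML is already available as a string; the class and method names (SitemapLocExtractor, GetLocUrls) are made up for illustration and are not part of Abot. Matching on the local name sidesteps the sitemap namespace, so the same code works whether or not the file declares one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of Abot.
public static class SitemapLocExtractor
{
    // Returns the trimmed text of every <loc> element, whatever its namespace.
    public static List<string> GetLocUrls(string sitemapXml)
    {
        XDocument doc = XDocument.Parse(sitemapXml);
        return doc.Descendants()
                  .Where(e => e.Name.LocalName == "loc")   // ignore the namespace prefix
                  .Select(e => e.Value.Trim())             // text inside <loc>...</loc>
                  .Where(url => url.Length > 0)            // skip empty entries
                  .ToList();
    }
}
```

The returned strings could then be turned into Uri objects and handed to whatever collects the links in your modified parser.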
Original comment by sjdir...@gmail.com
on 31 May 2013 at 2:07
Oops - started working on my problem before I saw your reply. I ended up just
creating a stylesheet that runs "over" the xml sitemap to convert it to html, and
I point Abot to the html file. Your way is much more elegant but my way "works"
and so I'm kind of happy with it at this point.
Here's the code to convert an xml sitemap to html using C# + ASP.NET if anyone
wants it. I don't claim it to be the fastest/best/optimized/etc - I'm on a dev
box and just needed "good enough"! I did create an effectively blank master
page to go with it - it was easier to do that than to figure out how to make a
page that's assigned a master not use that master haha.
<%@ Page Language="C#" Title="Sitemap in HTML format"
    MasterPageFile="~/Templates/Web/blank.master" EnableViewState="false" %>
<%@ Import Namespace="System.IO" %>
<%@ Import Namespace="System.Net" %>
<%@ Import Namespace="System.Xml" %>
<asp:Content ID="b1" ContentPlaceHolderID="BodyContainer" runat="server">
    <div ID="xmlData" runat="server"></div>
</asp:Content>
<script runat="server">
    private void Page_Load(object sender, EventArgs e)
    {
        string sitemapUrl = "http://localhost:1000/SiteMap.xml";
        string result;
        // Dispose the response and reader once the sitemap has been read.
        using (WebResponse webResponse = WebRequest.Create(sitemapUrl).GetResponse())
        using (StreamReader sr = new StreamReader(webResponse.GetResponseStream()))
        {
            result = sr.ReadToEnd().Trim();
        }
        XmlDocument xdoc = new XmlDocument();
        xdoc.LoadXml(result);
        XmlNamespaceManager nsman = new XmlNamespaceManager(xdoc.NameTable);
        // Have to give it a namespace or it can't find it:
        // http://stackoverflow.com/questions/699184/basics-of-xmlnode-selectnodes
        nsman.AddNamespace("a", xdoc.DocumentElement.NamespaceURI);
        XmlNodeList nodes = xdoc.SelectNodes("/a:urlset/a:url/a:loc", nsman);
        StringBuilder sb = new StringBuilder();
        foreach (XmlNode node in nodes)
        {
            // Encode the URL so characters such as & and " cannot break the markup.
            string url = HttpUtility.HtmlEncode(node.InnerText.Trim());
            sb.AppendLine();
            sb.AppendFormat("<p><a href=\"{0}\">{0}</a></p>", url);
        }
        xmlData.InnerHtml = sb.ToString();
    }
</script>
Original comment by scott.wh...@gmail.com
on 31 May 2013 at 12:22
Original issue reported on code.google.com by
scott.wh...@gmail.com
on 30 May 2013 at 7:44