Closed GoogleCodeExporter closed 9 years ago
Hello Crawler4J team. I ran into this bug while using crawler4j and managed to
fix it. Please consider the following changes. This is a great library, by the
way.
Original comment by anbini...@gmail.com
on 8 Apr 2012 at 7:37
You can get null if you have an <img> tag inside your anchor tag.
For example,
<a href=""> <img src="" /> </a>
will return null. In my case, this is why I was getting nulls.
Original comment by smsa...@gmail.com
on 2 Aug 2012 at 3:09
I get this null every time I invoke it from visit(), as
Alexey.R...@gmail.com said.
Original comment by wangshuimail@gmail.com
on 7 Aug 2012 at 12:18
Go to the HtmlContentHandler class and follow these instructions:
========================================================================
First: change the anchorText variable to a String, like this:
========================================================================
private String anchorText = "";
============================================
Second: modify the constructor like this:
============================================
public HtmlContentHandler() {
    // anchorText is no longer created here; it is initialized at its declaration.
    isWithinBodyElement = true;
    bodyText = new StringBuilder();
    outgoingUrls = new ArrayList<ExtractedUrlAnchorPair>();
}
=======================================================
Third: modify the characters method like this:
=======================================================
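public void characters(char[] ch, int start, int length) throws SAXException {
    if (isWithinBodyElement) {
        bodyText.append(ch, start, length);
        if (anchorFlag) {
            // Strip newlines and tabs, then trim. Note: a SAX parser may deliver
            // one text run in several chunks, and this assignment keeps only the last.
            anchorText = new String(ch, start, length).replaceAll("\n", "").replaceAll("\t", "").trim();
        }
    }
}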
Then enjoy the anchor text. :)
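For anyone who wants to try the idea outside crawler4j, here is a minimal, self-contained sketch of the same approach using the JDK's built-in SAX parser. It is not the crawler4j code: the class AnchorTextSketch and its extract method are hypothetical names made up for illustration, and it assumes well-formed XHTML input. Unlike the patch above, it appends every text chunk to a StringBuilder and collapses all whitespace runs into single spaces, rather than keeping only the last chunk.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for HtmlContentHandler; not part of crawler4j.
class AnchorTextSketch extends DefaultHandler {
    // href -> anchor text, in document order
    private final Map<String, String> anchors = new LinkedHashMap<>();
    private final StringBuilder anchorText = new StringBuilder();
    private String currentHref; // non-null only while inside an <a href="..."> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("a".equalsIgnoreCase(qName) && attrs.getValue("href") != null) {
            currentHref = attrs.getValue("href");
            anchorText.setLength(0); // start collecting text for this anchor
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append rather than assign: the parser may split text into several calls,
        // and nested elements such as <b> produce separate calls as well.
        if (currentHref != null) {
            anchorText.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("a".equalsIgnoreCase(qName) && currentHref != null) {
            // Collapse newlines, tabs and repeated spaces into single spaces.
            String text = anchorText.toString().replaceAll("\\s+", " ").trim();
            anchors.put(currentHref, text);
            currentHref = null;
        }
    }

    // Parses a well-formed XHTML string and returns href -> anchor text.
    static Map<String, String> extract(String xhtml) throws Exception {
        AnchorTextSketch handler = new AnchorTextSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.anchors;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<a href=\"a.html\">Hello\n\t<b>World</b></a>"
                + "<a href=\"b.html\"><img src=\"p.png\"/></a>"
                + "</body></html>";
        System.out.println(extract(page));
    }
}
```

In this sketch, an anchor that contains only an <img> tag ends up with an empty string instead of null, because the builder is reset when the anchor opens and no character data arrives before it closes.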
Original comment by martin.a...@gmail.com
on 28 Sep 2012 at 12:57
I modified the HtmlContentHandler class exactly as described in the post above
(with one additional change: I had to replace the anchorText.delete call with
anchorText = "" to avoid a compile error).
Nothing changed. I still get "null".
Frustrating...
Original comment by wangshuimail@gmail.com
on 4 Oct 2012 at 1:16
At last! With the WebURLTupleBinding.java provided above and the tweaks to
the HtmlContentHandler class, I finally got this to work.
Note that you still need to fix an error in HtmlContentHandler.endElement().
Change the statement:
anchorText.delete(0, anchorText.length());
to
anchorText = "";
(or just delete it).
Original comment by wangshuimail@gmail.com
on 5 Oct 2012 at 6:33
This issue was closed by revision c697a985f583.
Original comment by ganjisaffar@gmail.com
on 3 Mar 2013 at 8:05
Original issue reported on code.google.com by
Alexey.R...@gmail.com
on 4 Apr 2012 at 2:00