mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Impossible to get anchor text in visit(Page page) #143

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
String url = page.getWebURL().getURL();
String anchor = page.getWebURL().getAnchor();
System.out.println("URL: " + url + " -> " + anchor);

anchor is always null.

Original issue reported on code.google.com by Alexey.R...@gmail.com on 4 Apr 2012 at 2:00

GoogleCodeExporter commented 9 years ago
Hello crawler4j team. I ran into this bug while using crawler4j and managed to 
fix it. Please consider the attached changes. This is a great library, by the 
way.

Original comment by anbini...@gmail.com on 8 Apr 2012 at 7:37

Attachments:

GoogleCodeExporter commented 9 years ago
You can get null if there is an <img> tag inside your anchor tag. 
For example, 
      <a href=""> <img src="" /> </a>
returns null. In my case, this is why I was getting nulls.

Original comment by smsa...@gmail.com on 2 Aug 2012 at 3:09

GoogleCodeExporter commented 9 years ago
I get this null every time I call it from visit(), as 
Alexey.R...@gmail.com said.

Original comment by wangshuimail@gmail.com on 7 Aug 2012 at 12:18

GoogleCodeExporter commented 9 years ago
Go to the HtmlContentHandler class and follow these steps:

===========================================================
First: change the anchorText field to a String, like this:
===========================================================

private String anchorText = "";

==========================================
Second: modify the constructor like this:
==========================================

public HtmlContentHandler() {
    isWithinBodyElement = true;
    bodyText = new StringBuilder();
    outgoingUrls = new ArrayList<ExtractedUrlAnchorPair>();
}

===============================================
Third: modify the characters method like this:
===============================================

public void characters(char[] ch, int start, int length) throws SAXException {
    if (isWithinBodyElement) {
        bodyText.append(ch, start, length);

        if (anchorFlag) {
            // Note: SAX may deliver one run of text in several chunks. Each call
            // overwrites anchorText, so only the most recent chunk is kept.
            anchorText = new String(ch, start, length).replaceAll("\n", "").replaceAll("\t", "").trim();
        }
    }
}

Then enjoy the anchor text. :)
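
One caveat with assigning the anchor text inside characters(): a SAX parser is allowed to deliver a single run of text in several calls, so a plain assignment can keep only the last chunk. A minimal sketch of the accumulate-then-flush alternative follows. It is not crawler4j code: the class AnchorTextSketch, the AnchorHandler handler and the extractAnchors helper are hypothetical names, and it uses the JDK's own SAX parser on well-formed XHTML rather than the HTML parser crawler4j uses.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class AnchorTextSketch {

    /** Hypothetical handler: maps each href to the text found inside its <a> element. */
    static class AnchorHandler extends DefaultHandler {
        final Map<String, String> anchors = new LinkedHashMap<>();
        private final StringBuilder anchorText = new StringBuilder();
        private String currentHref; // non-null only while inside an <a href="...">

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("a".equalsIgnoreCase(qName) && atts.getValue("href") != null) {
                currentHref = atts.getValue("href");
                anchorText.setLength(0); // start a fresh buffer for this anchor
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (currentHref != null) {
                // Append rather than assign: this method may be called several
                // times for one run of text.
                anchorText.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("a".equalsIgnoreCase(qName) && currentHref != null) {
                // Collapse newlines, tabs and repeated spaces into single spaces.
                String text = anchorText.toString().replaceAll("\\s+", " ").trim();
                anchors.put(currentHref, text);
                currentHref = null;
            }
        }
    }

    /** Parses well-formed XHTML and returns href -> anchor text, in document order. */
    static Map<String, String> extractAnchors(String xhtml) throws Exception {
        AnchorHandler handler = new AnchorHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.anchors;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<a href=\"/docs\">Read the\n\tdocs</a>"
                + "<a href=\"/logo\"><img src=\"logo.png\"/></a>"
                + "</body></html>";
        System.out.println(extractAnchors(xhtml));
    }
}
```

In this sketch an anchor that contains only an image ends up with an empty string rather than null, because no character data is ever reported for it. Nested anchors are not handled.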

Original comment by martin.a...@gmail.com on 28 Sep 2012 at 12:57

GoogleCodeExporter commented 9 years ago
I modified the HtmlContentHandler class exactly as described in the post above 
(with one additional change: I had to replace the anchorText.delete call with 
anchorText = "" to avoid a compile error).

Nothing changed. I still get null.

Frustrated...

Original comment by wangshuimail@gmail.com on 4 Oct 2012 at 1:16

GoogleCodeExporter commented 9 years ago
At LAST! With the WebURLTupleBinding.java provided above and the tweaks to the 
HtmlContentHandler class, I finally got this to work.

Note that you still need to fix an error in HtmlContentHandler.endElement(): 
change the statement
anchorText.delete(0, anchorText.length());
to 
anchorText = "";
(or just delete it).
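
For anyone wondering why that statement stops compiling: delete(int, int) is a StringBuilder method, and String has no equivalent because strings are immutable. Once the field is declared as a String, the only way to "clear" it is to reassign it. A small standalone sketch of the two reset idioms (ResetSketch and its method names are hypothetical, not part of crawler4j):

```java
public class ResetSketch {

    /** Clears a StringBuilder in place, as the original delete(...) call did. */
    static StringBuilder resetBuilder(StringBuilder sb) {
        sb.delete(0, sb.length()); // same effect as sb.setLength(0)
        return sb;
    }

    /** A String cannot be cleared in place; the caller reassigns to the returned value. */
    static String resetString(String ignored) {
        return "";
    }

    public static void main(String[] args) {
        StringBuilder builder = new StringBuilder("old anchor");
        resetBuilder(builder);

        String text = "old anchor";
        text = resetString(text);

        System.out.println(builder.length() + " " + text.length());
    }
}
```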

Original comment by wangshuimail@gmail.com on 5 Oct 2012 at 6:33

GoogleCodeExporter commented 9 years ago
This issue was closed by revision c697a985f583.

Original comment by ganjisaffar@gmail.com on 3 Mar 2013 at 8:05