mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

How to get original links in html #30

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
In HTMLParser.java, add the method below to do this. I am posting this code in 
case anyone has the same need.

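// Needs java.util.HashSet, java.util.Iterator and java.util.Set in HTMLParser.java.
public Set<String> fetchOriginLinks(String htmlContent) {
    HashSet<String> originUrls = new HashSet<String>();
    char[] chars = htmlContent.toCharArray();
    // Reset the shared extractor, then let the parser collect the href values.
    linkExtractor.urls.clear();
    bulletParser.setCallback(linkExtractor);
    bulletParser.parse(chars);
    Iterator<String> it = linkExtractor.urls.iterator();

    int urlCount = 0;
    while (it.hasNext()) {
        String href = it.next().trim();
        if (href.length() == 0) {
            continue;
        }
        // Lowercase first so the protocol and javascript: checks ignore case.
        String hrefWithoutProtocol = href.toLowerCase();
        if (hrefWithoutProtocol.startsWith("http://")) {
            hrefWithoutProtocol = hrefWithoutProtocol.substring(7);
        }
        // Skip javascript: pseudo-links and anything containing '@' (e.g. mailto:).
        if (hrefWithoutProtocol.indexOf("javascript:") < 0
                && hrefWithoutProtocol.indexOf("@") < 0) {
            // Keep the link as written in the page, without canonicalizing it.
            originUrls.add(href);
            urlCount++;
            // Stop once MAX_OUT_LINKS links have been collected.
            if (urlCount >= MAX_OUT_LINKS) {
                break;
            }
        }
    }
    linkExtractor.urls.clear();
    return originUrls;
}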
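
For anyone who wants to try the filtering rules without patching crawler4j, here is a minimal standalone sketch of the same checks (trim, skip empty values, skip javascript: links and anything containing "@", stop at a cap). The class name OriginLinkFilter and the maxLinks parameter are made up for illustration and are not part of crawler4j; the input is assumed to be href values that some parser has already collected.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of crawler4j: applies the same filtering
// rules to href values that have already been extracted from a page.
class OriginLinkFilter {

    // Returns the trimmed hrefs that pass the filter, in encounter order,
    // keeping at most maxLinks of them.
    static Set<String> filter(Collection<String> rawHrefs, int maxLinks) {
        Set<String> kept = new LinkedHashSet<>();
        for (String raw : rawHrefs) {
            if (kept.size() >= maxLinks) {
                break;
            }
            String href = raw.trim();
            if (href.isEmpty()) {
                continue;
            }
            // Lowercase once so the checks below ignore case.
            String lower = href.toLowerCase(Locale.ROOT);
            String withoutProtocol =
                    lower.startsWith("http://") ? lower.substring(7) : lower;
            if (!withoutProtocol.contains("javascript:")
                    && !withoutProtocol.contains("@")) {
                // Store the href as written, not the lowercased copy.
                kept.add(href);
            }
        }
        return kept;
    }
}
```

Calling OriginLinkFilter.filter(hrefs, 100) on a list that contains "javascript:void(0)" or "mailto:x@example.com" should drop those entries and keep relative links such as "/a.html" unchanged.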

Original issue reported on code.google.com by wanxiang.xing@gmail.com on 9 Apr 2011 at 5:45

GoogleCodeExporter commented 9 years ago
Can you explain more about how to use this code? I put it in the HTMLParser 
class, but how do I actually make it work?

Original comment by H.Almere...@gmail.com on 29 Oct 2011 at 4:35

GoogleCodeExporter commented 9 years ago
It is easy to use, something like what is shown below:

    String _sFileName = "./minzu/index.htm";
    String htmlContent = CMyFile.readFile(_sFileName);
    System.out.println(store.filterLink(htmlContent));
    ......
    HTMLParser parse = new HTMLParser();
    ......

    protected String filterLink(String htmlContent) {
        Set<String> urls = parse.fetchOriginLinks(htmlContent);
        for (String aLink : urls) {
            .......
        }
        return htmlContent;
    }
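
To make "original links" concrete: they are the href values exactly as they are written in the page source, before any resolution against the page URL. The sketch below shows the idea with only the Java standard library. It is an assumption-laden stand-in, not crawler4j code: RawHrefExtractor is a made-up name, and the regular expression only handles quoted href attributes, so it is far less robust than a real HTML parser.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a parser-based extractor; not part of crawler4j.
class RawHrefExtractor {

    // Matches href="..." or href='...' in any letter case.
    // A simplification: unquoted attribute values are not handled.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*(\"([^\"]*)\"|'([^']*)')", Pattern.CASE_INSENSITIVE);

    // Returns the href values as written in the HTML, in document order,
    // without duplicates and without resolving relative paths.
    static Set<String> extract(String html) {
        Set<String> hrefs = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String value = m.group(2) != null ? m.group(2) : m.group(3);
            hrefs.add(value.trim());
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<a href=\"news/1.htm\">one</a> <A HREF='../about.htm'>two</A>";
        // Prints news/1.htm and then ../about.htm, each on its own line.
        for (String link : extract(html)) {
            System.out.println(link);
        }
    }
}
```

The relative paths come back untouched, which is the difference from links that have been canonicalized into absolute URLs.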

Original comment by wanxiang.xing@gmail.com on 6 Nov 2011 at 9:43

GoogleCodeExporter commented 9 years ago
This program is a dud. You cannot get the crawled links with it.

Original comment by arijit.a...@gmail.com on 22 Oct 2013 at 10:30

GoogleCodeExporter commented 9 years ago
Not a bug or feature request

Original comment by avrah...@gmail.com on 11 Aug 2014 at 12:45