Use some external HTML parser to parse the pages

sageone / spider

Automatically exported from code.google.com/p/spider


Use some external HTML parser to parse the pages #7

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

Research existing HTML parsers (an Apache project, perhaps?) and evaluate the benefit 
of using one, mainly for link extraction and other HTML processing.

Original issue reported on code.google.com by david.fr...@gmail.com on 6 Jul 2011 at 6:07

GoogleCodeExporter commented 9 years ago

Evaluate jsoup (http://jsoup.org/)

Original comment by david.fr...@gmail.com on 5 Aug 2011 at 6:14

GoogleCodeExporter commented 9 years ago

Dependency for jsoup:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.6.1</version>
</dependency>
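
For reference, a minimal sketch of what link extraction with jsoup might look like once the dependency above is on the classpath. The class name `LinkExtractor` and the method `extractLinks` are hypothetical and not part of the spider code base; only the jsoup calls (`Jsoup.parse`, `select`, `absUrl`) come from the library.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, for illustration only.
class LinkExtractor {

    // Returns the absolute URL of every <a> element that has an href attribute.
    static List<String> extractLinks(String html, String baseUri) {
        // The base URI lets jsoup resolve relative hrefs such as "/about".
        Document doc = Jsoup.parse(html, baseUri);
        List<String> links = new ArrayList<String>();
        for (Element anchor : doc.select("a[href]")) {
            // absUrl returns an empty string when the href cannot be resolved.
            String url = anchor.absUrl("href");
            if (!url.isEmpty()) {
                links.add(url);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<a href=\"/about\">About</a>"
                + "<a href=\"http://jsoup.org/\">jsoup</a>"
                + "</body></html>";
        for (String link : extractLinks(html, "http://example.com/")) {
            System.out.println(link);
        }
    }
}
```

Resolving against a base URI is probably the main gain over hand-written extraction, since the spider needs absolute URLs to follow.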

Original comment by david.fr...@gmail.com on 5 Aug 2011 at 6:16

GoogleCodeExporter commented 9 years ago

Evaluate jericho (http://jericho.htmlparser.net/docs/index.html)
<dependency>
    <groupId>net.htmlparser.jericho</groupId>
    <artifactId>jericho-html</artifactId>
    <version>3.2</version>
</dependency>

Original comment by david.fr...@gmail.com on 5 Aug 2011 at 6:31

GoogleCodeExporter commented 9 years ago

Evaluate ROME (http://java.net/projects/rome/) for RSS parsing.
jsoup has been evaluated and is now used for HTML (commit pending), but it is not 
suitable for RSS, so an alternative is needed.
<dependency>
    <groupId>rome</groupId>
    <artifactId>rome</artifactId>
    <version>1.0</version>
</dependency>
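
A rough sketch of how entry links could be read from a feed with ROME 1.0, assuming the `com.sun.syndication` packages that ship with that version. The class `FeedLinkExtractor` and its method are hypothetical names used only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.FeedException;
import com.sun.syndication.io.SyndFeedInput;

// Hypothetical helper, for illustration only.
class FeedLinkExtractor {

    // Parses RSS/Atom XML and returns the link of each entry that has one.
    static List<String> extractEntryLinks(String feedXml) throws FeedException {
        SyndFeed feed = new SyndFeedInput().build(new StringReader(feedXml));
        List<String> links = new ArrayList<String>();
        // ROME 1.0 returns a raw List, so each element is cast explicitly.
        for (Object item : feed.getEntries()) {
            SyndEntry entry = (SyndEntry) item;
            if (entry.getLink() != null) {
                links.add(entry.getLink());
            }
        }
        return links;
    }
}
```

The links returned here could then be queued in the same way as links extracted from HTML pages.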

Original comment by david.fr...@gmail.com on 30 Aug 2011 at 6:48