wangjunbao / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

Extract article HTML from given HTML source? #58

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Hi,

I know that HTMLHighlighter can extract the article HTML, but only from a
TextDocument or a URL.

I use HttpClient to retrieve the HTML, but I don't know how to construct a
TextDocument from it, or whether there is another way to extract the article HTML.

Please help!

Original issue reported on code.google.com by m...@ndthuan.com on 30 Nov 2012 at 8:44

GoogleCodeExporter commented 9 years ago
Here is what I did:

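import java.io.StringReader;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLHighlighter;

ArticleExtractor EXTRACTOR = ArticleExtractor.getInstance();
HTMLHighlighter HH = HTMLHighlighter.newExtractingInstance();
// html holds the page source as a String; StringReader is in the JDK (StringInputStream is not).
InputSource inputSource = new InputSource(new StringReader(html));
TextDocument htmlDoc = new BoilerpipeSAXInput(inputSource).getTextDocument();
EXTRACTOR.process(htmlDoc);        // marks the text blocks that belong to the article
html = HH.process(htmlDoc, html);  // returns only the article's HTML from the original markup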
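
The step that wraps the HTML string in an InputSource can be tried on its own with JDK classes only. This is a minimal sketch, assuming the HTML has already been decoded into a String (for example from the HttpClient response body). The class name HtmlInputSources and its two methods are hypothetical helpers, not part of boilerpipe, and the boilerpipe calls themselves are left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import org.xml.sax.InputSource;

public class HtmlInputSources {

    // Wraps already-decoded HTML text so that a SAX-based parser can read it.
    // A character stream is used, so no byte encoding has to be guessed.
    public static InputSource fromString(String html) {
        return new InputSource(new StringReader(html));
    }

    // Reads the character stream back into a String, only to show that
    // the wrapped content is unchanged.
    public static String readAll(InputSource source) {
        StringBuilder out = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader reader = source.getCharacterStream()) {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                out.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Caf\u00e9 article text</p></body></html>";
        InputSource source = fromString(html);
        // Compare what the parser would read with the original string.
        System.out.println(readAll(source).equals(html));
    }
}
```

The InputSource returned by fromString is the kind of object that the BoilerpipeSAXInput constructor in the snippet above takes.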

Original comment by tien.ngu...@sematext.com on 28 Jun 2013 at 8:21