rszaloki / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

Outputs HTML instead of plain text for certain URLs #30

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
To reproduce the problem
1. Apply ArticleExtractor to http://fahadbangladesh.blogspot.com/feeds/posts/default?orderby=updated
2. The same problem occurs with DefaultExtractor and CanolaExtractor.

What is the expected output? What do you see instead?
The expected output is plain text, but I get HTML instead. I've attached the output of
ArticleExtractor for the same URL.

What version of the product are you using? On what operating system?
I'm using version 1.2.0 on LMDE (Linux Mint Debian Edition, a rolling distribution based on Debian Testing).

Please provide any additional information below.

Original issue reported on code.google.com by Sharmila.Gopirajan@gmail.com on 27 Aug 2011 at 4:02


GoogleCodeExporter commented 9 years ago
http://www.flickr.com/photos/digitalgold/5511568109/ has a similar issue.

Original comment by Sharmila.Gopirajan@gmail.com on 27 Aug 2011 at 4:14

GoogleCodeExporter commented 9 years ago
Hi Sharmila,

The page at
http://fahadbangladesh.blogspot.com/feeds/posts/default?orderby=updated
is not HTML but Atom XML, with XML-escaped content.

boilerpipe does not support parsing Atom XML, so this is not a bug.
To avoid this kind of output, please check the MIME content type before
parsing.
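
A minimal sketch of such a check, using only the Java standard library. The class and method names (`ContentTypeCheck`, `isHtml`) are hypothetical and not part of boilerpipe; the header value is assumed to come from something like `java.net.URLConnection.getContentType()`. It only decides whether a response looks like HTML, so the caller can skip the extractor for feeds and other non-HTML content.

```java
import java.util.Locale;

// Hypothetical helper, not part of boilerpipe.
final class ContentTypeCheck {

    private ContentTypeCheck() {
    }

    /**
     * Returns true if the given Content-Type header value denotes an
     * HTML or XHTML document. A null or empty header is treated as non-HTML.
     */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" and normalise the case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.equals("text/html") || mime.equals("application/xhtml+xml");
    }
}
```

With this in place, a caller would run the extractor only when `isHtml(...)` returns true. An Atom feed is typically served as `application/atom+xml`, so it would be rejected before parsing.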

Best,
Christian

Original comment by ckkohl79 on 21 Mar 2012 at 9:26