matth / juicer

Juicer is a web API for extracting text, metadata and named entities from HTML "article" type pages.
http://juicer.herokuapp.com/
MIT License
60 stars, 18 forks

Is there a way to get content with tags #7

Open squallstar opened 10 years ago

squallstar commented 10 years ago

Hello,

Thanks for making such a good open-source library. I was wondering whether there is a way to keep HTML tags in the content, since it seems to strip everything.

Thanks

matth commented 10 years ago

Hey, I'm not sure I understand, do you mean things like

and tags?

Unfortunately the underlying library strips out any formatting like this so it would not be possible to display it.

squallstar commented 10 years ago

Yeah I meant tags. The thing is that the library is also stripping any images (tag/URL) and links from the content, so I'm not able to properly display articles with images between paragraphs. :( Any clue?

matth commented 10 years ago

No, it should extract the image too, as a separate field (if it finds one it deems suitable :)

Here's an example:

http://juicer.herokuapp.com/api/article?url=http://www.bbc.co.uk/news/world-africa-16377824

Are you installing it yourself or using the deb package? I think the package is up to date with the image extraction feature, but I might be wrong!
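
For anyone calling the API from a script, the example request above can be built like this. This is only a minimal sketch: the endpoint and the `url` parameter come from the example link, while the helper names `build_article_url` and `fetch_article` are made up for illustration, and the JSON response body is an assumption. No response field names are assumed, so inspect the decoded result to find the image field.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint copied from the example URL in this thread.
JUICER_ARTICLE_ENDPOINT = "http://juicer.herokuapp.com/api/article"


def build_article_url(page_url, endpoint=JUICER_ARTICLE_ENDPOINT):
    """Return the request URL for page_url (hypothetical helper).

    The target address is percent-encoded so that any query string of its
    own is not mistaken for extra API parameters.
    """
    return endpoint + "?" + urlencode({"url": page_url})


def fetch_article(page_url, timeout=10):
    """Fetch and decode the API response (hypothetical helper).

    Assumes the body is JSON; the structure of the result is not assumed.
    """
    with urlopen(build_article_url(page_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_article("http://www.bbc.co.uk/news/world-africa-16377824")` would issue the same request as the link above, with the target URL percent-encoded.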

squallstar commented 10 years ago

Well it extracts the "lead" image, but if the article has more than one image it doesn't keep the other ones that were placed in the middle of the content.

e.g. http://juicer.herokuapp.com/api/article?url=http://www.polygon.com/2014/9/25/6255023/forza-horizon-2-review-xbox-one
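
One possible workaround, sketched here and not something Juicer does itself, is to collect the remaining image URLs from the original page's HTML on the client side. The example uses only Python's standard `html.parser`; `ImageCollector` and `collect_image_sources` are hypothetical names, and the sketch makes no attempt to tell article images from ads, icons, or tracking pixels.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def collect_image_sources(html):
    """Return all non-empty <img> src values found in the HTML string."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return collector.sources
```

The resulting list keeps the order in which images appear in the page, but it does not say where each one sits relative to the extracted text, so placing them between paragraphs would still need extra work.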