keepcosmos / readability

Readability is Elixir library for extracting and curating articles.
Apache License 2.0
254 stars 58 forks source link

Add basic top image grabbing functionality #22

Closed jonzlin95 closed 5 years ago

jonzlin95 commented 7 years ago

Needed it for my own project, so I decided to add the functionality. Super simple implementation using OG:Image, Twitter:Image and scanning through all images on a page, find the image most likely to be the top image of the page.

Approach and notes are in the code. Currently supports og:, twitter:, and also will check page for images over a certain height. It handles a few errors, and I also upgraded the package to be compatible with the latest version of Floki. (Fixed some code to make tests pass again).

Am happy to make changes and improvements, if you have different ideas for implementation etc.

keepcosmos commented 7 years ago

Thanks @jonzlin95, That is the feature I want too.

But, there are some point you should fix(or improve)

First of all, fastimage is not a mature project and you use only #size feature. So I do not want to make dependency. Second, Checking "all" images(as a default) is a critical. For performance, checking all images should be optional. Third, Checking image is needed to handling exceptions(404 and something..) What do you think?

jonzlin95 commented 7 years ago

Hey @keepcosmos, I saw it on the roadmap so I decided to build it haha :).

  1. FastImage is a pretty simple library (and his implementation mirrors the Ruby/Python ones with same API). It's certainly possible to roll your own, but I looked at his code and don't see any issues with it. It is also highly unlikely to change for a long time (only minor improvements), so I don't feel any issue with it.

  2. Can agree with that. I will point out that for each image the download is only of ~50 or so bytes so the slow part is just the TCP handshake, but after that we're simply pulling the file headers. I don't want to play directly with your API since you designed the library, but let me know where you think would be a good place to add the optional functionality.

  3. Sounds like a good idea, but will add latency (Checking og: and twitter: each with a FastImage.size). I'm busy for a bit, but I can add the error handling when I get a chance.

keepcosmos commented 7 years ago

I think removing top_image in summary is better. And just use top_image function directly. Thanks, @jonzlin95. Anything use need help just let me know :)