keepcosmos / readability

Readability is an Elixir library for extracting and curating articles.
Apache License 2.0

Request for comments: Should we remove fetching HTTP functionality? #64

Open Valian opened 1 week ago

Valian commented 1 week ago

Right now there's a Readability.summarize(url) function that fetches the article and then parses it.

I'm thinking about:

Why?

Thoughts? Maybe @vkryukov @philipbrown ?

philipbrown commented 1 week ago

@Valian Yeah, that sounds good to me 👍

vkryukov commented 1 week ago

I had the exact same idea! I'm using Req for my use case, and have essentially re-implemented Readability.summarize to work on raw HTML responses and URLs. +1

vkryukov commented 1 week ago

While we are here (and since that might be a breaking change from the API perspective anyways), should we discuss renaming summarize? I don't think it's the best name as it does not technically summarize anything, just extracts different parts of the webpage.

vkryukov commented 1 week ago

Some thoughts about simplifying the API:

  1. Readability.article(html), as proposed above, returns an %Article{} structure with all the fields populated.
  2. We don't keep separate Readability.{title, published_at} etc. functions - they don't bring much to the table (just parse the article and grab the fields you need).
  3. Potentially add some helpers, such as Readability.article_from_file(filename) and such.
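
To make the proposed shape concrete, here is a minimal sketch of what such an API could look like. Everything in it is hypothetical: the `ArticleSketch` module, the `Article` struct and its fields, and the regex-based title extraction are stand-ins for illustration, not the library's actual code (a real implementation would use a proper HTML parser).

```elixir
defmodule ArticleSketch do
  # Hypothetical struct; the field names are illustrative only.
  defmodule Article do
    defstruct [:title, :html]
  end

  # Pure function: takes HTML the caller has already fetched,
  # so no HTTP client is needed inside the library.
  def article(html) when is_binary(html) do
    %Article{title: extract_title(html), html: html}
  end

  # Helper in the spirit of the proposed article_from_file(filename).
  def article_from_file(filename) do
    filename |> File.read!() |> article()
  end

  # Naive <title> extraction, used here only to keep the sketch
  # free of dependencies.
  defp extract_title(html) do
    case Regex.run(~r/<title[^>]*>(.*?)<\/title>/is, html) do
      [_, title] -> String.trim(title)
      nil -> nil
    end
  end
end
```

With this shape, callers fetch the page with whatever client they prefer (Req, for example), pass the response body to `article/1`, and read the fields they need from the returned struct.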

Some downsides of this approach:

  1. Response headers can help determine the type of the file (e.g., we don't want to start parsing a PDF thinking that it's HTML).
  2. URL also contains some useful information (e.g., newspaper3k extracts the date from it).
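
Both downsides could be handled by the caller, or by optional helpers, if fetching moves out of the library. A rough sketch follows, with hypothetical names throughout (`FetchHintsSketch`, `html_content_type?/1`, `date_from_url/1`). The URL heuristic only looks for a `/YYYY/MM/DD` path segment and is not a description of how newspaper3k does it.

```elixir
defmodule FetchHintsSketch do
  # Returns true when a Content-Type header value looks like HTML.
  # Parameters such as "; charset=utf-8" are ignored.
  def html_content_type?(content_type) when is_binary(content_type) do
    media_type =
      content_type
      |> String.split(";", parts: 2)
      |> hd()
      |> String.trim()
      |> String.downcase()

    media_type in ["text/html", "application/xhtml+xml"]
  end

  # Looks for a /YYYY/MM/DD segment in the URL path and returns
  # {:ok, date} only if it forms a valid calendar date.
  def date_from_url(url) when is_binary(url) do
    with %URI{path: path} when is_binary(path) <- URI.parse(url),
         [_, y, m, d] <- Regex.run(~r/\/(\d{4})\/(\d{1,2})\/(\d{1,2})(?:\/|$)/, path),
         {:ok, date} <-
           Date.new(String.to_integer(y), String.to_integer(m), String.to_integer(d)) do
      {:ok, date}
    else
      _ -> :error
    end
  end
end
```

A caller could check `html_content_type?/1` against the response header before handing the body to the parser, and use `date_from_url/1` as a fallback when the page itself carries no publication date.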