Request for comments: Should we remove fetching HTTP functionality?

keepcosmos / readability

Readability is an Elixir library for extracting and curating articles.

Apache License 2.0

Request for comments: Should we remove fetching HTTP functionality? #64

Open Valian opened 1 week ago

Valian commented 1 week ago

Right now there's a Readability.summarize(url) function that fetches the article and then parses it.

I'm thinking about:

removing fetching functionality from Readability
removing httpoison from dependencies
relying on Readability.article(html) as the entry point to the library, with the expectation that users fetch the HTML on their own
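
A minimal sketch of what this could look like from the caller's side, assuming the HTTP client stays entirely outside the library. The module name ArticleFetcher and its extract/3 function are hypothetical and not part of Readability; the commented wiring assumes Req as the client and Readability.article/1 as the entry point proposed above.

```elixir
defmodule ArticleFetcher do
  # Hypothetical caller-side helper, not part of Readability.
  # `fetch` is any 1-arity function returning {:ok, html} or {:error, reason}.
  # `parse` is whatever turns HTML into an article, e.g. &Readability.article/1.
  def extract(url, fetch, parse) when is_function(fetch, 1) and is_function(parse, 1) do
    with {:ok, html} <- fetch.(url) do
      {:ok, parse.(html)}
    end
  end
end

# Example wiring with Req (assumes :req and :readability are in your deps):
#
#   fetch = fn url ->
#     case Req.get(url, headers: [{"user-agent", "my-app"}]) do
#       {:ok, %Req.Response{status: 200, body: body}} -> {:ok, body}
#       {:ok, %Req.Response{status: status}} -> {:error, {:http_status, status}}
#       {:error, reason} -> {:error, reason}
#     end
#   end
#
#   ArticleFetcher.extract("https://example.com/post", fetch, &Readability.article/1)
```

Because the fetch step is just a function, an app that already uses HTTPoison or Tesla can pass its own client with its own headers and proxy settings, and a fetch error is returned to the caller unchanged.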

Why?

There are various approaches to scraping. Some apps use Req, some HTTPoison, some Tesla. Using multiple clients in a single app doesn't really make sense.
People might need different settings: HTTP headers, proxies, etc.
There's some maintenance overhead of keeping it around - updating dependencies etc.

Thoughts? Maybe @vkryukov @philipbrown ?

philipbrown commented 1 week ago

@Valian Yeah, that sounds good to me 👍

vkryukov commented 1 week ago

I had the exact same idea! I'm using Req for my use case, and have essentially re-implemented Readability.summarize to work on raw HTML responses and URLs. +1

vkryukov commented 1 week ago

While we are here (and since that might be a breaking change from the API perspective anyway), should we discuss renaming summarize? I don't think it's the best name, as it does not technically summarize anything; it just extracts different parts of the webpage.

vkryukov commented 1 week ago

Some thoughts about simplifying the API:

Readability.article(html), as proposed above, returns an %Article{} structure with all the fields populated.
We don't have separate Readability.{title, published_at} etc. functions, since they don't bring much to the table (just parse the article and grab the fields you need).
Potentially add some helpers, such as Readability.article_from_file(filename) and such.

Some downsides of this approach:

Response headers can help determine the type of the file (e.g., we don't want to start parsing a PDF thinking that it's HTML).
URL also contains some useful information (e.g., newspaper3k extracts the date from it).
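
Both downsides could in principle be handled on the caller's side before the HTML reaches the library. The sketch below is one possible way to do that; the FetchHints module and both functions are hypothetical and not part of Readability. It assumes response headers arrive as a list of {name, value} string tuples (some clients use a different shape) and that dates appear in the URL path as /YYYY/MM/DD, which is only one of the patterns a tool like newspaper3k might look for.

```elixir
defmodule FetchHints do
  # Hypothetical helpers, not part of Readability.

  # Returns true when a Content-Type header looks like HTML.
  # Assumes `headers` is a list of {name, value} string tuples.
  def html?(headers) do
    Enum.any?(headers, fn {name, value} ->
      String.downcase(name) == "content-type" and
        String.starts_with?(String.downcase(value), ["text/html", "application/xhtml+xml"])
    end)
  end

  # Extracts a /YYYY/MM/DD date from a URL, if one is present and valid.
  # Returns {:ok, %Date{}} or :error.
  def date_from_url(url) do
    with [_, y, m, d] <- Regex.run(~r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)", url),
         {:ok, date} <-
           Date.new(String.to_integer(y), String.to_integer(m), String.to_integer(d)) do
      {:ok, date}
    else
      _ -> :error
    end
  end
end
```

A caller could check FetchHints.html?(headers) before handing the body to Readability.article(html), and fall back to FetchHints.date_from_url(url) when the page itself yields no publication date.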