postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0

Feature: parse support stdin/files #651

Open Seirdy opened 2 years ago

Seirdy commented 2 years ago

Expected Behavior

The mercury-parser CLI can parse any HTML text I feed it.

Current Behavior

mercury-parser can only parse content downloaded over HTTP(S). Unlike tools such as rdrview, there's no way to use a generic extractor on stdin.

Detailed Description

It should be possible to parse HTML content from a file or stdin. This would enable many use cases; each one is an edge case on its own, but together they add up to a significant improvement.

Sometimes I want to see how my static site would be parsed after I build it, but before I deploy it:

mercury-parser --format=html <public/path/to/article.html

Another example: viewing an article in w3m after fetching it with cURL, which lets me use cURL features such as setting headers and routing through a proxy (9050 is the port used by the Tor daemon):

curl --user-agent "my user agent" --socks5-hostname localhost:9050 --compressed "$url" | mercury-parser --format=html | jq -r '.content' | w3m -T text/html

I frequently use something resembling the above to read articles from my RSS feed reader, but with a different extractor (rdrview) using a different algorithm (Readability). I'd like to try using Mercury for this.

clach04 commented 1 year ago

I'd very much like this functionality. It looks like the library supports this today, but not the CLI wrapper tool.