Feature: support parsing from stdin/files

Platform: Linux myhostname 5.16.8-200.fc35.x86_64 #1 SMP PREEMPT Tue Feb 8 20:58:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Mercury Parser Version: 2.2.1
Node Version (if a Node bug): v16.13.2
Browser Version (if a browser bug): N/A

Expected Behavior

The mercury-parser CLI can parse any HTML text I feed it.

Current Behavior

mercury-parser can only parse content downloaded over HTTP(S). Unlike tools such as rdrview, there's no way to use a generic extractor on stdin.

Detailed Description

It should be possible to parse HTML content from a file or stdin. This would open up a long tail of use-cases; each one is an edge-case on its own, but together they add up to a significant improvement.
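To make the request concrete, here is a rough sketch of how the CLI might resolve its input before handing it to the parser. It is not taken from the Mercury source. The helper names `readHtml` and `buildParseArgs` are hypothetical, and so is the placeholder URL, which is only there because `Mercury.parse` takes a URL as its first argument. The sketch assumes the library's documented `html` and `contentType` options.

```javascript
// Sketch only: resolve CLI input to an HTML string, then build the
// arguments for Mercury.parse(url, options).
const fs = require('fs');

// Hypothetical helper: read HTML from a file path, or from stdin when the
// path is missing or "-". File descriptor 0 is stdin.
function readHtml(source) {
  const target = source === undefined || source === '-' ? 0 : source;
  return fs.readFileSync(target, 'utf8');
}

// Hypothetical helper: the parser still expects a URL, so fall back to a
// placeholder (an assumption of this sketch) when none is supplied.
function buildParseArgs(html, url = 'http://localhost/') {
  return [url, { html, contentType: 'html' }];
}

// Possible wiring inside the CLI (not executed here):
//   const Mercury = require('@postlight/mercury-parser');
//   const [url, options] = buildParseArgs(readHtml(process.argv[2]));
//   Mercury.parse(url, options).then(r => console.log(JSON.stringify(r)));
```

Reading stdin synchronously keeps the sketch short. A real implementation would probably stream the input and let the user pass the original URL so that relative links can be resolved.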

Sometimes I want to see how my static site would be parsed after I build it, but before I deploy it:

mercury-parser --format=html <public/path/to/article.html

Another example: viewing an article in w3m after fetching it with cURL, using some cURL functionality like setting headers and using a proxy (9050 is the port used by the Tor daemon):

curl --user-agent "my user agent" --socks5-hostname localhost:9050 --compressed "$url" | mercury-parser --format=html | jq '.content' -r - | w3m -T text/html

I frequently use something resembling the above to read articles from my RSS feed reader, but with a different extractor (rdrview) using a different algorithm (Readability). I'd like to try using Mercury for this.
