Platform: Linux myhostname 5.16.8-200.fc35.x86_64 #1 SMP PREEMPT Tue Feb 8 20:58:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Mercury Parser Version: 2.2.1
Node Version (if a Node bug): v16.13.2
Browser Version (if a browser bug): N/A
Expected Behavior
The mercury-parser CLI can parse any HTML text I feed it.
Current Behavior
mercury-parser can only parse content downloaded over HTTP(S). Unlike tools such as rdrview, there's no way to use a generic extractor on stdin.
Detailed Description
It should be possible to parse HTML content from a file or stdin. This would enable many use cases; each one is an edge case on its own, but together they add up to a significant improvement.
Sometimes I want to see how my static site would be parsed after I build it, but before I deploy it:
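A rough sketch of the workflow I have in mind. The helper name `preview_extraction` is made up for illustration, and it assumes an extractor that reads HTML on stdin, which mercury-parser does not do today:

```shell
# preview_extraction FILE COMMAND [ARGS...]
# Hypothetical helper: feeds a locally built HTML file to any extractor
# command that reads HTML from stdin and writes its result to stdout.
preview_extraction() {
  file=$1
  shift
  # "$@" is the extractor command line supplied by the caller.
  "$@" < "$file"
}
```

With rdrview this could look like `preview_extraction public/index.html rdrview -H` (the path `public/index.html` is just an assumed build output location). I'd like to be able to put mercury-parser in the extractor position.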
Another example: viewing an article in w3m after fetching it with cURL, using some cURL functionality like setting headers and using a proxy (9050 is the port used by the Tor daemon):
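Roughly, the pipeline looks like the sketch below. The function name `read_article` and the `Accept-Language` header are placeholders I picked for illustration; the extractor is passed in as an argument because mercury-parser currently has no stdin mode to put there:

```shell
# read_article URL COMMAND [ARGS...]
# Hypothetical helper: fetch URL with cURL through the local Tor SOCKS proxy,
# pipe the HTML through an extractor that reads stdin, then render it in w3m.
read_article() {
  url=$1
  shift
  curl --silent --show-error --location \
       --header 'Accept-Language: en-US' \
       --proxy socks5h://127.0.0.1:9050 \
       "$url" |
    "$@" |            # extractor command supplied by the caller
    w3m -T text/html  # tell w3m to treat its stdin as HTML
}
```

Today this works for me as something like `read_article https://example.com/post rdrview -H` (example URL only). The request is for mercury-parser to be usable in the same slot.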
I frequently use something resembling the above to read articles from my RSS feed reader, but with a different extractor (rdrview) using a different algorithm (Readability). I'd like to try using Mercury for this.