postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0

Parse pre-fetched HTML with command-line tool #564

Open acontia opened 4 years ago

acontia commented 4 years ago

Hi,

With the command line tool, is it possible to parse custom or pre-fetched HTML by passing an HTML string to the parse function?

I want to do something like the following, but using the command line tool provided:

Mercury.parse(url, {
  html:
    '<html><body><article><h1>Thunder (mascot)</h1><p>Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos</p></article></body></html>',
}).then(result => console.log(result));

I tried the following but it doesn't seem to be supported:

./mercury-parser "http://example.com" --html='<html><body><article><h1>Thunder (mascot)</h1><p>Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos</p></article></body></html>'

Any idea?

ttimasdf commented 3 years ago

You can write a custom wrapper, but there seem to be some encoding problems when passing pre-fetched data into Mercury.parse. Ref: https://github.com/ttimasdf/ArchiveBox/commit/78477dc387908677fb65c7d0f1b09edd1063d970#commitcomment-42621227