postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0

Mercury parser for web is not working with custom Extractor or prefetched HTML #639

Closed LilaRest closed 1 year ago

LilaRest commented 2 years ago

Expected Behavior

Mercury.parse() works fine on its own. When I add a custom Extractor or pass prefetched HTML, the result should still render correctly, with my Extractor applied, and the parser should work only on the prefetched HTML I give it.

Current Behavior

Mercury.parse() works fine on its own, but as soon as I add a custom Extractor or pass prefetched HTML, the rendered result is broken: images are missing, some text is missing too, and my custom Extractor is not applied.

Steps to Reproduce

  1. Import the Mercury parser script for the web
  2. Run the Mercury parser with some prefetched HTML, for example:
    Mercury.parse(null, {html: document.body.innerHTML}).then(result => document.body.innerHTML = result.content);
  3. Or add a custom Extractor to the parser and then try to use Mercury.parse()

Detailed Description

I'm trying to build a browser extension that parses the content of web pages. I run my tests on Medium-powered websites such as: https://towardsdatascience.com/writing-a-command-line-interface-simulation-game-in-under-30-minutes-using-python-239934f34365

Thanks in advance for your help, Lilian.

jetonkoka commented 2 years ago

You need to pass in the URL in addition to the prefetched HTML. Right now you are passing only the HTML, with null as the URL param. Hope this helps!
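
For illustration, here is a minimal sketch of that fix. It keeps the call shape from the report and swaps the null for the page URL. The helper name `parsePrefetched` and its `parser` argument are hypothetical; in the browser, `parser` would be the `Mercury` global from the web build. A custom Extractor declares a domain, so it presumably also needs the URL to be matched against, though that part is an assumption rather than something confirmed in this thread.

```javascript
// Hypothetical helper, not part of Mercury itself.
// `parser` is assumed to expose parse(url, options), as Mercury does in the report.
function parsePrefetched(parser, url, html) {
  // Fail early instead of silently passing null as the URL.
  if (!url) {
    throw new TypeError('A page URL is required alongside prefetched HTML');
  }
  // Same call as in the reproduction steps, with the URL in place of null.
  return parser.parse(url, { html });
}

// Assumed usage inside a content script, where Mercury is already loaded:
//
// parsePrefetched(Mercury, window.location.href, document.body.innerHTML)
//   .then(result => { document.body.innerHTML = result.content; });
```

Taking the parser as an argument just keeps the sketch independent of how the script is loaded. Calling `Mercury.parse(window.location.href, {html: document.body.innerHTML})` directly should be equivalent.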