postlight / parser-api

🚀 A drop-in replacement for the Postlight Parser API.
Apache License 2.0

<img ....> gets corrupted when using parse-html #20

Open fappelman opened 5 years ago


Expected Behavior

The parser should not corrupt the <img> content.

Current Behavior

The <img> tag originally is

  <img src=\""/>

and after parsing

 <img src="">

Steps to Reproduce

  1. Take the following HTML

Main content
<img src=""/>
More content
More Content to Simulate main content.
<img src=""/>
  2. Call the API with the path /parse-html. The API takes a POST with a JSON object containing a URL and HTML. The HTML is the markup from step 1, first converted to the following format:
<html>\\n<head>\\n<body>\\nMain content\\n<br/>\\n<img src=\"\"/>\\nMore content\\n<br/>\\nMore Content to Simulate main content.\\n<img src=\"\"/>\\n</body>\\n</html>\\n

and the URL value that is passed is

  3. The content returned in the JSON result contains the main content, including the images. The image values, however, are corrupted:
<img src="">
<img src="">
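For anyone reproducing this, here is a minimal sketch of building the request body with a JSON serializer instead of escaping the HTML by hand, so that quotes and newlines are escaped exactly once. The key names "url" and "html", the example URL, and the src value "example.png" are assumptions made for illustration; the report above only says the JSON object contains a URL and HTML.

```python
import json

# Hypothetical input: the src value is a placeholder, not the one from the report.
html = (
    "<html>\n<head>\n<body>\nMain content\n<br/>\n"
    '<img src="example.png"/>\n'
    "More content\n</body>\n</html>\n"
)

# Assumed key names ("url", "html") and a placeholder URL.
payload = {"url": "https://example.com/article", "html": html}

# json.dumps escapes the double quotes and newlines once.
body = json.dumps(payload)

# Round trip: decoding the body gives back the original markup unchanged,
# so any change to the <img> tag would have to happen on the server side.
decoded = json.loads(body)
```

Comparing `decoded["html"]` with `html` is a quick way to rule out the client-side escaping as the cause before looking at the parser itself.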


Am I using the API correctly? I could not find any documentation, so this is a bit of reverse engineering.

The reason for not doing this directly, i.e. using /parser?url=....., is that I am trying to work around a problem where a TypeError is returned. See. The page gives back a 202, which the parser cannot handle. I am now downloading the content and trying to pass the HTML into the API as a workaround instead. Unfortunately, it does not behave as I expected.
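The workaround described above could look roughly like the sketch below: download the page, refuse anything other than a 200 (such as the 202 mentioned), and POST the HTML to /parse-html. The function names, the `api_base` parameter, and the "url"/"html" key names are hypothetical and not taken from the parser-api documentation.

```python
import json
import urllib.request


def build_parse_html_request(api_base: str, page_url: str, html: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /parse-html path."""
    # Assumed key names; the issue only says the object contains a URL and HTML.
    body = json.dumps({"url": page_url, "html": html}).encode("utf-8")
    return urllib.request.Request(
        api_base.rstrip("/") + "/parse-html",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_html(page_url: str) -> str:
    """Download a page and fail loudly on any status other than 200."""
    with urllib.request.urlopen(page_url) as resp:
        if resp.status != 200:
            raise RuntimeError(f"unexpected status {resp.status} for {page_url}")
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def parse_via_html(api_base: str, page_url: str) -> dict:
    """Fetch the page locally, then hand its HTML to the API."""
    html = fetch_html(page_url)
    request = build_parse_html_request(api_base, page_url, html)
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping the request construction separate from the network calls makes it possible to inspect exactly what is sent, which helps when checking whether the `<img>` tags are already altered before they reach the API.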