postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0
5.4k stars 442 forks

Not parsing the full content of a webpage #636

Open amar093 opened 2 years ago

amar093 commented 2 years ago

Expected Behavior

Parse all of the webpage's content.

Current Behavior

Only part of the webpage's content is parsed.

Steps to Reproduce

Parse the content of https://www.startdeck.com/blog/commercial-real-estate-appraisal-a-ten-point-guide-to-cre-valuation/ with the Mercury Parser API.

Detailed Description

When I parse the page with the Mercury extension, it returns all of the content. When I parse it with the API, it returns only about two thirds of the page.

I think the problem is that when the parser converts `<br>` tags to `<p>` tags, it breaks the HTML. Kindly look into it.

```javascript
// Another good candidate for refactoring/optimizing.
// Very imperative code, I don't love it. - AP
// Given cheerio object, convert consecutive <br /> tags into
// <p /> tags instead.
//
// :param $: A cheerio object

function brsToPs$$1($) {
  var collapsing = false;
  $('br').each(function (index, element) {
    var $element = $(element);
    var nextElement = $element.next().get(0);

    if (nextElement && nextElement.tagName.toLowerCase() === 'br') {
      collapsing = true;
      $element.remove();
    } else if (collapsing) {
      collapsing = false;
      paragraphize(element, $, true);
    }
  });
  return $;
}
```
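
To make the control flow of the loop above easier to follow, here is a rough, dependency-free sketch of the same branching. It is not the parser's code: `collapseBrs` and the flat array of sibling element tag names are hypothetical stand-ins for the cheerio selection, text nodes are left out of the model, and the call to `paragraphize` is only recorded as an index, not performed.

```javascript
// Simplified model of the <br /> collapsing loop (hypothetical helper, not part of the parser).
// `siblings` is a flat array of sibling element tag names, e.g. ['span', 'br', 'br', 'span'].
function collapseBrs(siblings) {
  const kept = [];            // tags that survive the pass
  const paragraphizeAt = [];  // indices into `kept` where paragraphize would be called
  let collapsing = false;

  siblings.forEach((tag, i) => {
    if (tag !== 'br') {
      kept.push(tag);
      return;
    }
    if (siblings[i + 1] === 'br') {
      // Inside a run of <br /> tags: drop this one and remember the run.
      collapsing = true;
    } else if (collapsing) {
      // Last <br /> of a run: the original snippet calls paragraphize here.
      collapsing = false;
      paragraphizeAt.push(kept.length);
      kept.push(tag);
    } else {
      // A lone <br /> is left untouched.
      kept.push(tag);
    }
  });

  return { kept, paragraphizeAt };
}

const result = collapseBrs(['span', 'br', 'br', 'br', 'span', 'br', 'span']);
console.log(result.kept.join(', '));   // span, br, span, br, span
console.log(result.paragraphizeAt);    // [ 1 ]
```

In this model, only the last `<br />` of a run of two or more reaches the `paragraphize` branch, and a single `<br />` is never converted. Whether that step is what truncates the content on the page above would still need to be confirmed against the real DOM.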