tcort / markdown-link-extractor

extracts links from markdown texts
ISC License

Marked.js doesn't parse links in front matter headers correctly #13

Open NicolasMassart opened 3 years ago

NicolasMassart commented 3 years ago

Description of the issue

As indicated in tcort/markdown-link-check#128, link parsing in YAML front matter is buggy: the extracted link includes every character after the end of the URL, so it picks up the closing quote (quotes are valid in YAML for delimiting string values). Not supporting this seems to be a deliberate choice on the Marked.js side: markedjs/marked#485

Solving leads

We first need to check whether the latest Marked.js release behaves better.

Then there are two options:

  1. exclude the front matter header parsing from Marked.js parsing and parse it separately for links
  2. switch to a parser that handles front matter and would provide the correct result

The first option is clearly the easiest in my opinion, as we don't know how switching to a new parser would affect existing user projects.
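To make the first option concrete, here is a minimal sketch, assuming the front matter is a leading block delimited by `---` lines. The names `FRONT_MATTER_RE`, `splitFrontMatter` and `linksInFrontMatter` are hypothetical and not part of markdown-link-extractor or Marked.js, and the URL scan is a naive regex rather than a YAML parser.

```javascript
// Sketch only: split a leading YAML front matter block from the Markdown body.
// All names here are hypothetical, not part of any existing library.
const FRONT_MATTER_RE = /^---\r?\n([\s\S]*?)\r?\n---[ \t]*(?:\r?\n|$)/;

function splitFrontMatter(text) {
  const match = FRONT_MATTER_RE.exec(text);
  if (!match) {
    // No front matter found: treat the whole text as Markdown.
    return { frontMatter: '', body: text };
  }
  return { frontMatter: match[1], body: text.slice(match[0].length) };
}

// Naive URL scan for the front matter: stop at whitespace and at quotes,
// so a quoted YAML value does not keep its closing quote.
function linksInFrontMatter(frontMatter) {
  return frontMatter.match(/https?:\/\/[^\s"']+/g) || [];
}

const sample = [
  '---',
  'title: "Example"',
  'source: "https://example.com/docs"',
  '---',
  '',
  'See [the guide](https://example.com/guide).',
].join('\n');

const { frontMatter, body } = splitFrontMatter(sample);
console.log(linksInFrontMatter(frontMatter)); // links found in the header
console.log(body); // what would be handed to the Markdown parser
```

The body would then go through the existing Marked.js-based extraction unchanged, and the two link lists would be concatenated.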

Expectations

Markdown-link-extractor is expected to extract all the links in Markdown files, including those in a front matter header.

Linked issue

#7 also asks for links to be extracted from HTML code included in Markdown. This is the same kind of request, so maybe both could be handled at the same time?

NicolasMassart commented 3 years ago

Looking further into Marked.js, there's markedjs/marked#1716, which seems to be exactly the fix we need here.

wesley-dean-flexion commented 2 years ago

I'm experiencing the same issue.

I also stumbled across the front-matter library, which can separate the front matter from the body (i.e., the Markdown without the front matter): https://www.npmjs.com/package/front-matter#fmstring--allowunsafe-false-

Would it be possible to insert a call that uses the body to grab just the Markdown and skip the front matter? Possibly here: https://github.com/tcort/markdown-link-check/blob/master/markdown-link-check#L166
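Roughly what I have in mind, as a sketch only: it assumes the `front-matter` package is installed and relies on the `body` property of the object it returns. `linksFromBody` and its `extractLinks` parameter are hypothetical names standing in for whatever the tool already calls, and the optional third argument is only there so a different parser can be swapped in.

```javascript
// Sketch only: pass just the Markdown body to the link extractor.
// `parseFrontMatter` falls back to the `front-matter` npm package (assumed installed).
// `extractLinks` is a hypothetical stand-in for the existing extractor function.
function linksFromBody(markdown, extractLinks, parseFrontMatter) {
  const parse = parseFrontMatter || require('front-matter');
  // The parsed result exposes `attributes` and `body`; only `body` is used here.
  const { body } = parse(markdown);
  return extractLinks(body);
}
```

This skips links in the front matter entirely instead of extracting them, so it would avoid the broken results but not meet the expectation in the original description.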