Validate anchor links across pages

manuzhang / mkdocs-htmlproofer-plugin

A MkDocs plugin that validates URLs in rendered HTML files

MIT License

43 stars 16 forks

Validate anchor links across pages #22

Closed johnthagen closed 3 years ago

johnthagen commented 3 years ago

Closes #18

johnthagen commented 3 years ago

@manuzhang I added a "failing" test page that should generate a warning but currently does not.

Unlike c637c44531e5d654581721a04c10bf28cae141a8, we can't use the soup that is passed into get_url_status, because that soup is for the current page, not the page being linked to.

johnthagen commented 3 years ago

It seems like we would need to resolve the URL (e.g. index.html#BAD_ANCHOR) and then parse the destination using BeautifulSoup to verify that the anchor exists. This seems challenging because the user could be running mkdocs build rather than mkdocs serve, so there would be no running server to query directly using requests.
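A minimal sketch of the "parse the destination and look for the anchor" step, assuming the built HTML of the target page is already available as a string. To keep it dependency-free it swaps in the standard library's html.parser for BeautifulSoup; AnchorCollector and html_has_anchor are hypothetical names, not part of the plugin.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect every value that can serve as a link target in an HTML document."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Any element's id, or the legacy <a name="..."> form, is a valid target.
            if value and (name == "id" or (tag == "a" and name == "name")):
                self.anchors.add(value)


def html_has_anchor(html, anchor):
    """Return True if the anchor (without the leading '#') exists in the HTML."""
    collector = AnchorCollector()
    collector.feed(html)
    return anchor in collector.anchors
```

This only covers the check itself; locating the destination HTML without a running server is the harder part described above.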

johnthagen commented 3 years ago

Perhaps the on_post_build() hook could be used somehow.
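One way this could look, as a rough sketch only: cross-page anchors are queued while pages are processed and verified once the whole site has been written to disk, so no server is needed. The class below is a plain Python stand-in rather than a real MkDocs plugin; DeferredAnchorChecker and record are hypothetical names, the hook signature mirrors MkDocs' keyword-only on_post_build(config), and returning a list of problems is an assumption made for illustration (the real hook returns nothing).

```python
import os


class DeferredAnchorChecker:
    """Hypothetical helper: queue cross-page anchors, verify them after the build."""

    def __init__(self):
        # Each entry is (source page, target HTML path relative to site_dir, anchor).
        self.pending = []

    def record(self, source, target, anchor):
        # Would be called while a page is processed, when the target may not exist yet.
        self.pending.append((source, target, anchor))

    def on_post_build(self, *, config):
        # config["site_dir"] is the directory that holds the built site.
        problems = []
        for source, target, anchor in self.pending:
            path = os.path.join(config["site_dir"], target)
            if not os.path.isfile(path):
                problems.append(f"{source}: {target} does not exist")
                continue
            with open(path, encoding="utf-8") as handle:
                html = handle.read()
            # Crude substring check standing in for a real HTML parse.
            if f'id="{anchor}"' not in html:
                problems.append(f"{source}: no anchor #{anchor} in {target}")
        return problems
```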

johnthagen commented 3 years ago

If in our current page we have a link such as:

[link](index.md#elephant)

And index.md has a header such as:

# Elephant

We need to search the page built from index.md for:

<a href="#elephant" class="nav-link">Elephant</a>

johnthagen commented 3 years ago

Another option to consider is that technically we could try to parse the target markdown source file rather than trying to locate and query the actual built HTML.

johnthagen commented 3 years ago

@manuzhang I have implemented a method that finds the target Markdown source and validates it contains a Markdown header for the cross-page anchor. Could you give this a review and tell me what you think?
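For illustration, a minimal sketch of what such a source-level check might look like; it is not the code from this PR. It assumes ATX-style headers (lines starting with #), skips fenced code blocks, and compares header text to the anchor case-insensitively. markdown_headers and source_has_anchor are hypothetical names.

```python
import re

# ATX header: 1-6 '#' characters, whitespace, the text, optional closing '#'s.
HEADER_RE = re.compile(r"^#{1,6}\s+(.+?)\s*#*\s*$")
FENCE_RE = re.compile(r"^(```|~~~)")


def markdown_headers(markdown_text):
    """Return the text of every ATX header outside fenced code blocks."""
    headers = []
    in_fence = False
    for line in markdown_text.splitlines():
        if FENCE_RE.match(line.strip()):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADER_RE.match(line)
        if match:
            headers.append(match.group(1))
    return headers


def source_has_anchor(markdown_text, anchor):
    """Naive check: does any header equal the anchor, ignoring case?"""
    return any(h.lower() == anchor.lower() for h in markdown_headers(markdown_text))
```

A direct comparison like this matches a single-word header such as Elephant against #elephant, but not a multi-word header against its hyphenated anchor.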

johnthagen commented 3 years ago

This still needs a bit more work handling headers with multiple words separated by spaces.

johnthagen commented 3 years ago

GitLab's rules for how anchors are generated from Markdown headers are described at https://stackoverflow.com/a/43276249, and MkDocs seems to do something similar. It may be that trying to go backwards from URL anchor to Markdown header is a bit too complex.

manuzhang commented 3 years ago

@johnthagen thanks for the continued investigation on this task. I might only have time to check and test during the weekend.

johnthagen commented 3 years ago

@manuzhang No problem. This issue may end up being very difficult to address, so we may have to abandon it, or perhaps you or someone else will come along with a better way to solve it than I have been able to come up with.

johnthagen commented 3 years ago

Another idea could be to try to use MkDocs or python-markdown's actual functionality to slugify identified Markdown headers and then compare them with the anchor being checked.
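A minimal sketch of that idea, going forwards from header to slug instead of backwards from anchor to header. It assumes python-markdown's toc extension exposes slugify(value, separator); the fallback used when the package is not installed is only an approximation of that function, and anchor_matches_header is a hypothetical name. A site that configures a custom slugify for the toc extension would need the same function here.

```python
import re
import unicodedata

try:
    # Slug function used by python-markdown's toc extension by default.
    from markdown.extensions.toc import slugify
except ImportError:
    def slugify(value, separator):
        # Approximation: ASCII-fold, drop punctuation, lowercase, join with separator.
        value = unicodedata.normalize("NFKD", value)
        value = value.encode("ascii", "ignore").decode("ascii")
        value = re.sub(r"[^\w\s-]", "", value).strip().lower()
        return re.sub(r"[\s%s]+" % re.escape(separator), separator, value)


def anchor_matches_header(anchor, headers):
    """Return True if slugifying any header text yields the anchor."""
    return any(slugify(header, "-") == anchor for header in headers)
```

Comparing slugs this way handles headers with several words, since the header text is transformed exactly as it would be when the page is rendered.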

johnthagen commented 3 years ago

Heading checking seems to be working, but testing this on a larger project revealed that find_source_markdown() needs more work. It doesn't handle nested Markdown files that have relative links between them.
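A rough sketch of how relative links between nested files could be resolved; this is not the plugin's find_source_markdown(). It assumes paths are expressed relative to the docs directory with forward slashes, and resolve_markdown_target is a hypothetical name.

```python
import posixpath
from urllib.parse import urldefrag


def resolve_markdown_target(current_src, href):
    """Resolve href relative to the page it appears in.

    current_src is the linking page's path relative to the docs directory,
    e.g. "guide/setup/install.md". Returns (target path, anchor).
    """
    target, anchor = urldefrag(href)
    if not target:
        # A bare "#anchor" points back into the current page.
        return current_src, anchor
    base_dir = posixpath.dirname(current_src)
    # normpath collapses "../" segments produced by links between nested folders.
    return posixpath.normpath(posixpath.join(base_dir, target)), anchor
```

For example, a link to ../usage.md#run inside guide/setup/install.md resolves to guide/usage.md with the anchor run.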

johnthagen commented 3 years ago

@manuzhang This is ready to review now. I tried out this PR on a large MkDocs project I maintain and it found two true errors in anchors that otherwise would not have been detected. It did not have any false positives in my project either.

manuzhang commented 3 years ago

@johnthagen Thanks for the nice work!

johnthagen commented 3 years ago

@manuzhang Sure! I think it would be good to cut a new release with this feature included for people to try out.