manuzhang / mkdocs-htmlproofer-plugin

A MkDocs plugin that validates URL in rendered html files
MIT License
40 stars 16 forks source link

Incorrect broken links reported #70

Closed SeanTAllen closed 5 months ago

SeanTAllen commented 6 months ago

Describe the bug

I noticed this with the blog plugin for mkdocs material. Any link to a blog post referenced as "/blog/posts/blah.md" will be identified as not having a source file.

This turns out to be based on how the files dictionary is used. At the time that the key for the files is set up, it uses the url which isn't going to be what is looked up in the blog case. The current setup assumes that the dest_uri will the the url at the time of on_files. That isn't the case with the blog plugin and might not be the case with other files (at this time, I have only verified with the blog plugin).

At the time of "on_files", the information about what the dest_uri's will be is not yet available. It is the same value as "url". However, by the time we are checking to see if a path exists, each File object has been updated and the dest_uri is correct.

The bug can be fixed by making self.files a list and then in find_source_file, iterate over the list of all files and compare search_path to the dest_uri of each File. If the two match, we've found the correct file object to return.

To Reproduce

Turn off use_directory_urls, run html-proofer on a site that is using the mkdocs-material blog plugin and linking to posts like /blog/posts/foo.md.

Environment (please complete the following information):

The occurs in GitHub CI using the following config: https://github.com/ponylang/ponylang-website/blob/main/.github/workflows/pr.yml#L33

It occurs for me locally with Ubuntu 22. Local python version is Python 3.10.12

pip list:

Package                    Version
-------------------------- ----------------------
Babel                      2.14.0
beautifulsoup4             4.12.3
blinker                    1.4
certifi                    2024.2.2
charset-normalizer         3.3.2
click                      8.1.7
colorama                   0.4.6
cryptography               3.4.8
dbus-python                1.2.18
distro                     1.7.0
distro-info                1.1build1
ghp-import                 2.1.0
gitdb                      4.0.11
GitPython                  3.1.41
httplib2                   0.20.2
idna                       3.6
importlib-metadata         4.6.4
jeepney                    0.7.1
Jinja2                     3.1.3
keyring                    23.5.0
launchpadlib               1.10.16
lazr.restfulclient         0.14.4
lazr.uri                   1.0.6
Markdown                   3.5.2
MarkupSafe                 2.1.5
mergedeep                  1.3.4
mkdocs                     1.5.3
mkdocs-ezlinks-plugin      0.1.14
mkdocs-htmlproofer-plugin  1.1.0.dev0
mkdocs-material            9.5.10+insiders.4.52.2
mkdocs-material-extensions 1.3.1
mkdocs-rss-plugin          1.12.1
more-itertools             8.10.0
oauthlib                   3.2.0
packaging                  23.2
paginate                   0.5.6
pathspec                   0.12.1
pip                        22.0.2
platformdirs               4.2.0
Pygments                   2.17.2
PyGObject                  3.42.1
pygtrie                    2.5.0
PyJWT                      2.3.0
pymdown-extensions         10.7
pyparsing                  2.4.7
python-apt                 2.3.0+ubuntu2.1
python-dateutil            2.8.2
PyYAML                     5.4.1
pyyaml_env_tag             0.1
regex                      2023.12.25
requests                   2.31.0
SecretStorage              3.3.1
setuptools                 59.6.0
six                        1.16.0
smmap                      5.0.1
soupsieve                  2.5
tzdata                     2024.1
unattended-upgrades        0.1
urllib3                    2.2.0
wadllib                    1.3.6
watchdog                   4.0.0
wheel                      0.37.1
zipp                       1.0.0

Contents of mkdocs.yml can be seen at:

https://github.com/ponylang/ponylang-website/blob/main/mkdocs.yml


I'm happy to open a PR with the fix detailed earlier. If you would like me to do so, please let me know. I'm going to prep that soon and switch us to using our fork with the fix. Opening a PR would be very easy after that.

manuzhang commented 6 months ago

@SeanTAllen Thanks for reporting the issue. Please go ahead if you've got the fix ready.

SeanTAllen commented 6 months ago

@manuzhang see https://github.com/manuzhang/mkdocs-htmlproofer-plugin/pull/72

SeanTAllen commented 6 months ago

@manuzhang I will need to do an update to this. It currently as I did it, only works for links into the blog from outside of it. It doesn't work for intra-blog links.

manuzhang commented 6 months ago

@SeanTAllen Appreciate it! I will wait for your fix to make a new release.

SeanTAllen commented 6 months ago

@manuzhang this is going to be tricky. the problem is that when you are in a blog post, the relative link logic to figure out from a url like "other-post.html" is incorrect, but only if you are inside that magical "mkdocs material blogpost land". Otherwise it is correct.

So there's not a simple fix at the point of lookup. There might be a fix in terms of lifecycle and doing checks at a different point or collecting additional information to do resolution. It also might be that this part of the problem is difficult to solve (like the directory urls and anchors issue).

I wouldn't hold a release on this.

The fix is an improvement and it might be a decent amount of time before I can figure out how to address this.

SeanTAllen commented 6 months ago

I think I know how toto address.

SeanTAllen commented 6 months ago

Ok, I know how to address. @manuzhang. There's two variations on how this can be coded.

The key to all this is that the relative path lookup in find_source_file is overly simplistic. It makes assumptions that aren't true within the blog.

What we need to do is take the src_path and use that the find the src_uri for the page we are checking links for. We then use the src_uri to get the parent uri rather than using the src_path.

There are a couple ways to deal with this without doing a linear scan of the files Dict in find_source_file.

We can store a lookup by both url and src_uri in the Files lookup dictionary and use that one dictionary and use it for both purposes.

We can create a second dictionary and pass it all the way down through.

You can see a "single dictionary" version here: https://github.com/SeanTAllen/mkdocs-htmlproofer-plugin/tree/intra-blog-links

SeanTAllen commented 5 months ago

btw, from my manual testing on our site, this fix also fixes issue #46.

manuzhang commented 5 months ago

@SeanTAllen Thanks for your great effort to debug and come up with solutions over the weekends. Are you planning a PR soon?

SeanTAllen commented 5 months ago

Yes. Would you prefer the single dictionary solution I currently have? If yes, I can open a PR soon. If you prefer the two dictionaries, that will take a bit more time. Probably a week or so as I have other things to get done this week.

manuzhang commented 5 months ago

I think single dictionary will do

SeanTAllen commented 5 months ago

I'll open a PR sometime in the next couple of days.

SeanTAllen commented 5 months ago

@manuzhang PR is open

manuzhang commented 5 months ago

@SeanTAllen Thanks for working on this issue and other improvements. If there are no other issues, I'm going to release 1.1.0 soon.

SeanTAllen commented 5 months ago

Nothing else at the moment