nornir-automation / nornir

Pluggable multi-threaded framework with inventory management to help operate collections of devices
https://nornir.readthedocs.io/
Apache License 2.0

nornir docs - URL validation #683

Open writememe opened 3 years ago

writememe commented 3 years ago

Hi nornir team,

When contributing to the docs or updating them, I thought there could be a better way to test all the links, to ensure that they are alive and don't reference broken pages. I've mocked up some PoC code to show how you could "crawl" the nornir docs and automate the validation of the URLs. It's not fully finished, but hopefully you get the idea:

# Import libraries
from bs4 import BeautifulSoup
import requests

# Define a list of git branches to be validated
branch_list = ["latest", "develop"]
# Define a list of child-pages
child_pages = [
    "tutorial",
    "howto",
    "plugins",
    "upgrading",
    "contributing",
    "changelog",
    "api2", # deliberately broken child page
    "api",
]

def test_docs_urls(branch_list: list, child_pages: list):
    """
    A test function to test all URLs and hrefs on the official
    docs
    """
    # Initialise some empty lists to add external and internal URLs to
    external_urls = []
    internal_urls = []
    # Initialise an empty list of test URLs
    url_test_list = []
    # Iterate over git branches
    for branch in branch_list:
        # Format the base URL, using the branch
        base_index_url = f"https://nornir.readthedocs.io/en/{branch}/index.html"
        # Add the URL to a list of URLs to test
        url_test_list.append(base_index_url)
        # Iterate over all the child pages
        for page in child_pages:
            # Format the base URL, using the branch and the child page
            child_page_url = (
                f"https://nornir.readthedocs.io/en/{branch}/{page}/index.html"
            )
            # Add the URL to a list of URLs to test
            url_test_list.append(child_page_url)
        print(f"Testing {len(url_test_list)} URLs")
    # Iterate over all the URLs
    for url in url_test_list:
        print("*" * 20)
        print(f"Processing URL: {url}")
        # Make a request to get the URL
        resp = requests.get(url)
        # Validate that response was OK
        if resp.ok:
            # Display the text of the URL in str
            data = resp.text
            # Parse the page with BeautifulSoup so we can use its methods
            soup = BeautifulSoup(data, "html.parser")
            # Iterate over all links on the given page
            for link in soup.find_all("a"):
                # Assign the href of the link to a variable
                href_link = link.get("href")
                # Skip anchor tags without an href attribute
                if href_link is None:
                    continue
                print(f"URL: {href_link} | Status Code: {resp.status_code}")
                # Sort absolute links into external, everything else internal
                if href_link.startswith(("http://", "https://")):
                    external_urls.append(href_link)
                else:
                    internal_urls.append(href_link)
        else:
            print(f"URL: {url} not working - Status Code {resp.status_code}")
    print("^" * 100)
    print("EXTERNAL URLS")
    print(f"{set(external_urls)}")
    print("INTERNAL URLS")
    print(f"{set(internal_urls)}")
    test_external_urls(set(external_urls))
    return set(external_urls)

def test_external_urls(urls):
    """
    Test all external URLs defined in the documentation
    to ensure that they are still valid
    """
    for url in urls:
        # Make a request to get the URL
        resp = requests.get(url)
        # Validate that response was OK
        if resp.ok:
            print(f"Success: {url}")
        # Else, print status code
        else:
            print(f"Failure: {url} - Status Code: {resp.status_code}")

if __name__ == "__main__":
    test_docs_urls(branch_list, child_pages)

Would you be interested in implementing this / adding it to the CI tests? Might be a way of saving time in the long run. If so, I can properly flesh this out and submit it as a pytest unit test, perhaps something like the sketch below. All good if you'd rather not.
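A rough sketch of what the pytest version could look like (same branch/page layout as above):

# Sketch: the same idea as a parametrised pytest test, reusing the
# branch and child-page layout from the PoC above.
import pytest
import requests

BRANCHES = ["latest", "develop"]
CHILD_PAGES = [
    "",  # the base index page
    "tutorial/",
    "howto/",
    "plugins/",
    "upgrading/",
    "contributing/",
    "changelog/",
    "api/",
]

@pytest.mark.parametrize("branch", BRANCHES)
@pytest.mark.parametrize("page", CHILD_PAGES)
def test_docs_page_is_alive(branch: str, page: str):
    url = f"https://nornir.readthedocs.io/en/{branch}/{page}index.html"
    resp = requests.get(url, timeout=10)
    assert resp.ok, f"{url} returned status code {resp.status_code}"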

dbarrosop commented 3 years ago

Yeah, that sounds great. I am pinging @ogenstad because I know he did something similar for the napalm-automation website. Can't recall exactly what he did, but maybe he can chime in. Regardless of the implementation details (this script or something else), I am up for it.

Thanks a lot for bringing this up and proposing a solution already. Let's see what Patrick says and take it from there.

ogenstad commented 3 years ago

On the Napalm website we used html-proofer, which I think is great. It validates the html and links based on a locally generated site. The main downside is that it's written in Ruby so it needs a Ruby environment to run. With Napalm that's fine since Jekyll (also in Ruby) is used as the static site generator. Here's how it's used with Napalm: https://github.com/napalm-automation/napalm-automation.github.io/blob/master/.travis.yml#L17

I'm all for something like this, as I, like many others, have no love for broken links.

I do however think that it should be clear what it is we are testing and what the end goal is. One thing about the above sample code is that it would only check the validity of links that have already been committed and published online. So if someone added a broken link, we'd only see that after the PR had been merged and another one was opened.

I'd suggest that we instead test the links of the local build, and perhaps also schedule a GitHub Action to run once a month or so to check if any of the current links have died.

Looks like Sphinx actually has an option to check external links, the "linkcheck" builder: https://www.sphinx-doc.org/en/master/man/sphinx-build.html. That seems like a good place to start. Unfortunately it doesn't seem to check internal links. If we created the files for the output locally, we could parse those files with either a script like the one above or a tool such as html-proofer. If it doesn't add too much to the CI run, perhaps html-proofer could be installed.
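For the internal half, something along these lines could walk a locally built copy of the site and resolve relative hrefs against the filesystem (a rough sketch; the docs/_build/html output path is an assumption):

# Rough sketch: check internal links against a locally built site.
# The docs/_build/html output directory is an assumption.
from pathlib import Path

from bs4 import BeautifulSoup

def find_broken_internal_links(build_dir: str = "docs/_build/html") -> list:
    broken = []
    for page in Path(build_dir).rglob("*.html"):
        soup = BeautifulSoup(page.read_text(encoding="utf-8"), "html.parser")
        for link in soup.find_all("a", href=True):
            href = link["href"]
            # External links and pure anchors are left to linkcheck
            if href.startswith(("http://", "https://", "mailto:", "#")):
                continue
            # Resolve relative to the current page, dropping any #fragment
            target = (page.parent / href.split("#")[0]).resolve()
            if not target.exists():
                broken.append((str(page), href))
    return broken

if __name__ == "__main__":
    for page, href in find_broken_internal_links():
        print(f"Broken internal link in {page}: {href}")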

@writememe, do you want to take a look at the linkcheck parameter to Sphinx?

We should probably also have this for the nornir.tech site, as that one contains more external links that may die.

writememe commented 3 years ago

Sorry for the late reply, let me address the points made:

I do however think that it should be clear what it is we are testing and what the end goal is

1. To test (within reason) that internal and external hyperlinks are valid within the develop branch, prior to pushing the documentation to the latest branch.
2. A periodic check of all external links on the current website, so that we are informed of dead links.

@writememe, do you want to take a look at the linkcheck parameter to Sphinx?

I can't tell where these docs get built from in the repo, but it seems like you would just add the linkcheck builder wherever that happens.
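For example, something like this could run it from the existing test suite (a rough sketch; the "docs" source directory is an assumption about this repo's layout):

# Rough sketch: run Sphinx's linkcheck builder from pytest.
# The "docs" source directory is an assumption about the repo layout.
import subprocess

def test_docs_external_links(tmp_path):
    result = subprocess.run(
        ["sphinx-build", "-b", "linkcheck", "docs", str(tmp_path)],
        capture_output=True,
        text=True,
    )
    # sphinx-build exits non-zero when linkcheck finds broken links
    assert result.returncode == 0, result.stdout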

I think that upon merge into the develop branch, you could run the internal and external link checker. At that point, you at least know that any new links and all existing links are going to be tested in the develop branch. I'm not sure whether that's better than nothing; up to you and the team.

dbarrosop commented 3 years ago

I think #697 does it. There are a few warnings and things that need to be fixed, but it seems to do the job:

https://github.com/nornir-automation/nornir/pull/697/checks?check_run_id=2868121098#step:10:284