lycheeverse / lychee

⚡ Fast, async, stream-based link checker written in Rust. Finds broken URLs and mail addresses inside Markdown, HTML, reStructuredText, websites and more!
https://lychee.cli.rs
Apache License 2.0

Basic Auth not working #1495

Open jannisborgers opened 1 month ago

jannisborgers commented 1 month ago

I found this thread after trying lychee on several of my sites while deploying them to a staging environment. Every run came back with no results.

Basic auth is a must in these situations, so I was happy that lychee supports it.

As an additional complication, the CMS we’re using sends an X-Robots-Tag: none response header in staging environments, because staging is deliberately not supposed to be indexed. Does lychee respect that header, or does it ignore it? From the messages in the above thread, I could not tell whether robots.txt is currently ignored.

Right now, I get the following response:

πŸ” 0 Total (in 0s) βœ… 0 OK 🚫 0 Errors

The format I’m using is:

lychee --basic-auth 'user:password https://subdomain.domain.tld' https://subdomain.domain.tld

There are about 500 links on that page. I verified with curl that basic auth is working correctly: it returns the HTML response.

mre commented 1 month ago

lychee ignores all response headers and robots.txt because we're not indexing the page. The problem must be elsewhere.

Can you save the html into a file and use that as the input?

lychee -vvv foo.html

If that works, it's because the website doesn't serve the HTML to lychee. In that case you can try curl as a user agent. https://lychee.cli.rs/troubleshooting/network-errors/#try-a-different-user-agent

If that doesn't work, lychee might have issues parsing the URLs from the document. In that case, could you post an excerpt of the HTML file?

jannisborgers commented 1 month ago

Hi @mre, thanks for the quick response!

Locally

I tried it on the index.html file locally, with the exact command you provided, and it yields:

πŸ” 529 Total (in 1s) βœ… 6 OK 🚫 518 Errors πŸ’€ 4 Excluded

The errors come from the static html usign the password-protected URLs, so this is the result I was expecting from the local version.

cURL user-agent

I tried using lychee with the cURL user-agent like described, but it still yields:

πŸ” 0 Total (in 0s) βœ… 0 OK 🚫 0 Errors

But using curl like this works and returns the HTML correctly:

curl -u user:password https://subdomain.domain.tld

So the way I see it, cURL itself is working, but lychee isn’t, even when using cURL as a user agent.

Basic auth problem?

I suspect that the basic auth of lychee is the source of the problem. I tested on other staging sites that had basic auth and they all came back empty-handed. The output is the same as when I omit --basic-auth on sites with basic auth:

lychee https://protected-subdomain.domain.tld
πŸ” 0 Total (in 0s) βœ… 0 OK 🚫 0 Errors

Other tools work

I used linkchecker as an alternative, as it also provides basic-auth functionality, and it works correctly on the same URL:

linkchecker -u username -p password https://subdomain.domain.tld

Either I’m using lychee’s --basic-auth flag wrong, or that functionality is not working correctly.

mre commented 1 month ago

Oh, right, I should have read your initial message more carefully.

Basic auth syntax is actually: 'example.com user:pwd'. Your version is the other way round. Please try that. If it doesn't work, add the user agent as an additional parameter as well. If that still doesn't work, it's a bug.
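To make the expected field order concrete, here is a minimal Python sketch of how a value in the form 'example.com user:pwd' splits into its parts. The function name `split_basic_auth_arg` and its error messages are hypothetical and only illustrate the syntax described above; this is not lychee's actual parser.

```python
def split_basic_auth_arg(value: str) -> tuple[str, str, str]:
    """Split '<uri> <user>:<password>' into (uri, user, password).

    Hypothetical helper for illustration only, not lychee's implementation.
    """
    parts = value.split()
    if len(parts) != 2:
        raise ValueError("expected two fields: '<uri> <user>:<password>'")
    uri, credentials = parts
    # A URL in the second position suggests the two fields were swapped.
    if "://" in credentials or ":" not in credentials:
        raise ValueError("second field must be '<user>:<password>' (order swapped?)")
    # Split only once so the password itself may contain ':'.
    user, password = credentials.split(":", 1)
    return uri, user, password
```

With this sketch, 'example.com user:pwd' yields the three parts, while the reversed form 'user:password https://subdomain.domain.tld' is rejected because a URL appears where the credentials should be.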

We should probably add a documentation page or document the syntax here: https://lychee.cli.rs/troubleshooting/network-errors/

ul8 commented 1 month ago

@mre I have the exact same issue. I'm using the basic auth param in the right order, tried both with https:// and without. Works fine on sites without auth.

mre commented 1 month ago

Indeed. I tried it myself and it doesn't work as advertised; sorry for the inconvenience.

Here's what I did:

  1. Created a webserver with basic auth which serves some links behind the auth:
import http.server
import socketserver
import base64
import os

# Set username and password for basic auth
USERNAME = 'testuser'
PASSWORD = 'testpass'

class BasicAuthHandler(http.server.SimpleHTTPRequestHandler):
    def do_GET(self):
        # Check for Authorization header
        auth_header = self.headers.get('Authorization')
        if auth_header is None:
            # No credentials sent: challenge the client and stop here
            self.send_response(401)
            self.send_header('WWW-Authenticate', 'Basic realm="Test realm"')
            self.end_headers()
            return
        if auth_header.startswith('Basic '):
            # Verify credentials; partition once so passwords may contain ':'
            credentials = base64.b64decode(auth_header[6:]).decode('utf-8')
            username, _, password = credentials.partition(':')
            if username == USERNAME and password == PASSWORD:
                # Serve the requested file
                return http.server.SimpleHTTPRequestHandler.do_GET(self)

        # Unsupported scheme or wrong credentials
        self.send_response(401)
        self.end_headers()

# Create a simple HTML file with links
html_content = """
<!DOCTYPE html>
<html>
<body>
    <h1>Test Links</h1>
    <ul>
        <li><a href="https://www.example.com">Example.com</a></li>
        <li><a href="https://www.google.com">Google.com</a></li>
        <li><a href="https://www.github.com">GitHub.com</a></li>
    </ul>
</body>
</html>
"""

# Write the HTML content to a file
with open('index.html', 'w') as f:
    f.write(html_content)

# Set up and start the server
PORT = 8000
Handler = BasicAuthHandler

with socketserver.TCPServer(("", PORT), Handler) as httpd:
    print(f"Serving at port {PORT}")
    print(f"Username: {USERNAME}")
    print(f"Password: {PASSWORD}")
    httpd.serve_forever()

Then I started the server

python test.py

and then I ran lychee

lychee -vvv --basic-auth 'http://localhost:8000 testuser:testpass' http://localhost:8000
πŸ” 0 Total (in 0s) βœ… 0 OK 🚫 0 Errors

I saw an error on the Python server:

python test.py
Serving at port 8000
Username: testuser
Password: testpass
127.0.0.1 - - [09/Sep/2024 12:09:46] "GET / HTTP/1.1" 401 -
127.0.0.1 - - [09/Sep/2024 12:09:46] "GET / HTTP/1.1" 401 -

curl works as expected

curl -v -u testuser:testpass http://localhost:8000
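For comparison, this is the header a client has to send for the test server to answer with 200. The sketch below builds the `Authorization` value for the test credentials by base64-encoding `user:password`. The helper name `basic_auth_header` is made up for this example.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return the value of the Authorization header for HTTP Basic auth."""
    # Basic auth is the base64 encoding of "username:password".
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


print(basic_auth_header("testuser", "testpass"))
# prints: Basic dGVzdHVzZXI6dGVzdHBhc3M=
```

If that header is missing from a request, the test server above answers with 401, which matches the two log lines it printed during the lychee run.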

So, something is off. Either it doesn't work at all, or I forgot how to use it.

It's strange, because we have tests for it: https://github.com/lycheeverse/lychee/blob/53d234d18e1eec6ec932e9d4d15d9d9862dba6a2/lychee-bin/tests/cli.rs#L1370-L1429

That said, the tests could be better. We don't have any negative tests (e.g. when the credentials are not provided), and we also don't check the HTTP status code, which should be 200 on success and 401 on error.
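As a rough illustration of the missing negative cases, the following Python sketch starts a small basic-auth server on a free local port and records the status code for valid, missing, and wrong credentials. It is not part of lychee's Rust test suite. The names `AuthHandler`, `status_for`, and `run_checks` are invented for this example, and it talks to the server with `urllib` rather than lychee itself.

```python
import base64
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

USERNAME, PASSWORD = "testuser", "testpass"
EXPECTED = "Basic " + base64.b64encode(f"{USERNAME}:{PASSWORD}".encode()).decode()


class AuthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("Authorization") == EXPECTED:
            body = b'<a href="https://www.example.com">Example.com</a>'
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # Missing or wrong credentials: challenge the client
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="Test realm"')
            self.send_header("Content-Length", "0")
            self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the output quiet


def status_for(url, credentials=None):
    """Return the HTTP status for a GET, optionally with 'user:password'."""
    request = urllib.request.Request(url)
    if credentials is not None:
        token = base64.b64encode(credentials.encode()).decode()
        request.add_header("Authorization", f"Basic {token}")
    # Bypass any proxy settings from the environment for the local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def run_checks():
    server = HTTPServer(("127.0.0.1", 0), AuthHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/"
    try:
        return {
            "valid": status_for(url, "testuser:testpass"),
            "missing": status_for(url),
            "wrong": status_for(url, "testuser:wrong"),
        }
    finally:
        server.shutdown()
        server.server_close()
```

Calling `run_checks()` should give 200 for the valid credentials and 401 for the other two cases. A lychee test along the same lines would assert both outcomes instead of only the success path.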