Open jannisborgers opened 1 month ago
lychee ignores all response headers and robots.txt because we're not indexing the page. The problem must be elsewhere.
Can you save the HTML to a file and use that as the input?
lychee -vvv foo.html
If that works, it's because the website doesn't serve the HTML to lychee. In that case, you can try curl as the user agent: https://lychee.cli.rs/troubleshooting/network-errors/#try-a-different-user-agent
If that doesn't work, lychee might have issues parsing the URLs from the document. In that case, could you post an excerpt of the HTML file?
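One way to sanity-check the local run is to count the links in the saved file independently and compare that number with lychee's total. The following is a minimal sketch using Python's standard `html.parser`. The names `LinkCollector` and `extract_links` are hypothetical, and looking only at `href` and `src` attributes is a simplifying assumption, not a description of how lychee extracts links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href and src attribute values from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(html_text):
    """Return all href/src targets found in html_text, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


sample = '<a href="https://www.example.com">Example</a><img src="/logo.png">'
print(len(extract_links(sample)), "links found")
```

Calling `extract_links` on the contents of the saved `foo.html` gives a rough count to hold against the "Total" figure lychee reports. The two numbers need not match exactly, since lychee may extract or deduplicate links differently.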
Hi @mre, thanks for the quick response!
I tried it on the index.html file locally, with the exact command you provided, and it yields:
🔍 529 Total (in 1s) ✅ 6 OK 🚫 518 Errors 💤 4 Excluded
The errors come from the static HTML using the password-protected URLs, so this is the result I was expecting from the local version.
I tried using lychee with the cURL user agent as described, but it still yields:
🔍 0 Total (in 0s) ✅ 0 OK 🚫 0 Errors
But using curl like this works and returns the HTML correctly:
curl -u user:password https://subdomain.domain.tld
So the way I see it, cURL itself is working, but lychee isn't, even when using cURL as a user agent.
I suspect that lychee's basic auth is the source of the problem. I tested other staging sites that use basic auth, and they all came back empty. The output is the same as when I omit --basic-auth on sites with basic auth:
lychee https://protected-subdomain.domain.tld
🔍 0 Total (in 0s) ✅ 0 OK 🚫 0 Errors
I used linkchecker as an alternative, as it also provides basic-auth functionality, and it works correctly on the same URL:
linkchecker -u username -p password https://subdomain.domain.tld
Either I'm using lychee's --basic-auth flag wrong, or that functionality is not working correctly.
Oh, right, I should have read your initial message more carefully.
The basic auth syntax is actually 'example.com user:pwd'. Your version is the other way round. Please try that. If it doesn't work, add the user agent as an additional parameter as well. If that doesn't work either, it's a bug.
We should probably add a documentation page or document the syntax here: https://lychee.cli.rs/troubleshooting/network-errors/
@mre I have the exact same issue. I'm using the basic auth param in the right order, tried both with https:// and without. Works fine on sites without auth.
Indeed. I tried it myself and it doesn't work as advertised; sorry for the inconvenience.
Here's what I did:
import http.server
import socketserver
import base64
import os

# Set username and password for basic auth
USERNAME = 'testuser'
PASSWORD = 'testpass'

class BasicAuthHandler(http.server.SimpleHTTPRequestHandler):
    def do_GET(self):
        # Check for Authorization header
        auth_header = self.headers.get('Authorization')
        if auth_header is None:
            self.send_response(401)
            self.send_header('WWW-Authenticate', 'Basic realm="Test realm"')
            self.end_headers()
        elif auth_header.startswith('Basic '):
            # Verify credentials
            credentials = base64.b64decode(auth_header[6:]).decode('utf-8')
            username, password = credentials.split(':')
            if username == USERNAME and password == PASSWORD:
                # Serve the requested file
                return http.server.SimpleHTTPRequestHandler.do_GET(self)
        self.send_response(401)
        self.end_headers()

# Create a simple HTML file with links
html_content = """
<!DOCTYPE html>
<html>
<body>
<h1>Test Links</h1>
<ul>
<li><a href="https://www.example.com">Example.com</a></li>
<li><a href="https://www.google.com">Google.com</a></li>
<li><a href="https://www.github.com">GitHub.com</a></li>
</ul>
</body>
</html>
"""

# Write the HTML content to a file
with open('index.html', 'w') as f:
    f.write(html_content)

# Set up and start the server
PORT = 8000
Handler = BasicAuthHandler
with socketserver.TCPServer(("", PORT), Handler) as httpd:
    print(f"Serving at port {PORT}")
    print(f"Username: {USERNAME}")
    print(f"Password: {PASSWORD}")
    httpd.serve_forever()
Then I started the server
python test.py
and then I ran lychee
lychee -vvv --basic-auth 'http://localhost:8000 testuser:testpass' http://localhost:8000
🔍 0 Total (in 0s) ✅ 0 OK 🚫 0 Errors
I saw an error on the Python server:
python test.py
Serving at port 8000
Username: testuser
Password: testpass
127.0.0.1 - - [09/Sep/2024 12:09:46] "GET / HTTP/1.1" 401 -
127.0.0.1 - - [09/Sep/2024 12:09:46] "GET / HTTP/1.1" 401 -
curl works as expected
curl -v -u testuser:testpass http://localhost:8000
So, something is off. Either it doesn't work at all, or I forgot how to use it.
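For reference, the credentials that `curl -u` sends, and that the test server above decodes, travel in a single `Authorization` header. Below is a minimal sketch of that encoding as defined in RFC 7617, using the same placeholder credentials. The helper names `basic_auth_header` and `parse_basic_auth` are illustrative only and are not part of lychee or curl.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of the Authorization header for HTTP Basic auth."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def parse_basic_auth(header_value):
    """Decode a Basic Authorization header back into (username, password)."""
    scheme, _, token = header_value.partition(" ")
    if scheme != "Basic":
        raise ValueError("not a Basic auth header")
    decoded = base64.b64decode(token).decode("utf-8")
    # Split on the first colon only: the password may itself contain colons.
    username, _, password = decoded.partition(":")
    return username, password


header = basic_auth_header("testuser", "testpass")
print(header)  # Basic dGVzdHVzZXI6dGVzdHBhc3M=
print(parse_basic_auth(header))  # ('testuser', 'testpass')
```

The 401 entries in the server log above are consistent with the request arriving without this header, though that is an inference rather than something the log itself confirms.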
It's strange, because we have tests for it: https://github.com/lycheeverse/lychee/blob/53d234d18e1eec6ec932e9d4d15d9d9862dba6a2/lychee-bin/tests/cli.rs#L1370-L1429
That said, the tests could be better. We don't have any negative tests (e.g. when the credentials are not provided), and we also don't check the HTTP status code, which should be 200 in case of success and 401 in case of error.
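To illustrate the positive and negative cases described above, here is a rough sketch in Python that mirrors the test server from this thread. It is not lychee's test code (which is written in Rust). `AuthHandler`, `fetch_status`, and `run_checks` are hypothetical names, and the server binds to an ephemeral local port so the checks can run without any external site.

```python
import base64
import http.server
import threading
import urllib.error
import urllib.request

USERNAME = "testuser"  # placeholder credentials, same as the test server above
PASSWORD = "testpass"
EXPECTED = "Basic " + base64.b64encode(f"{USERNAME}:{PASSWORD}".encode()).decode()


class AuthHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("Authorization") == EXPECTED:
            body = b'<a href="https://www.example.com">Example.com</a>'
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # Missing or wrong credentials: challenge the client once.
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="Test realm"')
            self.send_header("Content-Length", "0")
            self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the output quiet


def fetch_status(url, auth_header=None):
    """Return the HTTP status code of a GET request, optionally authenticated."""
    request = urllib.request.Request(url)
    if auth_header is not None:
        request.add_header("Authorization", auth_header)
    # An empty ProxyHandler makes sure the local request bypasses any proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def run_checks():
    """Start the server on a free port and collect status codes for three cases."""
    server = http.server.HTTPServer(("127.0.0.1", 0), AuthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/"
    wrong = "Basic " + base64.b64encode(b"testuser:nope").decode()
    try:
        return {
            "valid": fetch_status(url, EXPECTED),
            "missing": fetch_status(url),
            "wrong": fetch_status(url, wrong),
        }
    finally:
        server.shutdown()
        server.server_close()


print(run_checks())
```

A test along these lines would assert 200 for the valid case and 401 for both the missing and the wrong credentials, so a client that silently drops the header would be caught.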
I found this thread after I tried lychee on several of my sites when deploying them to a staging environment and always got no results.
Basic auth is a must in these situations, so I was happy that lychee supports it.
As an additional complication, the CMS we're using sends an X-Robots-Tag: none response header in staging environments, as they are deliberately not supposed to be indexed. Is that something that lychee supports, or does it ignore that header? From the messages in the above thread, I could not find out whether robots.txt is currently ignored or not.
Right now, I get the following response:
The format I'm using is:
There are about 500 links on that page. I verified with curl that the basic auth is working correctly: it returns the HTML response.