thp / urlwatch

Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.
https://thp.io/2008/urlwatch/

How to handle errors like "Connection broken: IncompleteRead" #725

Closed wschoot closed 7 months ago

wschoot commented 1 year ago

I'm tracking a website that sometimes gives me an error message like:

('Connection broken: IncompleteRead(7450 bytes read, 646 more expected)', IncompleteRead(7450 bytes read, 646 more expected))

The configuration I'm using includes the following statements that seem to have no effect on this particular error:

ignore_connection_errors: true
ignore_http_error_codes: 1xx, 4xx, 5xx
timeout: 0

I've also tried "treating" it as a timeout, by setting a stricter timeout and ignoring timeout errors like so:

ignore_connection_errors: true
ignore_http_error_codes: 1xx, 4xx, 5xx
ignore_timeout_errors: true
timeout: 10

But that doesn't help either. What else can I try? This is urlwatch v2.25 on Linux.

thp commented 1 year ago

Have you checked whether the website sends invalid Content-Length headers? Or whether it's just a temporary situation under load? We could add a separate option along the lines of ignore_incomplete_reads: true. Want to make a PR? :)
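One way to check the Content-Length question is to compare the size the server declares with the number of bytes that actually arrive. The following is a minimal sketch using only the Python standard library. The helper name `probe` is hypothetical and not part of urlwatch, and because the failure is intermittent it would have to be run repeatedly (for example from cron) to catch a bad response.

```python
import http.client
import urllib.request


def probe(url, timeout=10):
    """Fetch url once and compare the declared and received body sizes.

    Returns (declared, received, complete). declared is None when the
    server sent no Content-Length header (e.g. for chunked responses).
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        header = resp.headers.get("Content-Length")
        declared = int(header) if header is not None else None
        try:
            body = resp.read()
            complete = True
        except http.client.IncompleteRead as exc:
            # The connection ended before the promised bytes arrived;
            # exc.partial holds whatever was received up to that point.
            body = exc.partial
            complete = False
    return declared, len(body), complete
```

A result where `complete` is False and `received` is smaller than `declared` would point at the server (or something in between) cutting the response short rather than at urlwatch itself.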

wschoot commented 1 year ago

I was unable to test this manually, as it only happens sometimes. I haven't yet put in the effort to set up a cron job that runs curl and saves the output so I can retrace the calls. I'm not too comfortable with Python, so making PRs is not my forte, I'm afraid :)

wfrisch commented 7 months ago

A minimal reproducer that serves incomplete HTTP chunks: https://gist.github.com/wfrisch/bc00bfa049f2aab76dbb73215b1f5bb5

I have regularly observed the same problem in the wild here: https://www.mozilla.org/en-US/security/advisories/

("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
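For readers who don't want to open the gist, here is a rough idea of what such a reproducer can look like. This is a minimal sketch written against the Python standard library, not the code from the gist; the class and function names are made up for illustration. Note that the standard-library client used here reports the problem as `http.client.IncompleteRead`, whereas the message quoted above names it InvalidChunkLength.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class TruncatedChunkHandler(BaseHTTPRequestHandler):
    """Announces a chunked body, then closes before sending any chunk."""

    protocol_version = "HTTP/1.1"

    def do_GET(self):
        self.send_response(200)
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()
        # A valid body would start with a hex chunk-size line such as b"d\r\n".
        # Closing here leaves the client with an empty chunk-length line.
        self.close_connection = True

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_once(port):
    """Return 'ok' or 'incomplete' for one GET against the local server."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/")
        resp = conn.getresponse()
        resp.read()
        return "ok"
    except http.client.IncompleteRead:
        return "incomplete"
    finally:
        conn.close()


if __name__ == "__main__":
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), TruncatedChunkHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    print(fetch_once(server.server_address[1]))
    server.shutdown()
```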

wfrisch commented 7 months ago

Adding this option was straightforward. Feature branch: https://github.com/wfrisch/urlwatch/tree/feat/ignore_incomplete_reads

Steps to reproduce: Run ./http-serve-incomplete-chunks.py (https://gist.github.com/wfrisch/bc00bfa049f2aab76dbb73215b1f5bb5)

Before: urls.yaml:

name: "incomplete-chunk-server"
url: "http://localhost:8080"

urlwatch --urls urls.yaml
[...]
("Connection broken: InvalidChunkLength(got length b'\\r\\n', 0 bytes read)", InvalidChunkLength(got length b'\r\n', 0 bytes read))

After: urls.yaml:

name: "incomplete-chunk-server"
url: "http://localhost:8080"
ignore_incomplete_reads: true

./urlwatch --urls urls.yaml

→ exit code 0

wfrisch commented 7 months ago

An improved reproducer now also emulates regular incomplete reads (wrong Content-Length), as requested in the first comment: https://gist.github.com/wfrisch/63d1163645fa01e3ab1296e752769359

cat urls.yaml

url: "http://localhost:8080/invalid-content-length"
  # ignore_incomplete_reads: true
---
url: "http://localhost:8080/invalid-chunk-length"
  # ignore_incomplete_reads: true

urlwatch --urls urls.yaml

===========================================================================
01. ERROR: http://localhost:8080/invalid-content-length
02. ERROR: http://localhost:8080/invalid-chunk-length
===========================================================================

---------------------------------------------------------------------------
ERROR: http://localhost:8080/invalid-content-length
---------------------------------------------------------------------------
('Connection broken: IncompleteRead(13 bytes read, 10 more expected)', IncompleteRead(13 bytes read, 10 more expected))
---------------------------------------------------------------------------

---------------------------------------------------------------------------
ERROR: http://localhost:8080/invalid-chunk-length
---------------------------------------------------------------------------
("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
---------------------------------------------------------------------------

The new option silences both errors.
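To illustrate the general idea behind such an option (this is not the code from the feature branch, just a sketch under assumptions): an opt-in flag that swallows truncated reads and lets everything else propagate. It uses only the Python standard library, where both failure modes surface as `http.client.IncompleteRead`; the function name `fetch` and the choice to return None for an ignored read are made up for this example.

```python
import http.client
import urllib.request


def fetch(url, ignore_incomplete_reads=False, timeout=10):
    """Return the response body, or None when a truncated read was ignored."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except http.client.IncompleteRead:
        if ignore_incomplete_reads:
            # Opt-in: treat this run as "nothing to report".
            return None
        # Default: surface the error so the problem stays visible.
        raise
```

Keeping the flag off by default means a truncated page is still reported unless the user has decided that the site is known to be flaky.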