hypothesis / via

Proxies third-party PDF files and HTML pages with the Hypothesis client embedded, so you can annotate them
https://via.hypothes.is/
BSD 2-Clause "Simplified" License

Return 404 instead of 400 responses for obviously-invalid URLs #1434

Open robertknight opened 1 week ago

robertknight commented 1 week ago

Requests for "obviously invalid" URLs like https://via.hypothes.is/wp-admin return 400 responses instead of 404. This is inconvenient because we cannot easily filter out such responses in tools such as the New Relic metrics that monitor the overall error rate of the service.

We have encountered situations where a bot hits a large number of URLs like this in a short window of time, typically looking for vulnerabilities in common PHP packages. This triggered an alarm that fires when 80%+ of the service's requests fail over a period of time (10-15 minutes).

The reason for the 400 here is that /wp-admin matches the general route for proxying websites, which treats the part after the initial / as a URL in which the protocol is optional. CheckmateClient.check_url fails to parse wp-admin as a public URL and raises BadURL, which results in a 400 response.
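
To illustrate one possible approach, here is a minimal sketch of a pre-check that could run before the URL is handed to CheckmateClient.check_url. It is not Via's actual code: the function names looks_like_url and error_status are hypothetical, and the "hostname must contain a dot" rule is an assumption about what counts as obviously invalid. It uses only the Python standard library.

```python
from urllib.parse import urlsplit


def looks_like_url(path: str) -> bool:
    """Return True if the part after the initial "/" could plausibly be a public URL.

    Hypothetical helper, not part of Via.
    """
    candidate = path.lstrip("/")
    if not candidate:
        return False

    # The proxy route treats the protocol as optional, so add one if it is
    # missing; otherwise urlsplit would not recognise a hostname.
    if "://" not in candidate:
        candidate = "http://" + candidate

    try:
        hostname = urlsplit(candidate).hostname
    except ValueError:
        # Raised for malformed input such as an unclosed IPv6 bracket.
        return False

    # Assumption: a public hostname contains at least one dot ("example.com"),
    # whereas a bare word like "wp-admin" does not.
    return bool(hostname) and "." in hostname.strip(".")


def error_status(path: str) -> int:
    """Pick the status code to use when a path fails URL validation.

    404 when the path could never have been a URL, 400 when it looked like
    a URL but was rejected for some other reason.
    """
    return 400 if looks_like_url(path) else 404
```

With this sketch, /wp-admin maps to a 404, while /example.com/page still reaches the normal URL check. One limitation: a path such as /wp-login.php contains a dot and would still be treated as a possible hostname, so a real fix may need a stricter rule.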

For context, see https://hypothes-is.slack.com/archives/C074BUPEG/p1728300410941439?thread_ts=1728292002.576029&cid=C074BUPEG.

New Relic alert: https://one.newrelic.com/alerts/issue?account=1385283&duration=259200000&state=e0b2c426-026d-27ee-4aa8-b0894fb965d1

robertknight commented 1 week ago

Some other options:

An advantage of making these requests return a 404 in Via is that it matches how other services would respond in the same scenario, where e.g. /wp-admin would not match any routes.