pelias / api

HTTP API for Pelias Geocoder
http://pelias.io
MIT License
221 stars 162 forks source link

Long query with special characters causes 500 error #1574

Closed orangejulius closed 2 years ago

orangejulius commented 2 years ago

We recently saw some 500 errors in Geocode Earth production that initially gave us a bit of a scare! It turned out to be some fairly wild input from one of our users that triggered a 500 error in the Pelias API.

We've solved a few problems with special characters, emoji, long inputs, etc recently but this one is especially confusing to me.

Here's a URL that demonstrates the problem:

http://localhost:3100/v1/search?text=ASDFJK%20%D0%A1%D0%A2%D0%A0%D0%90%D0%9D%D0%AB%20%D0%91%D0%AB%D0%9B%D0%98%20%D0%9A%D0%A2%D0%9E%20%D0%A1%D0%9B%D0%A3%D0%A8%D0%90%D0%95%D0%A2%20%D0%9C%D0%95%D0%9D%D0%AF%D0%A41%F0%9F%92%92%F0%9F%8C%8B%F0%9F%9A%AD%F0%9F%9A%B1%E2%AC%86%EF%B8%8F%E2%86%97%EF%B8%8F%E2%9E%A1%EF%B8%8F%E2%86%98%EF%B8%8F%E2%AC%87%EF%B8%8F%E2%86%99%EF%B8%8F%E2%AC%85%EF%B8%8F%E2%86%96%EF%B8%8F%E2%86%95%EF%B8%8F%E2%86%94%EF%B8%8F%E2%86%A9%E2%86%AA%E2%A4%B4%EF%B8%8F%F0%9F%91%AF%E2%80%8D%E2%99%82%EF%B8%8F%F0%9F%87%B0%F0%9F%87%BF%E2%A4%B5%EF%B8%8F%F0%9F%94%83%F0%9F%94%84%F0%9F%9B%90%E2%9A%9B%F0%9F%95%89%E2%9C%A1%E2%98%B8%E2%99%8D%E2%98%AF%EF%B8%8F%E2%98%B8%E2%9C%A1%F0%9F%95%89%F0%9F%9B%90%E2%9A%9B%F0%9F%94%85%F0%9F%94%86%F0%9F%94%B4%F0%9F%9F%A0%F0%9F%9F%A1%F0%9F%9F%A2%D0%9D%D0%97%D0%95%D0%92%D0%98%20%D0%9B%20%D0%9A%20%D0%92%D0%9E%20%D0%A3%D0%9D%D0%A2%D0%92%D0%90854%D0%A332%D0%991%F0%9F%94%B5%F0%9F%9F%A3%F0%9F%9F%A4%E2%9A%AA%E2%9A%AB%E2%9A%AB%F0%9F%9F%A5%F0%9F%9F%A7%F0%9F%9F%A8%F0%9F%94%B6%EF%B8%8F%F0%9F%9F%A9%F0%9F%9F%A6%F0%9F%9F%AA%F0%9F%9F%AB%E2%AC%9B%E2%AC%9C%E2%97%BC%E2%97%BB%E2%97%BE%E2%97%BD%F0%9F%92%A0

Oddly, this URL only appears to cause issues for me when using it on the command line with curl. Pasting it into a browser works fine.

This is the error printed by the API in that case:

URIError: URI malformed
    at encodeURI (<anonymous>)
    at synthesizeUrl (/home/julian/repos/pelias/api/node_modules/pelias-microservice-wrapper/service.js:21:12)
    at /home/julian/repos/pelias/api/node_modules/pelias-microservice-wrapper/service.js:75:25
    at controller (/home/julian/repos/pelias/api/controller/libpostal.js:73:5)
    at Layer.handle [as handle_request] (/home/julian/repos/pelias/api/node_modules/express/lib/router/layer.js:95:5)
    at trim_prefix (/home/julian/repos/pelias/api/node_modules/express/lib/router/index.js:317:13)
    at /home/julian/repos/pelias/api/node_modules/express/lib/router/index.js:284:7
    at Function.process_params (/home/julian/repos/pelias/api/node_modules/express/lib/router/index.js:335:12)
    at next (/home/julian/repos/pelias/api/node_modules/express/lib/router/index.js:275:10)
    at setQuerySize (/home/julian/repos/pelias/api/middleware/sizeCalculator.js:22:5)

From the browser I get a reasonable response. I assume the browser does some cleaning that helps fix things:

{
  "geocoding": {
    "version": "0.2",
    "attribution": "http://192.168.1.191:3100/attribution",
    "query": {
      "text": "ASDFJK СТРАНЫ БЫЛИ КТО СЛУШАЕТ МЕНЯФ1🇰🇿🟠🟡🟢НЗЕВИ Л К ВО УНТВА854У32Й1🟣🟤🟥🟧🟨🟩🟦🟪🟫",
      "size": 10,
      "private": false,
      "lang": {
        "name": "English",
        "iso6391": "en",
        "iso6393": "eng",
        "via": "header",
        "defaulted": false
      },
      "querySize": 20,
      "parser": "pelias",
      "parsed_text": {
        "subject": "ASDFJK СТРАНЫ БЫЛИ КТО СЛУШАЕТ МЕНЯФ1🇰🇿🟠🟡🟢НЗЕВИ Л К ВО УНТВА854У32Й1🟣🟤🟥🟧🟨🟩🟦🟪🟫"
      }
    },
    "warnings": [
      "param 'text' truncated to 140 characters"
    ],
    "engine": {
      "name": "Pelias",
      "author": "Mapzen",
      "version": "1.0"
    },
    "timestamp": 1635525115912
  },
  "type": "FeatureCollection",
  "features": []
}

The really strange part is that I've been having trouble coming up exactly what is causing the problem. I've tried taking shorter subsets of the input, they all work. I can make the query work by removing only the last URLEncoded character (%A0, corresponding to a non-breaking space). But a simple query like foo%A0 works fine, so it's not just that character.

Ideally we would figure out what part of this input is causing issues and, assuming it's truly invalid in some way, either prune out the invalid characters and return an HTTP 200, or return an HTTP 400 if we can't reasonably do that.

missinglink commented 2 years ago

Could this possibly due to how your terminal is handling the encoding? I find the way my Mac handles copy->pasting URLs unintuitive sometimes compared to Linux.

You probably tried this already, but does the behaviour change when you wrap your url in single quotes on the CLI?

orangejulius commented 2 years ago

Okay, we think we tracked down the problem here. It looks like, because the code we use to trim the text parameter is not unicode aware, it can trim the input in the middle of UTF-8 multibyte characters:

https://github.com/pelias/api/blob/130da3253b379b2ced5e4906e097fcbe1f32dfa2/sanitizer/_text.js#L26

There are packages like unicode-substring that can handle this in a safer way, and we might end up using one of those.

Some other details:

missinglink commented 2 years ago

it seems that this bug is the cause of the issue.

orangejulius commented 2 years ago

I can confirm that https://github.com/pelias/api/pull/1577 fixed this and the query shown above no longer causes a 500 error with API version v5.34.0.

Also, the issue with browsers modifying the request seems to be limited to Firefox only. Chrome was able to reproduce the error just fine. Interesting.