minimaxir / big-list-of-naughty-strings

The Big List of Naughty Strings is a list of strings which have a high probability of causing issues when used as user-input data.

JSON cannot represent some naughty strings #20

Open ekimekim opened 9 years ago

ekimekim commented 9 years ago

One of my favourite naughty strings is invalid UTF-8, for example a bare \xff. It's quite common to get 500s and the like on these, since no one ever bothers to check for Unicode decoding errors. However, because JSON requires all strings to be valid UTF-8, this example can only be included in the .txt file.
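
A minimal sketch of the failure mode, assuming Python 3's default strict UTF-8 decoding:

# A bare 0xFF byte can never appear in well-formed UTF-8, so strict
# decoding raises immediately; unguarded, this is where the 500s come from.
raw = b"\xff"
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte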

Would this be worth including, with a special case in the script to omit it from the JSON? Or is it too naughty for blns?

EDIT: This HN comment (https://news.ycombinator.com/item?id=10035738) suggested having the JSON file contain base64-encoded strings. This is a good suggestion, and allows arbitrary naughty bytes to be used, at the cost of readability.
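
For illustration, a sketch of the round trip, assuming Python's standard base64 and json modules:

import base64
import json

# base64 turns arbitrary bytes, valid UTF-8 or not, into plain ASCII,
# so they can travel inside an ordinary JSON string.
naughty = b"\xff\xfe not valid utf-8"
doc = json.dumps([base64.b64encode(naughty).decode("ascii")])

# Consumers decode back to the original raw bytes.
assert base64.b64decode(json.loads(doc)[0]) == naughty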

minimaxir commented 9 years ago

I am open to a separate .json file for b64 strings. I'll look into it today.

floyd-may commented 9 years ago

Bear in mind that base 64 represents bytes, not strings, and that strings always have an encoding.

"It does not make sense to have a string without knowing what encoding it uses."

There is no such thing as an invalid byte sequence; a given sequence merely may or may not conform to a particular encoding. If you go the route of encoding the strings as base64, you may need to consider encoding each string in a few different ways: say, ASCII where applicable, UTF-8, and UTF-16 (maybe also ISO 8859-1 (Latin-1) and Windows-1252 (Western European)?).
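
A sketch of why that matters, assuming Python's codec names for the encodings mentioned above:

# The same text maps to a different byte sequence per encoding, and
# some encodings can't represent it at all.
text = "naïve"
for enc in ("ascii", "utf-8", "utf-16", "latin-1", "cp1252"):
    try:
        print(enc, text.encode(enc))
    except UnicodeEncodeError:
        print(enc, "cannot represent this string")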

ekimekim commented 9 years ago

Yes, that's my point. Many systems assume all input is in a particular encoding (commonly UTF-8) and may break when it isn't. However, JSON cannot represent arbitrary byte content without some extra layer of encoding such as base64. I'll leave the bytes/strings distinction aside, as it is an entirely semantic argument.

floyd-may commented 9 years ago

I disagree that the bytes/strings distinction is entirely semantic. Most systems should treat invalidly encoded bytes differently from pathological (but valid) strings: reject the former, handle the latter properly. My vote would be to decorate each string sample with its encoding (or null if it isn't valid text at all). That way, invalid data can easily be told apart from merely unusual data. For example:

[
    { "data": "<base 64 encoded stuff>", "encoding": "ASCII" },
    { "data": "<base 64 encoded stuff>", "encoding": "UTF-16" },
    { "data": "<base 64 encoded stuff>", "encoding": "UTF-8" },
    { "data": "<base 64 encoded stuff>", "encoding": null },
    // naturally, lots and lots more
]
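
A sketch of how a consumer might read such a file, assuming the hypothetical schema above and a hypothetical file name blns.b64.json:

import base64
import json

# Hypothetical loader for the schema sketched above: entries with a
# declared encoding decode to text; encoding == null stays raw bytes,
# flagging deliberately invalid data.
def load_samples(path="blns.b64.json"):
    with open(path) as f:
        for entry in json.load(f):
            raw = base64.b64decode(entry["data"])
            if entry["encoding"] is None:
                yield raw
            else:
                yield raw.decode(entry["encoding"])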

suy commented 8 years ago

I don't see the problem with JSON being limited to valid text when the text file is already limited to valid text. A byte equal to 0 is already forbidden there, to say nothing of the interesting combinations of broken UTF-8 or UTF-16 it would be nice to have.

Of course, that would require introducing some sort of escaping, which in turn would require adjusting the file to the new format. I guess that if Max didn't introduce the list in that format, there were other considerations I didn't realize, right?
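
One way such an escaping scheme could look, as a hypothetical sketch (the \xNN convention is an assumption, not an existing blns format):

# Hypothetical \xNN escaping for the .txt file: the file itself stays
# valid text, but a line can still describe arbitrary raw bytes.
def unescape(line: str) -> bytes:
    # latin-1 maps code points 0-255 one-to-one onto bytes, and the
    # unicode_escape codec expands \xNN sequences in between.
    return line.encode("latin-1").decode("unicode_escape").encode("latin-1")

assert unescape(r"\xff") == b"\xff"
assert unescape("plain text") == b"plain text"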

ssokolow commented 8 years ago

It's always possible to have two lists, with the legacy-compatible subset being merged into the machine-readable version of the full list by the build process.

jfinkhaeuser commented 8 years ago

Just so that's clear: JSON no longer requires UTF-8, just valid Unicode. The UTF-8 requirement is from an older version of the JSON spec.

Doesn't change much about this topic, but it's worth noting.
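
As an aside, even "valid Unicode" has naughty corners. A sketch assuming CPython's json module, which accepts lone-surrogate escapes:

import json

# \ud800 is syntactically valid JSON, but a lone surrogate is not a
# valid Unicode scalar value, so the decoded string can't be re-encoded
# as strict UTF-8 (CPython behavior).
s = json.loads('"\\ud800"')
try:
    s.encode("utf-8")
except UnicodeEncodeError as exc:
    print(exc)  # surrogates not allowed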

timmc commented 4 years ago

Given that some software environments conflate bytes and strings, or naively assume ASCII (or well-formed UTF-8, or whatever), I think it makes perfect sense to include byte sequences in here that cause decoding issues.

I don't think specifying an encoding is necessary, but a comment explaining the point of each byte sequence would be useful, and that comment could mention the specific encoding being targeted.
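
For instance, annotated entries might look like this sketch (the structure and names are hypothetical, not part of blns):

# Hypothetical annotated byte sequences, each with a note on what it
# targets, per the suggestion above.
NAUGHTY_BYTES = [
    (b"\xff",         "0xFF can never appear in well-formed UTF-8"),
    (b"\xc0\xaf",     "overlong UTF-8 encoding of '/' (rejected by strict decoders)"),
    (b"\xed\xa0\x80", "UTF-8-style encoding of a lone surrogate, U+D800"),
]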