python273 / vk_api

Module for writing scripts for VKontakte | vk.com API wrapper
https://pypi.org/project/vk-api/
Apache License 2.0

Sanitize json with control characters #473

Open trifle opened 1 year ago

trifle commented 1 year ago

Hi!

I recently saw that VK seems to be really bad at sanitizing some fields such as usernames. Loading a bunch of members with vk_api yielded an ugly JSONDecodeError: Invalid control character at: line 1 column 68248 (char 68247)

Inspecting this by hand, I found that the response contains this abomination:

In [32]: e[68200:68259]
Out[32]: ',"is_closed":false},{"id":27497241,"nickname":"\x01","domain":'

Of course, <0x01> is not a codepoint that you'd want in a nickname, ever!

Since the data is nonsensical, not printable, and potentially dangerous, my suggestion would be to catch JSONDecodeError and try to sanitize the raw content before parsing it with json.loads.

There are many options for replacing the problematic characters, but a regex should be reasonably fast. As a bonus, the third-party regex module lets us match control characters as a Unicode category (\p{C}) rather than listing them by hand.
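A minimal sketch of what this fallback could look like. The character class and the helper name `loads_sanitized` are my own assumptions, not the actual patch; stdlib re has no \p{C}, so the class below approximates the C0/C1 control ranges by hand (the third-party regex module would support \p{C} directly):

```python
import json
import re

# Approximation of \p{C} with stdlib re: C0 and C1 control ranges.
# Note this also strips \t, \n, and \r, which is safe for JSON since
# whitespace between tokens is optional and unescaped control chars
# inside strings are invalid anyway.
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f-\x9f]")

def loads_sanitized(raw: str):
    """Parse JSON, stripping raw control characters only if parsing fails."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return json.loads(CONTROL_CHARS.sub("", raw))
```

With the response fragment from above, `loads_sanitized('... "nickname":"\x01" ...')` would yield an empty nickname instead of raising.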

python273 commented 1 year ago

might be better to use strict=False instead of regexp?

https://docs.python.org/3/library/json.html#json.JSONDecoder
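For reference, this is the documented behavior of the flag: with strict=False the decoder accepts unescaped control characters inside strings and keeps them in the parsed value (sample data is made up):

```python
import json

raw = '{"id": 27497241, "nickname": "\x01"}'

# The default (strict=True) raises JSONDecodeError here:
# "Invalid control character at ..."

# strict=False accepts the raw control char; it survives into the value.
data = json.loads(raw, strict=False)
print(repr(data["nickname"]))  # '\x01'
```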

trifle commented 1 year ago

Hi @python273, thanks for taking the time to look into it!

I've actually thought about using that flag: as far as I can tell, it simply lets the json decoder pass the invalid data through. But I don't think that's the best option here, because the resulting data would almost certainly cause problems downstream. Is there any valid use case where you would want a null byte or similar in a VK field to process further?

Delimiters are another special case. Having "\r" or "\n" unescaped in JSON data almost certainly means that something on the sending end is broken: any tooling that works line by line will break, and you can no longer rely on the rest of the data being sensible (where do you stop parsing? just discard everything before the newline?).

A final consideration is maintainability; from that perspective I can see why the json decoder flag is more attractive.

What do you think?

python273 commented 1 year ago

I reported the bug to VK, hopefully they just fix their JSON encoder to escape control chars ('\x01' as '\\u0001')
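Python's own json encoder already does what python273 is asking VK to do server-side, so this just illustrates the expected behavior:

```python
import json

# json.dumps escapes control characters by default, so '\x01' is
# emitted as the escape sequence \u0001 rather than a raw byte.
encoded = json.dumps({"nickname": "\x01"})
print(encoded)  # {"nickname": "\u0001"}
```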

Not sure about regexp, since it alters the content of the response, so let's see what VK replies

trifle commented 1 year ago

Thanks, that's a great idea. I've encountered way too much double- or under-encoded data, so one less source is certainly welcome. I'll keep using my fork in the meantime, so feel free to close this PR. BTW, there is still a typo: the import is named re, not rx; I'll change that.