sysown / proxysql

High-performance MySQL proxy with a GPL license.
http://www.proxysql.com
GNU General Public License v3.0

digest text: not all values are always replaced by ? correctly #728

Open btkador opened 8 years ago

btkador commented 8 years ago

Is the digest_text supposed to look like this when a signed decimal value is inserted?

... VALUES (?,?,NOW(),?,?,?,?,?,?,?,?,?,?,?,?,X?,?,?,?,?,?,-?.?,?,?,?,?,?,?,?,?)

Instead of the expected ?, there is -?.? for that value (e.g. -12.13123). It also seems that if a value is inserted as HEX via X'<value>', it results in X?.

I'm not sure about NOW() since it's a MySQL function, but an integer value could also be sent instead, which would result in ? and therefore not match the rule.

I've also seen differences in whitespace, e.g. sometimes it's (?,?,?,?,?) and sometimes (?, ?, ?, ?, ?), which I guess results in different digest hashes.
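To illustrate the reported behavior, here is a minimal Python sketch of a tokenizer that replaces only bare runs of digits with ? (this is not ProxySQL's actual C++ tokenizer, just a toy model reproducing the observations above; the function name and regexes are hypothetical):

```python
import re

def naive_digest(query: str) -> str:
    """Toy digest tokenizer: replaces only bare digit runs with '?',
    leaving signs, decimal points and hex markers in place -- which
    reproduces the -?.? and X? artifacts reported in this issue."""
    # Collapse runs of spaces into a single space (as described below).
    q = re.sub(r" +", " ", query)
    # Hex literal X'...' -> X? (the X prefix survives).
    q = re.sub(r"X'[0-9A-Fa-f]+'", "X?", q)
    # String literal -> ?
    q = re.sub(r"'[^']*'", "?", q)
    # Bare digit run -> ?; a value like -12.13123 therefore becomes -?.?
    q = re.sub(r"\d+", "?", q)
    return q

print(naive_digest("INSERT INTO t VALUES (1, -12.13123, X'AB12')"))
# -> INSERT INTO t VALUES (?, -?.?, X?)
```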

renecannao commented 8 years ago

I am not sure if I should consider this a bug or a feature request. For example, you may want to create a rule that matches queries where negative floats are inserted: if -12.13123 were replaced with ? instead of -?.?, that wouldn't be possible anymore. The same goes for HEX.

With regard to spaces: the tokenizer replaces multiple spaces with a single space, but it won't remove spaces completely.

If the application generates queries in a way where spaces become problematic, you should probably not use digest, and instead use match_digest with a more detailed regex.

Note that adding complexity to the tokenizer will slow down the Query_Processor no matter which query it is processing; that means everything becomes slower by default. I think it is better to keep the tokenizer simple and put the complexity in regexes.
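As a sketch of the regex approach suggested above, a match_digest-style pattern can be written to tolerate both spacing variants and the -?.? / X? placeholders. The pattern below is a hypothetical example, shown in Python for testing:

```python
import re

# Hypothetical match_digest-style pattern: matches a VALUES list of
# placeholders regardless of whether the digest uses "?," or "?, ",
# and regardless of whether a value appears as ?, -?.? or X?.
pattern = re.compile(
    r"VALUES \((?:(?:-\?\.\?|X\?|\?),\s?)*(?:-\?\.\?|X\?|\?)\)"
)

for digest in ("VALUES (?,?,?)",
               "VALUES (?, ?, ?)",
               "VALUES (?,-?.?,X?)"):
    print(bool(pattern.search(digest)))
# All three variants match
```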

renecannao commented 8 years ago

Thinking out loud, this could be a feature request: define a list of match/replace regexes to apply to digest_text after the tokenizer runs. But it would be quite complex to do this right now.
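The proposed feature could look roughly like the following Python sketch: a user-defined list of (match, replace) rules applied to digest_text after tokenization. The rule list and function name are hypothetical examples, not anything ProxySQL implements:

```python
import re

# Hypothetical user-defined post-tokenizer rules that fold the
# -?.? and X? placeholders back into a plain ?, and normalize spacing.
POST_RULES = [
    (re.compile(r"-\?\.\?"), "?"),   # signed decimals: -?.? -> ?
    (re.compile(r"X\?"), "?"),       # hex literals:    X?   -> ?
    (re.compile(r", "), ","),        # normalize ", " to ","
]

def post_process(digest_text: str) -> str:
    """Apply each match/replace rule in order to the digest text."""
    for pattern, replacement in POST_RULES:
        digest_text = pattern.sub(replacement, digest_text)
    return digest_text

print(post_process("VALUES (?, -?.?, X?)"))
# -> VALUES (?,?,?)
```

This would let each deployment trade tokenizer simplicity for per-rule normalization, at the cost of running extra regexes on every digest.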

btkador commented 8 years ago

I think that makes this more of a "documentation" issue than a feature request :-)

How does the tokenizer replace values? I expected it to always replace all kinds of dynamic values with ?.