pat / riddle

Ruby Client API for Sphinx
MIT License
135 stars 67 forks source link

SQL escaping for snippets #71

Closed epidemian closed 11 years ago

epidemian commented 11 years ago

This is more or less what i had in mind for issue #63 (and #70).

I think trying to make Riddle::Query.escape do both MySQL escaping and SphinxQL escaping is a bit confusing, so i used another function for escaping the snippets text, which doesn't need to be SphinxQL escaped.

Now, i'm quite confused about Riddle's code at many points. Some things i'd like to understand better:

So, i hope it's clear that i don't have a good grasp of Riddle's inner workings. That being said, shouldn't Riddle escape all strings that come "from the outside" before using them as string is SQL queries? (both to avoid SQL syntax erros and potential SQL injections) And also apply SphinxQL escaping only to the strings that will be used in MATCH statements before the SQL escaping, or maybe leave that to the user so they can decide if they want extended matching or not (note that if you Mysql2::Client.escape(Riddle.escape('@')) then you get that double backslash that allows Sphinx to read the @ as a literal).

pat commented 11 years ago

What's the specific purpose of Riddle::Query.escape? In what ways does it differ from Riddle.escape?

Riddle::Query.escape is for SphinxQL/mysql41 queries, whereas Riddle.escape is for Sphinx's older binary protocol.

Why does Riddle::Query.escape adds double backslashes for special SphinsQL characters? Shouldn't string.gsub(/[()|-!@~"&\/^\$=]/) { |m| "#{m}" } be enough (based on PHP's API)?

The PHP API is for the binary protocol.

Why does Riddle::Query.escape remove all backslashes from its input?

To avoid partially/badly escaped input. I'm open to discussion on this, perhaps it's not the right approach...

Why is there a separate Riddle::Client#excerpts function with its own separate specs? What's the difference between that and Riddle::Query.snippets and why does ThinkingSphinx use the latter?

Again, the different protocols - Riddle::Client#excerpts is for generating excerpts over the binary protocol, Riddle::Query.snippets generates a SphinxQL query to essentially do the same thing over the SphinxQL/mysql41 protocol.

With all that said, your patch is fine and I'll merge it in. Perhaps everything except MATCH strings should be escaped by default... I'm certainly open to that idea, though would be keen to see where/how it impacts Riddle/TS code.

epidemian commented 11 years ago

Nice, thanks for the pointers (and the merge!), @pat; there was a whole forest i was not aware of (Sphinx' binary protocol).

Perhaps everything except MATCH strings should be escaped by default...

Maybe the MATCH strings could be SQL escaped at the Riddle level (so syntax errors and possible SQL injections are prevented) and let the user decide whether to SphinxQL-escape the query or not.

Something that is not very clear to me from reading the docs: is the expected usage of ThinkingSphinx to always use Riddle.escape to sanitize user input, or should that be used only if one doesn't want to support extended matching syntax? I'd guess it's the later, but i'm not sure.

SomeModel.search params[:query] # Search with extended matching syntax (i.e. "", @, etc).
SomeModel.search Riddle.escape(params[:query]) # Prevent extended matching syntax magic.

In the case of our application, having things like exact phrase match with "" is quite convenient, so we don't use Riddle.escape on the user's input, but i'm not sure if we're letting ourselves open for injection attacks or things of that sort :stuck_out_tongue_closed_eyes:

pat commented 11 years ago

I guess injection is now an option with the SphinxQL protocol - it certainly isn't with the old binary protocol (which I should mention is what TS v1/v2 use), because you can't update records in that manner. Nor can you with SQL-backed indexes, only real-time indices, but I do think you could be onto something with MySQL-escaping all query values.

And just to be clear: you've used Riddle.escape in your comment, but that's for the binary protocol. You meant Riddle::Query.escape, right? :)

The expected behaviour is developers should escape any queries where they feel users may enter in invalid query strings - but yes, then we come down to whether extended matching syntax is 'invalid' or not. To some extent that's contextual - hence the possible value in using Mysql2::Client.escape always to be sure that injection isn't occurring.

epidemian commented 11 years ago

And just to be clear: you've used Riddle.escape in your comment, but that's for the binary protocol. You meant Riddle::Query.escape, right? :)

Yes, probably (now i'm checking the implementations once again he he).

Yeah, Riddle::Query.escape is what should be used for Sphinx > v3. But assuming that the underlying implementation of ThinkingSphinx.search changes to do SQL-escaping on its input then the SphinxQL-escaping method provided to the user should do something more similar to Riddle.escape, i.e. escaping things like " or @ with a single backslash instead of two, right?

pat commented 11 years ago

I'm not following... are you suggesting that when queries are always escaped it should use Riddle.escape? I was thinking just Mysql2::Client.escape.

Beyond that, I think Riddle::Query.escape handles the rest - as SphinxQL expects @ and other such characters to be escaped with two backslashes, not one.

epidemian commented 11 years ago

Hmm... i wasn't very clear it seems :sweat_smile: (consider though that i'm not very familiarized with these things... so this is more of an exploratory question than a suggestion)

are you suggesting that when queries are always escaped it should use Riddle.escape? I was thinking just Mysql2::Client.escape.

What i meant was to always use Mysql2::Client.escape when building SQL strings from external input (in the Riddle code) and providing a function that does one-backslash escaping of SphinxQL characters (i.e. what Riddle.escape does) to let the user decide if they want to SqphinxQL-escape their queries first (before going through the Mysql2::Client.escape'ing in the Riddle code).

Beyond that, I think Riddle::Query.escape handles the rest - as SphinxQL expects @ and other such characters to be escaped with two backslashes, not one.

I think it's not so much that SphinxQL requires a doubly-backslashed @. It's that when two backslashes are used in a MySQL string then MySQL will interpret that as a single backslash. So we need two backslashes in the SQL query to make SphinxQL see only one:

mysql> SELECT 'hello\@world';
+-------------+
| hello@world |
+-------------+
| hello@world |
+-------------+
1 row in set (0.00 sec)

mysql> SELECT 'hello\\@world';
+--------------+
| hello\@world |
+--------------+
| hello\@world |
+--------------+
1 row in set (0.00 sec)

The things is, if we always use Mysql2::Client.escape when building SQL strings (e.g. somewhere under ThinkingSphinx.search hood) then that will take care of converting single backslashes into two. '@' -> Riddle.escape -> '\@' -> Mysql2::Client.escape -> '\\@'.

I think the confusing thing with this escaping stuff is the nesting level of escapes. In Ruby code, four backslashes in a string mean two "real" backslashes in the string itself; and when that string is interpreted as a MySQL string, MySQL will interpret it as a single backslash, which is what Sphinx will end up with (i don't know if i said it correctly xD)

pat commented 11 years ago

Okay, I'm understanding this now - I think in my head I had Riddle::Query.escape and Mysql2::Client.escape around the wrong way when passing through input.

Try this on for size: d0da3fa9a0c2d1eb73e18653f329e340b6c2c053.

epidemian commented 11 years ago

Wow, that commit changes things around! I think it looks good (i'll try to give it a spin at work after the weekend and see if something breaks).

Thanks!