parisholley / wordpress-fantastic-elasticsearch

Improve wordpress search performance/accuracy and enable faceted search by leveraging an ElasticSearch server.
MIT License
162 stars 64 forks source link

Revisiting empty-search issue #85

Closed deltamualpha closed 10 years ago

deltamualpha commented 10 years ago

I had asked a question related to this in #48, but figured I'd open it up a little more now: should it be the business of fantastic-elasticsearch to prevent things it 'knows' are invalid queries from being sent to the elasticsearch server? I see searches for things like " " and "faaad!!!!!!!!!!!!!!!!!!!@@@@@@@@@@@@@@@######################$#$$$$Q%%%%%%%%%%%^^^^^^^^^" (unfocused attack attempt?) in my elasticsearch error logs, and worry both that there's an opening here for carefully-crafted requests as well as simple performance degradation on the elasticsearch side. (The errors are Unexpected EOF and "Encountered " "! "", respectively.)

parisholley commented 10 years ago

I think its fair that we build some safe guards in place. Is there some best practices for sanitizing data for elasticsearch?

deltamualpha commented 10 years ago

There's nothing in particular in the Elasticsearch docs about this (I looked). http://elasticsearch-users.115913.n3.nabble.com/Illegal-Characters-td1407302.html suggests that the same invalid characters as Lucerne should be escaped in a query (see also http://elasticsearch-users.115913.n3.nabble.com/Exclamation-point-breaks-field-queries-td3678504.html.)

deltamualpha commented 10 years ago

Where would it make sense to do this work? src/elasticsearch/Searcher.php?

parisholley commented 10 years ago

yep

deltamualpha commented 10 years ago

Note that http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#_reserved_characters now exists. Working on implementing.

parisholley commented 10 years ago

after reviewing that link, I dont think blindly escaping is appropriate. if I want to use boolen searches, escaping would turn it off completely. perhaps it could be an option to auto escape in the admin? or maybe just have more complicated regex rules...