polyfractal / sherlock

MIT License
119 stars 28 forks source link

"Filterception" - or the filterBuilder( Filter ( filterBuilder() ) issue #43

Open kwisatz opened 11 years ago

kwisatz commented 11 years ago

I've been trying to build slightly more complex queries, mostly with filters and boolean operators, but also with queries, etc.

I've run into some kind of conundrum - building "AndFilter" type filters - I don't really come to grips with. I.e. I somehow find it too complicated.

Result

What I finally want to get at is this filter:

{
  "filter": {
    "and": [
      {
        "term": {
          "tag": "university"
        }
      },
      {
        "term": {
          "country": "uk"
        }
      }
    ]
  }
}

So I've been trying a couple of alternatives, none of which are finally satisfying (although they work).

This part is common to both examples:

if( $filters ) {
   $single_filters = array();
   foreach( $filters as $type => $value ) {
      $single_filters[] = Sherlock::filterBuilder()
         ->Term()
         ->field($type)
         ->term($value);
      }
}

potential solution 1

I actually consider this more of a workaround. But it doesn't require a modification of the Sherlock lib:

$and = array();
foreach( $single_filters as $filter ){
   $and[] = $filter->toArray();
}

$sh_filter = Sherlock::filterBuilder()
   ->AndFilter()
   ->and( $and );
   //->and( array( 'filters' => $and ) );  // this works too and is necessary for _cache=true

potential solution 2

This solution would require patching BaseComponent::__call()

+   if( strtolower($name) == 'and'){
+       $this->params[$name][] = $arguments[0]->toArray();
+   } else {
        $this->params[$name] = $arguments[0];
+   }

It would allow the code to look like this:

$andWrapper = Sherlock::filterBuilder()
    ->AndFilter();

array_walk( $single_filters, function(&$single, $key) use($andWrapper) {
    $andWrapper->and($single);
});

// Or, simpler ;)

foreach( $single_filters as $single ) {
    $andWrapper->and($single);
}

Which isn't shorter, I must admit, but it would also allow simpler constructs, such as

$sh_filter = Sherlock::filterBuilder()
    ->AndFilter()
        ->and( Sherlock::filterBuilder()->... )
        ->and( Sherlock::filterBuilder()->... );

So, what say ye?

polyfractal commented 11 years ago

Hm, yes, I see what you mean. The current syntax is ugly when you have multiple Ands. Your second solution is a much nicer interface, and much more in the spirit of "fluent english".

It raises the question if this convention should be applied elsewhere. Obviously Or/Not should get the same treatment. Probably Bool too:

Sherlock::filterBuilder()->BoolFilter()
       ->should( Sherlock::filterBuilder()->... )
       ->should( Sherlock::filterBuilder()->... )
       ->must( Sherlock::filterBuilder()->... )
       ->must_not( Sherlock::filterBuilder()->... )

But I can see where this convention potentially conflicts elsewhere. In most places, Sherlock allows you to provide argument lists:

$request->index("test")
        ->type("tweet", "tweet2", "tweet3", ... )

Should the proposed syntax be extended there as well? Does it make sense to do:

$request->index("test")
        ->type("tweet")
        ->type("tweet2")
        ->type("tweet3")       

It makes the syntax much more verbose, but could also make things more consistent if everything has the same style.

There is also the option of supporting all three types of syntax ("fully fluent", array, argument list). Thoughts?

polyfractal commented 11 years ago

Unrelated, but you should probably be using a Bool instead of an And for those Term filters...it will be more performant :)

kwisatz commented 11 years ago

Oh, I'm still a total noob regarding ES. Thanks for the hint!

I've been reading up on your blog about tokenization and analysis late this night and have been working on that.

polyfractal commented 11 years ago

Np, most people don't know about the Bool vs And/Or/Not difference. It is a very common mistake :)

If you are curious about when to use one vs the other, I just recently wrote a post about it: http://euphonious-intuition.com/2013/05/all-about-elasticsearch-filter-bitsets/

kwisatz commented 11 years ago

So... I've been thinking and, yeah, this is a difficult decision. I guess trying to be as near as possible to fluent English will always result in more verbosity and quite possibly more code to maintain.

Let's be honest. At the current state of the project, with still very little documentation, it's difficult do to stuff without diving into the codebase (I'm not sure, is your documentation website a github.io repository?) and with an interface as complex as elasticsearch's, it will be difficult - if not impossible - to ever create an interface that will be very intuitive. ( "Do I need to call AndFilter() and then FilterBuilder() and then And() or the other way around?" )

I mean, the and() solution I suggested above isn't even very intuitive or "natural". It would be more natural to write it like this:

$sh_filter = Sherlock::filterBuilder()
    ->AndFilter()
    ->Term()
    ->field()
    ->value()
    ->Term()
    ->field()
    ->value()

// Or even
$sh_filter = Sherlock::filterBuilder()
    ->Term()
    ->field()
    ->value()
    ->And()
    ->Term()
    ->field()
    ->value()

But it's also very little structured, somewhat confusing and probably difficult to implement. So I would stick to the AndFilter()->And()->And() syntax.

Regarding the index()->type() interface, I would probably favor something like this:


// variant 1 - single type
$request->index('myIndex')->type('myType')->query( $myQuery );

// variant 2 - multiple types, same interface as my suggestion for And() 
$request->index('myIndex')
    ->type('myType1')
    ->type('myType2')
    ->query( $myQuery );

// variant 3 - multiple types, note the plural form of type
$request->index('myIndex')->types( 
    array( 
        'myType1', 
        'myType2', 
        'myType3'
    )
)->query( $myQuery );

Why?

Well, this allows you to write much simpler code inside IndexRequest.php (btw, you've added $type as an expected argument to type() but never use it inside the function's body) and do some type checking:

// variant 1 & 2
public function type($type)
{
    if( is_string( $type ) ) {
        $this->params['type'][] = $type;
    } else {
        # code ... maybe throw BadMethodCallException
    }
}

// variant 3
public function types( array $types )
{
    $this->params['type'] = array_merge( $this->params['type'], $types);
}

Using array_merge inside IndexRequest::types() would even allow one to do something weird like this:

$request->index('myIndex')
    ->type('myType1')
    ->types( 
        array( 
            'myType2', 
            'myType3', 
            'myType1'  // conscious choice to add myType1 again
        )
    )->...

Variant 2 would again allow the same programmatic construction using array walk and a closure as shown above with the And() filter component, but also be available to people who like typing ;)

polyfractal commented 11 years ago

AndFilter()->and()->and()

Awesome, thanks for the input. I have a whole notebook of notes and interfaces sketched out....its nice to bounce ideas off someone else now!

I agree the first example is too unstructured, it would be very hard for a programmer to know what order to chain methods. It also would make IDE autocomplete impossible - the AndFilter object would need to have all the Filters and all their properties availabe, which would make autocomplete just about useless.

index()->type() interface

Love it. Singular vs plural is a great way to separate the two...no idea why that didn't occur to me. Thanks!

Nested filters/queries

I've been thinking about the whole nested filter/query situation. Several queries allow nesting, but the Bool/And/Or/Not crowd gets used the most. Nesting several queries together with Bool is is arguably the most common operation.

I think Sherlock should be smart enough to roll this into the default behavior:

$request->index('myIndex')
    ->type('myType1')
    ->and($query1)
    ->and($query2)
    ->and($filter1)    //notice that filters can be specified too
    ->not($query3);

Sherlock would automatically A) wrap these in a big And filter, B) apply Bool vs And/Or/Not depending on the filter type (bitset or non-bitset) and C) separate queries from filters in to their appropriate location.

This would then introduce a new option to select the "default wrapping operation":

$request->index('myIndex')
    ->type('myType1')
    ->and($query1)
    ->and($query2)
    ->and($filter1)
    ->not($query3)
    ->default_operator('or');

Which would then wrap everything in an Or instead of an And. You could still do arbitrary nesting on your own:

$boolQuery = Sherlock::queryBuilder()->Bool()-> [...]

$request->index('myIndex')
    ->type('myType1')
    ->and($boolQuery)

And you would still have the option to specify an entire, arbitrary query:

$request->index('myIndex')
    ->type('myType1')
    ->query($query1)

Thoughts? It makes the interface a bit more "magic", but simplifies a common operation in Elasticsearch.

Documentation etc.

Totally hear you on that point. Lack of documentation is part lack of time, and partially because I don't want to write a bunch of docs and then rewrite them if/when the interface changes. Also partially to keep people from using Sherlock too heavily until it settles down into something more stable :)

I'm gearing up for a partial rewrite of some core internals. I don't like the structure at all, now that I've fleshed out most of the functionality. The current code is going to be hard to write good unit tests for (strongly coupled), and I've already run into a few places where extending the codebase is more difficult than it should be.

Let's just say that I made more than a few bad decisions on the 0.1 branch, and would prefer to change things while the library is still young :)

The docs are hosted on my server, but rigged up to a repo with a git hook. Feel free to fork the repo and make some additions - it uses Phrozn (php jekyll clone) with twig/markdown/html formatting.

kwisatz commented 11 years ago

Interface

I've looked into the DeleteDocumentRequest class today and it has a similar interface with document() and documents() methods.

Documentation

I'll gladly supply some documentation as I go along discovering how this thing works ;)