implydata / plywood

A toolkit for querying and interacting with Big Data
https://plywood.imply.io
Apache License 2.0

Add additional filter on top of Expression #163

Open erankor opened 7 years ago

erankor commented 7 years ago

Hi all,

I have a question that is hopefully a simple one. I have an Expression object (that may or may not have a filter in it), and I would like to add another filter on top of it that will be AND'ed with any existing filter. The context: I'd like to patch Swiv to add a server-side filter that limits the scope of each query to what the specific user is allowed to see. I tried several things; my best guess was to add:

ex = ex.filter('$myField == "someValue"');

before the call to compute, but that fails with Error: could not resolve $myField (even though the field exists on all my Druid data sources). Any guidance would be appreciated.

Thank you

Eran

erankor commented 7 years ago

Following some work by @esakal, we managed to get it working, but would appreciate some feedback on whether this is the right approach or maybe there's a simpler solution.

We defined a new interface for an Executor with context:

export interface ExtendableExecutor {
  (ex: Expression, env?: Environment, context?: { filters: {name: string, value: any }[]}): Q.Promise<PlywoodValue>;
}

The new executor clones the DruidExternal datasets, sets the filter of each one, and then calls ex.compute (as basicExecutorFactory does):

export interface DynamicFilterExecutorParameters {
  datasets: Datum;
}

export function dynamicFilterExecuterFactory(parameters: DynamicFilterExecutorParameters): ExtendableExecutor {
  var datasets = parameters.datasets;
  return function (ex, env, custom) {

    if (env === void 0) { env = {}; }

    let filteredDatasets: { [key: string]: any} = null;

    if (custom && custom.filters && custom.filters.length) {
      filteredDatasets = {};

      if (custom.filters.length > 1) {
        throw new Error('temporarily support only one dynamic filter');
      }

      const [ filter ] = custom.filters;

      for (var k in datasets) {
        if (!datasets.hasOwnProperty(k)) {
          continue;
        }
        const dataset = datasets[k];
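        if (dataset instanceof External) {
          // Clone first so the shared dataset definition is never mutated.
          filteredDatasets[k] = cloneDruidExternal(dataset);
          // This assignment replaces any filter already set on the clone.
          // filter.value is interpolated as-is, so a value containing a double quote breaks the expression.
          filteredDatasets[k].filter = Expression.fromJSLoose(`$${filter.name} == "${filter.value}"`);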
        } else {
          throw new Error('temporarily support external data cubes only');
        }
      }
    } else {
      filteredDatasets = datasets;
    }

    return ex.compute(filteredDatasets, env);
  };

  function cloneDruidExternal(external: External): External {

    if (!(external instanceof DruidExternal)) {
      throw new Error('temporarily support dynamic filter for druid only');
    }

    const clonedExternal = new DruidExternal(external.valueOf());

    return clonedExternal;
  }
}
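
The factory above rejects more than one dynamic filter. Here is a minimal sketch of how several filters could be folded into a single expression string before it is handed to `Expression.fromJSLoose`. It assumes the expression parser accepts `and` between comparisons and JSON-style double-quoted string literals; neither is confirmed in this thread. `DynamicFilter` and `buildFilterExpression` are hypothetical names, and the field names are made up.

```typescript
// Hypothetical shape, mirroring the `filters` entries in the context above.
interface DynamicFilter {
  name: string;
  value: string | number | boolean;
}

// Builds one expression string that ANDs all filters together.
// JSON.stringify quotes and escapes string values, so a value containing
// a double quote does not terminate the literal early.
// Assumption: the parser understands `and` and JSON-style string escapes.
function buildFilterExpression(filters: DynamicFilter[]): string {
  if (filters.length === 0) {
    throw new Error('at least one filter is required');
  }
  return filters
    .map((f) => `$${f.name} == ${JSON.stringify(f.value)}`)
    .join(' and ');
}

const expression = buildFilterExpression([
  { name: 'accountId', value: '42' },
  { name: 'region', value: 'eu' },
]);
console.log(expression);
```

Building the string in one place keeps the quoting rule in a single function instead of spreading template literals through the executor.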

Thanks

Eran

erankor commented 7 years ago

Ping

bwestergard commented 6 years ago

@erankor I'm doing something quite similar. Did you find a solution that worked for you?

erankor commented 6 years ago

@bwestergard, what I did in the end was to write a proxy (in PHP) that sits between Swiv and Druid. This proxy gets the queries built by Plywood, analyzes them, and modifies them as needed. Besides adding the filtering logic that I asked about here, it also performs all sorts of manipulations on the data. For example, in our system we index video ids to Druid; when the proxy sees a query that splits by video id, it loads the names of the videos from our operational database. This way, the user sees the video name in Swiv in addition to the id, as if it were indexed to Druid. The only thing I had to change in Swiv/Plywood to make this solution work was to propagate the context of the request down to the Druid query, so that the proxy knows who the user is and can filter accordingly.
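
The filtering step of such a proxy can be sketched as follows. This is not the PHP proxy described above, only an illustration in TypeScript of wrapping whatever filter a Druid native query already carries in an `and` filter together with a `selector` restriction. The types are minimal stand-ins, and `restrictQuery`, the data source name, and the dimension names are hypothetical.

```typescript
// Minimal stand-ins for the parts of a Druid native query used here.
interface DruidFilter {
  type: string;
  [key: string]: unknown;
}

interface DruidQuery {
  queryType: string;
  dataSource: string;
  filter?: DruidFilter;
  [key: string]: unknown;
}

// Returns a copy of the query whose filter is
// (dimension == value) AND (original filter, if there was one).
function restrictQuery(query: DruidQuery, dimension: string, value: string): DruidQuery {
  const restriction: DruidFilter = { type: 'selector', dimension, value };
  const filter: DruidFilter = query.filter
    ? { type: 'and', fields: [restriction, query.filter] }
    : restriction;
  return { ...query, filter };
}

// Example: a query that already filters by country gets the account restriction added.
const restricted = restrictQuery(
  {
    queryType: 'timeseries',
    dataSource: 'events',
    filter: { type: 'selector', dimension: 'country', value: 'US' },
  },
  'accountId',
  '42',
);
console.log(JSON.stringify(restricted.filter));
```

Because the original filter is kept as a child of the `and` node, the restriction can only narrow the result set, never widen it.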

bwestergard commented 6 years ago

@erankor Interesting. Is the code for the PHP proxy available anywhere?

It would be nice if Plywood provided some facility for pattern matching within expressions for cases such as this ("Add a filter within every Sum(Apply(...))"). Perhaps it does, but I've yet to find it.

erankor commented 6 years ago

@bwestergard, the code is in a private repo, so I'm attaching it here with its parameters removed. The script is very specific to our application and you certainly won't be able to use it as is, but maybe it will help as a reference.

druidProxy.php.txt

At a high level, the proxy performs these tasks:

  1. Add a filter limiting the query to a specific account, according to the logged-in user (the script gets it via a custom header from Swiv)
  2. Enrich the Druid response with names of objects from our database
  3. Perform search on database objects by name (the reverse of the previous bullet); the search is done either using pre-made text files (for small sets) or using Elasticsearch
  4. Return geo-location coordinates so that countries/cities can be displayed on a map
  5. Modify a couple of Plywood queries that make use of JavaScript (I didn't want to enable JavaScript on our production Druid cluster)
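
As an illustration of task 2 only, here is a minimal TypeScript sketch of enriching groupBy-style result rows with names. It is not taken from the attached script: the in-memory `Map` stands in for the database lookup, and `enrichRows` and the field names `videoId` and `videoName` are hypothetical.

```typescript
// Shape of one row in a Druid groupBy response.
interface GroupByRow {
  version: string;
  timestamp: string;
  event: Record<string, unknown>;
}

// Adds `nameField` to every row whose `idField` value has a known name.
// Rows with unknown or missing ids are returned unchanged.
function enrichRows(
  rows: GroupByRow[],
  idField: string,
  nameField: string,
  names: Map<string, string>, // stand-in for a database lookup
): GroupByRow[] {
  return rows.map((row) => {
    const id = row.event[idField];
    const name = typeof id === 'string' ? names.get(id) : undefined;
    return name === undefined
      ? row
      : { ...row, event: { ...row.event, [nameField]: name } };
  });
}

const enriched = enrichRows(
  [
    { version: 'v1', timestamp: '2018-01-01T00:00:00.000Z', event: { videoId: 'a1', plays: 10 } },
    { version: 'v1', timestamp: '2018-01-01T00:00:00.000Z', event: { videoId: 'zz', plays: 3 } },
  ],
  'videoId',
  'videoName',
  new Map([['a1', 'Intro video']]),
);
console.log(enriched.map((row) => row.event));
```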

Good luck,

Eran

bwestergard commented 6 years ago

@erankor Thanks, this should be helpful!

Laboltus commented 5 years ago

@erankor We use a similar approach with a proxy, but we solved the user-rights issue by brute force: we just created a separate Docker container for each user. Could you share your Swiv patch? It looks like a more elegant solution.

erankor commented 5 years ago

The patch is here: https://github.com/kaltura/swiv/blob/master/kaltura/swiv.patch. The idea was to use the existing environment object to piggyback any context information we need. We set the context here: https://github.com/kaltura/swiv/blob/master/src/server/routes/plywood/plywood.ts#L50. The Plywood patch then makes it available to our custom request decorator.
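
The general shape of that idea, sketched with stand-in types rather than the actual Swiv or Plywood interfaces: the server attaches a context to the environment object, and a decorator copies it into a custom header that the proxy can read. Every name here (`RequestContext`, `QueryEnvironment`, `decorateRequest`, the `X-Swiv-User` header, the URL) is hypothetical and not taken from the patch.

```typescript
// Hypothetical per-request context carried alongside the query environment.
interface RequestContext {
  userId: string;
}

// Stand-in for the environment object passed down with each query.
interface QueryEnvironment {
  context?: RequestContext;
  [key: string]: unknown;
}

// Stand-in for an outgoing HTTP request to Druid (or to a proxy in front of it).
interface OutgoingRequest {
  url: string;
  headers: Record<string, string>;
  body: unknown;
}

// Copies the user id from the environment into a custom header.
// The header name is made up for this example.
function decorateRequest(request: OutgoingRequest, env: QueryEnvironment): OutgoingRequest {
  if (!env.context) {
    return request;
  }
  return {
    ...request,
    headers: { ...request.headers, 'X-Swiv-User': env.context.userId },
  };
}

const decorated = decorateRequest(
  { url: 'http://druid.example/druid/v2', headers: { 'Content-Type': 'application/json' }, body: {} },
  { context: { userId: 'user-17' } },
);
console.log(decorated.headers);
```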

Laboltus commented 5 years ago

Thank you. That was very helpful.