maggienj / ActiveData

Provide high speed filtering and aggregation over data
Mozilla Public License 2.0

Fix - test_time_domain.TestTime.test_time2_variables #43

Closed maggienj closed 7 years ago

maggienj commented 7 years ago

Size 0 was changed for the meta.py issue; this issue is a follow-up to that one.

maggienj commented 7 years ago

Error: "Aggregator [_match] of type [value_count] cannot accept sub-aggregations". The sub-aggregation in question is "aggs": { "v": { "extended_stats": { "field": "v" } } }

Since we already have "value_count", I am not sure whether we still need this "extended_stats".

{
  "aggs": {
    "_match": {
      "aggs": {
        "v": {
          "extended_stats": {
            "field": "v"
          }
        }
      },
      "value_count": {
        "field": "a"
      }
    },
    "_missing": {
      "aggs": {
        "v": {
          "extended_stats": {
            "field": "v"
          }
        }
      },
      "filter": {
        "or": [
          {
            "missing": {
              "field": "a"
            }
          },
          {
            "not": {
              "terms": {
                "a": [
                  "x",
                  "y"
                ]
              }
            }
          }
        ]
      }
    }
  }
}

{

    "error": {
        "root_cause": [
            {
                "type": "aggregation_initialization_exception",
                "reason": "Aggregator [_match] of type [value_count] cannot accept sub-aggregations"
            }
        ],
        "type": "aggregation_initialization_exception",
        "reason": "Aggregator [_match] of type [value_count] cannot accept sub-aggregations"
    },
    "status": 500

}
maggienj commented 7 years ago

If the sub-aggregation ("extended_stats") is removed from "_match", the sub-aggregations error is no longer raised.

{
  "aggs": {
    "_match": {
      "value_count": {
        "field": "a"
      }
    },
    "_missing": {
      "aggs": {
        "v": {
          "extended_stats": {
            "field": "v"
          }
        }
      },
      "filter": {
        "or": [
          {
            "missing": {
              "field": "a"
            }
          },
          {
            "not": {
              "terms": {
                "a": [
                  "x",
                  "y"
                ]
              }
            }
          }
        ]
      }
    }
  }
}

The query above doesn't raise the sub-aggregations error in ES head (it does raise a different error). I am not sure whether we need to put an if/else condition at the place where "extended_stats" is added.

klahnakoski commented 7 years ago

It may work, but I am concerned that the response of Elasticsearch is not what the aggs_iterator is expecting: aggs_iterator expects an inner _match for each edge in the ActiveData query, plus an aggs for each select. By removing one of those inner objects there is a mismatch.

Maybe value_count is wrong: Try the filter aggregation; it allows sub-aggregations.
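A minimal sketch of that suggestion, using plain Python dicts rather than ActiveData's own query builder. The `exists` filter on field `a` is an assumption chosen only for illustration, and `find_invalid_sub_aggs` is a hypothetical helper that encodes the rule from the error message (metric aggregations such as `value_count` cannot carry sub-aggregations, bucket aggregations such as `filter` can).

```python
# Metric aggregations return a single value, so Elasticsearch rejects
# sub-aggregations on them. This set is illustrative, not exhaustive.
METRIC_AGGS = {"value_count", "sum", "avg", "min", "max", "extended_stats"}


def find_invalid_sub_aggs(aggs, path=""):
    """Return the paths of metric aggregations that carry sub-aggregations."""
    bad = []
    for name, body in aggs.items():
        here = path + "/" + name
        sub = body.get("aggs", {})
        if sub and any(key in METRIC_AGGS for key in body):
            bad.append(here)
        bad.extend(find_invalid_sub_aggs(sub, here))
    return bad


sub_aggs = {"v": {"extended_stats": {"field": "v"}}}

# The rejected shape: value_count with a nested "aggs".
rejected = {"_match": {"value_count": {"field": "a"}, "aggs": sub_aggs}}

# Same nesting, but "_match" is now a filter (bucket) aggregation.
# ASSUMPTION: an "exists" filter on "a" stands in for whatever filter is needed.
accepted = {"_match": {"filter": {"exists": {"field": "a"}}, "aggs": sub_aggs}}

print(find_invalid_sub_aggs(rejected))
print(find_invalid_sub_aggs(accepted))
```

Keeping `_match` as a bucket with an inner `aggs` also preserves the nesting that aggs_iterator expects, which is the concern raised above.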

maggienj commented 7 years ago

In aggs.py, in the es_aggsop function, there is a section that handles the different stats, with a separate "if" block for each aggregate. value_count is already part of it. But this test (test_time2_variables) uses the aggregate "sum", and there is no separate "if" block for "sum", so the code flows to the "else" part.

Should a new "if" block be created for "sum" in this section?

    for s in many:
        if s.aggregate == "count":
            es_query.aggs[literal_field(canonical_name)].value_count.field = field_name
            s.pull = literal_field(canonical_name) + ".value"

maggienj commented 7 years ago

Posted below is the query it is generating now. What should the correct query look like for this test?


    {
        "aggs": {
            "_match": {
                "aggs": {
                    "_match": {
                        "aggs": {"v": {"sum": {"field": "v"}}},
                        "terms": {
                            "field": "a",
                            "include": ["x", "y"]
                        }
                    },
                    "_missing": {
                        "aggs": {"v": {"sum": {"field": "v"}}},
                        "filter": {"or": [
                            {"missing": {"field": "a"}},
                            {"not": {"terms": {"a": ["x", "y"]}}}
                        ]}
                    }
                },
                "range": {
                    "field": "t",
                    "ranges": [
                        {
                            "from": 1497225600,
                            "to": 1497312000
                        },
                        {
                            "from": 1497312000,
                            "to": 1497398400
                        },
                        {
                            "from": 1497398400,
                            "to": 1497484800
                        },
                        {
                            "from": 1497484800,
                            "to": 1497571200
                        },
                        {
                            "from": 1497571200,
                            "to": 1497657600
                        },
                        {
                            "from": 1497657600,
                            "to": 1497744000
                        },
                        {
                            "from": 1497744000,
                            "to": 1497830400
                        }
                    ]
                }
            },
            "_missing": {
                "aggs": {
                    "_match": {
                        "aggs": {"v": {"sum": {"field": "v"}}},
                        "terms": {
                            "field": "a",
                            "include": ["x", "y"]
                        }
                    },
                    "_missing": {
                        "aggs": {"v": {"sum": {"field": "v"}}},
                        "filter": {"or": [
                            {"missing": {"field": "a"}},
                            {"not": {"terms": {"a": ["x", "y"]}}}
                        ]}
                    }
                },
                "filter": {"or": [
                    {"or": [
                        {"range": {"t": {"lt": 1497225600}}},
                        {"range": {"t": {"gte": 1497830400}}}
                    ]},
                    {"missing": {"field": "t"}}
                ]}
            }
        },
        "size": 0
    }
maggienj commented 7 years ago

ES 1.7's "or" and "not" have been changed to ES 5.x's bool query with "should" and "must_not" clauses. The "missing" filter was also changed to "must_not" + "exists". These changes were applied to one block for testing, and the query looks like the one shown below.

{"aggs": {
    "_match": {
        "aggs": {"v": {"sum": {"field": "v"}}},
        "filter": {"match_all": {}}
    },
    "_missing": {
        "aggs": {"v": {"sum": {"field": "v"}}},
        "filter": {"bool": {"should": [
            {"bool": {"must_not": {"exists": {"field": "a"}}}},
            {"bool": {"must_not": {"terms": {"a": ["x", "y"]}}}}
        ]}}
    }
}}
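The rewrite described above can be sketched as a small recursive translation over plain dicts. `to_es5_filter` is a hypothetical helper, not part of ActiveData, and it only covers the three ES 1.x filters mentioned in this thread ("or", "not", "missing"); anything else is passed through unchanged.

```python
def to_es5_filter(f):
    """Rewrite ES 1.x "or" / "not" / "missing" filters as ES 5.x bool queries."""
    if not isinstance(f, dict):
        return f
    if "or" in f:
        # "or" becomes bool.should
        return {"bool": {"should": [to_es5_filter(x) for x in f["or"]]}}
    if "not" in f:
        # "not" becomes bool.must_not
        return {"bool": {"must_not": to_es5_filter(f["not"])}}
    if "missing" in f:
        # "missing" becomes must_not + exists
        field = f["missing"]["field"]
        return {"bool": {"must_not": {"exists": {"field": field}}}}
    return f  # e.g. "terms" and "range" are left as they are


old_filter = {"or": [
    {"missing": {"field": "a"}},
    {"not": {"terms": {"a": ["x", "y"]}}},
]}

new_filter = to_es5_filter(old_filter)
print(new_filter)
```

Applied to the `_missing` filter from the earlier query, this yields the same bool/should structure shown in the block above.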

    {
        "aggs": {
            "_match": {
                "aggs": {
                    "_match": {
                        "aggs": {"v": {"sum": {"field": "v"}}},
                        "filter": {"match_all": {}}
                    },
                    "_missing": {
                        "aggs": {"v": {"sum": {"field": "v"}}},
                        "filter": {"bool": {"should": [
                            {"bool": {"must_not": {"exists": {"field": "a"}}}},
                            {"bool": {"must_not": {"terms": {"a": ["x", "y"]}}}}
                        ]}}
                    }
                },
                "range": {
                    "field": "t",
                    "ranges": [
                        {
                            "from": 1497312000,
                            "to": 1497398400
                        },
                        {
                            "from": 1497398400,
                            "to": 1497484800
                        },
                        {
                            "from": 1497484800,
                            "to": 1497571200
                        },
                        {
                            "from": 1497571200,
                            "to": 1497657600
                        },
                        {
                            "from": 1497657600,
                            "to": 1497744000
                        },
                        {
                            "from": 1497744000,
                            "to": 1497830400
                        },
                        {
                            "from": 1497830400,
                            "to": 1497916800
                        }
                    ]
                }
            },
            "_missing": {
                "aggs": {
                    "_match": {
                        "aggs": {"v": {"sum": {"field": "v"}}},
                        "filter": {"match_all": {}}
                    },
                    "_missing": {
                        "aggs": {"v": {"sum": {"field": "v"}}},
                        "filter": {"bool": {"should": [
                            {"bool": {"must_not": {"exists": {"field": "a"}}}},
                            {"bool": {"must_not": {"terms": {"a": ["x", "y"]}}}}
                        ]}}
                    }
                },
                "filter": {"bool": {"should": [
                    {
                        "default": null,
                        "lt": [1497312000, "t"]
                    },
                    {
                        "default": null,
                        "gte": [1497916800, "t"]
                    },
                    {"bool": {"must_not": {"exists": {"field": "t"}}}}
                ]}}
            }
        },
        "size": 0
    }

Now there is a different error, "[lt] query malformed":

ERROR: Bad Request: {"error":{"root_cause":[{"type":"parsing_exception","reason":"[lt] query malformed, no start_object after query name","line":1,"col":704}],"type":"parsing_exception","reason":"[lt] query malformed, no start_object after query name","line":1,"col":704},"status":400}

I will check the equivalent of [lt] in ES 5.x.

Only the bottom-most part of the query above uses lt, as some sort of range without the "ranges" keyword, so that is where the problem could be. Maybe the "ranges" keyword is missing there, or maybe a filtered query with a range filter as its filter element should be used.
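For reference, the `{"default": null, "lt": [...]}` entries in the query above do not look like Elasticsearch query DSL, which is likely what the parser is rejecting. In ES 5.x a comparison on a field is written as a range query, `{"range": {field: {"lt": value}}}`. A minimal sketch of the out-of-domain filter under that assumption; `outside_domain_filter` is a hypothetical helper, not an ActiveData function.

```python
import json


def outside_domain_filter(field, min_value, max_value):
    """Match documents whose `field` is below min, at or above max, or absent."""
    return {"bool": {"should": [
        {"range": {field: {"lt": min_value}}},
        {"range": {field: {"gte": max_value}}},
        {"bool": {"must_not": {"exists": {"field": field}}}},
    ]}}


missing_filter = outside_domain_filter("t", 1497312000, 1497916800)
print(json.dumps(missing_filter, indent=2))
```

The two bounds are kept in separate range clauses on purpose: bounds inside a single range clause must all hold at once, so `lt` the minimum together with `gte` the maximum would match nothing.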

maggienj commented 7 years ago
    if edge.allowNulls:    # TODO: Use Expression.missing().esfilter() TO GET OPTIMIZED FILTER
        missing_filter = set_default(
            {"filter":
                 {"bool": {"should": [
                     {"range": {InequalityOp("lt", [edge.value, Literal(None, to_float(_min))]),
                     InequalityOp("gte", [edge.value, Literal(None, to_float(_max))]),
                          {"bool": {"must_not": edge.value.exists().to_esfilter()}}}}
                          ]}}
            }, es_query )

I added "range" here for the "lt" and "gte" ops, but it appears the code handles "range" differently in the next section, so I may need to find an alternative way of applying range here.

maggienj commented 7 years ago

Now it raises a different error, a dict error.
Possibly it is because of the newly added "range" in the code above.

ERROR: unhashable type: 'dict'

caused by
    ERROR: unhashable type: 'dict'
    File "C:\Users\user\PycharmProjects\ActiveData\pyLibrary\queries\es14\decoders.py", line 275, in _range_composer
    File "C:\Users\user\PycharmProjects\ActiveData\pyLibrary\queries\es14\decoders.py", line 295, in append_query
    File "C:\Users\user\PycharmProjects\ActiveData\pyLibrary\queries\es14\aggs.py", line 330, in es_aggsop
    File "C:\Users\user\PycharmProjects\ActiveData\pyLibrary\queries\jx_usingES.py", line 157, in query
    File "C:\Users\user\PycharmProjects\ActiveData\pyLibrary\queries\jx.py", line 71, in run
    File "C:\Users\user\PycharmProjects\ActiveData\active_data\actions\jx.py", line 62, in jx_query
    File "C:\Users\user\PycharmProjects\ActiveData\active_data\__init__.py", line 54, in output
maggienj commented 7 years ago

filter query--->bool--->should--->range

Currently, "range" is within "bool" --> "should". Is "range" allowed within "bool" --> "should"? I may need to check that.

maggienj commented 7 years ago

Removed the bool with should around the range, so the code now looks like this:

    if edge.allowNulls:    # TODO: Use Expression.missing().esfilter() TO GET OPTIMIZED FILTER
        missing_filter = set_default(
            {"filter": {
                     InequalityOp("lt", [edge.value, Literal(None, to_float(_min))]),
                               InequalityOp("gte", [edge.value, Literal(None, to_float(_max))]),
                               {"bool": {"must_not": edge.value.exists().to_esfilter()}}}},
             es_query)

still modifying this code....

maggienj commented 7 years ago

now, changed this section to...

    if edge.allowNulls:    # TODO: Use Expression.missing().esfilter() TO GET OPTIMIZED FILTER
        missing_filter = set_default(
            {"filter": { "bool": { "should": {
                     InequalityOp("lt", [edge.value, Literal(None, to_float(_min))]),
                               InequalityOp("gte", [edge.value, Literal(None, to_float(_max))]),
                               {"bool": {"must_not": edge.value.exists().to_esfilter()}}}}
                        }
            },
             es_query)

err is

Main Thread - "__init__.py:32" (send_error) - WARNING: Could not process
{"meta": {"testing": true}, "from": "testing_000_g", "select": {"aggregate": "sum", "value": "v"}, "edges": ["a", {"domain": {"max": "today", "interval": "day", "type": "time", "min": "today-week"}, "value": "t"}], "format": "list"}
    File "C:\Users\user\PycharmProjects\ActiveData\active_data\actions\__init__.py", line 32, in send_error
    File "C:\Users\user\PycharmProjects\ActiveData\active_data\actions\jx.py", line 100, in jx_query
    File "C:\Users\user\PycharmProjects\ActiveData\active_data\__init__.py", line 54, in output

ERROR: unhashable type: 'dict'
    File "C:\Users\user\PycharmProjects\ActiveData\pyLibrary\queries\es14\decoders.py", line 274, in _range_composer
    File "C:\Users\user\PycharmProjects\ActiveData\pyLibrary\queries\es14\decoders.py", line 302, in append_query
    File "C:\Users\user\PycharmProjects\ActiveData\pyLibrary\queries\es14\aggs.py", line 330, in es_aggsop
    File "C:\Users\user\PycharmProjects\ActiveData\pyLibrary\queries\jx_usingES.py", line 157, in query
    File "C:\Users\user\PycharmProjects\ActiveData\pyLibrary\queries\jx.py", line 71, in run
    File "C:\Users\user\PycharmProjects\ActiveData\active_data\actions\jx.py", line 62, in jx_query
    File "C:\Users\user\PycharmProjects\ActiveData\active_data\__init__.py", line 54, in output
klahnakoski commented 7 years ago

Look at

File "C:\Users\user\PycharmProjects\ActiveData\pyLibrary\queries\es14\decoders.py", line 274, in _range_composer

Check the code carefully: unhashable type: 'dict' can result from a dictionary (like {"a":1}) inside of a set (like {1, 2}). You get the same error with {{"a":1}}; you probably have extra curly braces around a dict.
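A short, self-contained reproduction of that failure mode in plain Python, independent of ActiveData. Note that a bare `{}` is an empty dict; braces become a set only when they contain elements without `key: value` pairs.

```python
empty = {}
print(type(empty).__name__)   # dict, not set

message = None
try:
    # Outer braces hold a bare element, so Python builds a set,
    # and a set cannot contain an unhashable dict.
    bad = {{"a": 1}}
except TypeError as e:
    message = str(e)
    print(message)

# Fix 1: drop the extra braces so the value is just the dict.
ok_dict = {"a": 1}

# Fix 2: if several clauses were intended, collect them in a list.
ok_list = [{"a": 1}, {"b": 2}]
```

In the snippets above, `{"should": { InequalityOp(...), InequalityOp(...), {...} }}` has exactly this shape: the inner braces hold bare elements, one of which is a dict.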

klahnakoski commented 7 years ago

It is easier to discuss code if you commit and push your issue branch, and make a pull request. Then you can see your net changes, and discuss specific lines that are causing a problem. As you make more changes, and push, the pull request will be updated.

klahnakoski commented 7 years ago

Plus, I am able to pull your (incomplete) code to get the same error and diagnose the problem.

maggienj commented 7 years ago

agreed. committed and pushed...

maggienj commented 7 years ago

After some modifications, here is the generated query (shown below). What should the correct generated query look like? I am not sure what it should look like, so I don't know how to tweak the query generator.

Error: the query appears to select "v", but the final output list does not show the "v" values.

Committed and pushed.

    {
        "aggs": {
            "_match": {
                "aggs": {
                    "_match": {
                        "aggs": {"v": {"sum": {"field": "v"}}},
                        "filter": {"match_all": {}}
                    },
                    "_missing": {
                        "aggs": {"v": {"sum": {"field": "v"}}},
                        "filter": {"bool": {"should": [
                            {"bool": {"must_not": {"exists": {"field": "a"}}}},
                            {"bool": {"must_not": {"terms": {"a": ["x", "y"]}}}}
                        ]}}
                    }
                },
                "range": {
                    "field": "t",
                    "ranges": [
                        {
                            "from": 1497398400,
                            "to": 1497484800
                        },
                        {
                            "from": 1497484800,
                            "to": 1497571200
                        },
                        {
                            "from": 1497571200,
                            "to": 1497657600
                        },
                        {
                            "from": 1497657600,
                            "to": 1497744000
                        },
                        {
                            "from": 1497744000,
                            "to": 1497830400
                        },
                        {
                            "from": 1497830400,
                            "to": 1497916800
                        },
                        {
                            "from": 1497916800,
                            "to": 1498003200
                        }
                    ]
                }
            },
            "_missing": {
                "aggs": {
                    "_match": {
                        "aggs": {"v": {"sum": {"field": "v"}}},
                        "filter": {"match_all": {}}
                    },
                    "_missing": {
                        "aggs": {"v": {"sum": {"field": "v"}}},
                        "filter": {"bool": {"should": [
                            {"bool": {"must_not": {"exists": {"field": "a"}}}},
                            {"bool": {"must_not": {"terms": {"a": ["x", "y"]}}}}
                        ]}}
                    }
                },
                "filter": {"bool": {"should": [
                    {"range": {"t": {
                        "gte": 1498003200,
                        "lt": 1497398400
                    }}},
                    {"bool": {"must_not": {"exists": {"field": "t"}}}}
                ]}}
            }
        },
        "size": 0
    }
maggienj commented 7 years ago

Continuing from the above: it is producing the output list below. I am not sure what happened to the "v" property, which appears to be missing from the output list.

"data": [
        {
            "a": "x",
            "t": 1497398400
        },
        {
            "a": "x",
            "t": 1497484800
        },

instead of this:

   "data": [
        {
            "a": "x",
            "t": 1497398400,
            "v": null
        },
        {
            "a": "x",
            "t": 1497484800,
            "v": null
        },
klahnakoski commented 7 years ago

Use ES head to see the result of the query. Once you have confirmed the result is correct, then we can review the code that builds up the "data": [] you showed me.

klahnakoski commented 7 years ago

I noticed this change:

-            {"filter": {"or": [
-                OrOp("or", [
-                    InequalityOp("lt", [edge.value, Literal(None, to_float(_min))]),
-                    InequalityOp("gte", [edge.value, Literal(None, to_float(_max))]),
-                ]).to_esfilter(),
-                edge.value.missing().to_esfilter()
-            ]}},
-            es_query
-        )
+            {"filter": { "bool": { "should": [
+                                    {"range": { "t": {
+                                        "lt": to_float(_min),
+                                        "gte":  to_float(_max)}}},
+                                        {"bool": {"must_not": edge.value.exists().to_esfilter()}}]
+                                 }
+                        }
+            },
+             es_query)

You replaced the expressions (OrOp, InequalityOp) with hand-written ES versions of the same. Each *Op can emit the Elasticsearch expression that means the same thing. Maybe we can let these operators write out the correct ES filter for us. But first we must fix them:

Here is some code from OrOp:

def to_esfilter(self):
    return {"or": [t.to_esfilter() for t in self.terms]}

Change this code to use "bool.should", like everywhere else; then you can revert back to the code I mentioned at the top of this comment.

Furthermore, everywhere you see {"or" : []} you could replace with OrOp("or", []).to_esfilter(); and the operator will write the "bool.should" code for you.
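A rough sketch of the idea, with simplified stand-ins for the real classes. The `OrOp` and `InequalityOp` below are hypothetical and do not mirror ActiveData's actual signatures: here `InequalityOp` takes a plain field name and a number instead of expression objects, purely to keep the example self-contained.

```python
class InequalityOp:
    """HYPOTHETICAL stand-in: op is "lt", "lte", "gt" or "gte"."""

    def __init__(self, op, terms):
        self.op = op
        self.field, self.value = terms

    def to_esfilter(self):
        return {"range": {self.field: {self.op: self.value}}}


class OrOp:
    """HYPOTHETICAL stand-in that emits the ES 5.x form of "or"."""

    def __init__(self, op, terms):
        self.terms = terms

    def to_esfilter(self):
        # Previously {"or": [...]}; ES 5.x expects bool.should instead.
        return {"bool": {"should": [t.to_esfilter() for t in self.terms]}}


es_filter = OrOp("or", [
    InequalityOp("lt", ["t", 1497398400]),
    InequalityOp("gte", ["t", 1498003200]),
]).to_esfilter()
print(es_filter)
```

With the translation living inside the operator, call sites stay the same across Elasticsearch versions and only `to_esfilter` has to change.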

maggienj commented 7 years ago

Trying to bring back the functions and make the bool/should change at the function level. Error: "Expecting an expression". Committed and pushed.

    def to_esfilter(self):
        return {"bool": {"should": [t.to_esfilter() for t in self.terms]}}
missing_filter = set_default(
            {"filter":
                 {OrOp("or",
                        [
                        OrOp("or",[
                            InequalityOp("lt", [edge.value, Literal(None, to_float(_min))]),
                            InequalityOp("gte", [edge.value, Literal(None, to_float(_max))]),
                            ]).to_esfilter(),
                        {"bool": {"must_not": edge.value.exists().to_esfilter()}}
                        ]).to_esfilter()
                 }
            },
            es_query
        )
maggienj commented 7 years ago

Trying to change "missing" at the function level in expressions.py, instead of in decoders.py. missingOp will now emit "bool": "must_not": "exists": "field": fieldname.

   def to_esfilter(self):
        if isinstance(self.expr, Variable):
            return {"bool": {"must_not": {"exists":  {"field": self.expr.var}}}}
klahnakoski commented 7 years ago

"Expecting an expression" means the expression constructors expect to be given expressions, not esfilters. The .to_esfilter() should be called only on the topmost operator.

{"bool": {"must_not": edge.value.exists().to_esfilter()}}

gets reduced to

edge.value.missing()

because it is in the OrOp
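To illustrate the rule, here is a small sketch with hypothetical stand-in classes. The `Expression` base class, the constructor check, and the use of `TypeError` are assumptions made for this example; ActiveData's real classes and error handling differ.

```python
class Expression:
    """HYPOTHETICAL base class for the stand-ins below."""

    def to_esfilter(self):
        raise NotImplementedError


class MissingOp(Expression):
    def __init__(self, field):
        self.field = field

    def to_esfilter(self):
        return {"bool": {"must_not": {"exists": {"field": self.field}}}}


class OrOp(Expression):
    def __init__(self, op, terms):
        for t in terms:
            if not isinstance(t, Expression):
                # ASSUMPTION: the real library reports this differently.
                raise TypeError("Expecting an expression")
        self.terms = terms

    def to_esfilter(self):
        return {"bool": {"should": [t.to_esfilter() for t in self.terms]}}


# Wrong: the inner term was already converted to a dict.
try:
    OrOp("or", [MissingOp("t").to_esfilter()])
    failed = False
except TypeError:
    failed = True

# Right: nest expressions, then convert once at the top.
es_filter = OrOp("or", [OrOp("or", [MissingOp("a")]), MissingOp("t")]).to_esfilter()
print(failed, es_filter)
```

The outer `to_esfilter()` walks the whole tree, so inner operators never need to be converted by hand.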

klahnakoski commented 7 years ago

Be sure to push your changes from the last session so I may review.

klahnakoski commented 7 years ago

Looking at the test test_time2_variables, we can see that it does not use limit==0. All the code manipulation we are doing for limit==0 is necessary so that we do not pass size=0 in a terms query, but it does not affect the output for this test.

The next step is to confirm, or deny, the tuples coming out of aggs_iterator() are correct.

klahnakoski commented 7 years ago

Please make a pull request for this issue so I can point to code.

This test does not have a limit==0, yet it branches on self.limit == 0. The problem is aggs.py around line 109; limit=0 should be set to None:

-    limit = 0
-    output[max_depth].append(AggsDecoder(edge, query, limit))
+    output[max_depth].append(AggsDecoder(edge, query, limit=None))

Remove all the code we added that deals with limit==0, including "_all", and make a pull request.

maggienj commented 7 years ago

Pull request completed for this issue. The unit test test_time_domain.TestTime.test_time2_variables passed. Closing this issue.