maggienj / ActiveData

Provide high speed filtering and aggregation over data
Mozilla Public License 2.0
0 stars 0 forks source link

fix - test_jx.test_expressions_w_set_ops.TestSetOps.test_select_when_on_multivalue #51

Open maggienj opened 7 years ago

maggienj commented 7 years ago

fix unit test test_jx.test_expressions_w_set_ops.TestSetOps.test_select_when_on_multivalue

maggienj commented 7 years ago

err msg is... "type":"illegal_argument_exception","reason":"Extraneous conditional statement.

one other msg... "type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards": ( this err msg may not be applicable, because we do see shards available and the es cluster is up and running... still, not sure why this msg would pop up )

caused by
caused by
    ERROR: Internal Server Error: {"error":{"root_cause":[{"type":"script_exception","reason":"compile error","script_stack":["(!((false)?(doc[\"a\"].values. ...","   ^---- HERE"],"script":"(!((false)?(doc[\"a\"].values.size()==0):((doc[\"a\"].values).contains(\"e\")))) ? (1) : (0)","lang":"painless"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"testing_000_n20170713_200436","node":"OX5gNnXQRtW6gThYBtxJ4w","reason":{"type":"script_exception","reason":"compile error","caused_by":{"type":"illegal_argument_exception","reason":"Extraneous conditional statement."},"script_stack":["(!((false)?(doc[\"a\"].values. ...","   ^---- HERE"],"script":"(!((false)?(doc[\"a\"].values.size()==0):((doc[\"a\"].values).contains(\"e\")))) ? (1) : (0)","lang":"painless"}}],"caused_by":{"type":"script_exception","reason":"compile error","caused_by":{"type":"illegal_argument_exception","reason":"Extraneous conditional statement."},"script_stack":["(!((false)?(doc[\"a\"].values. ...","   ^---- HERE"],"script":"(!((false)?(doc[\"a\"].values.size()==0):((doc[\"a\"].values).contains(\"e\")))) ? (1) : (0)","lang":"painless"}},"status":500}
    File "C:\Users\user\PycharmProjects\ActiveData\pyLibrary\env\elasticsearch.py", line 755, in post
    File "C:\Users\user\PycharmProjects\ActiveData\pyLibrary\env\elasticsearch.py", line 755, in post
    File "C:\Users\user\PycharmProjects\ActiveData\pyLibrary\env\elasticsearch.py", line 1090, in search
    File "C:\Users\user\PycharmProjects\ActiveData\jx_elasticsearch\es09\util.py", line 40, in post
    File "C:\Users\user\PycharmProjects\ActiveData\jx_elasticsearch\es14\setop.py", line 194, in extract_rows
    File "C:\Users\user\PycharmProjects\ActiveData\jx_elasticsearch\es14\setop.py", line 64, in es_setop
    File "C:\Users\user\PycharmProjects\ActiveData\jx_elasticsearch\jx_usingES.py", line 160, in query
    File "C:\Users\user\PycharmProjects\ActiveData\jx_python\jx.py", line 71, in run
    File "C:\Users\user\PycharmProjects\ActiveData\active_data\actions\jx.py", line 62, in jx_query
    File "C:\Users\user\PycharmProjects\ActiveData\active_data\__init__.py", line 54, in output
maggienj commented 7 years ago

this is how the current generated query looks like.
trying to figure these out...... in the first sight, does it look correct? how should the corrected query look like...

ERROR: Problem with search (path=/testing_000_n/test_result/_search):

{
    "from": 0,
    "query": {"bool": {"filter": {"match_all": {}}}},
    "script_fields": { 
          "is_c": {"script": "((false)?(doc[\"a\"].values.size()==0):((doc[\"a\"].values).contains(\"c\"))) ? (1) : (0)"},
          "is_e": {"script": "((false)?(doc[\"a\"].values.size()==0):((doc[\"a\"].values).contains(\"e\"))) ? (1) : (0)"},
          "not_e": {"script": "(!((false)?(doc[\"a\"].values.size()==0):((doc[\"a\"].values).contains(\"e\")))) ? (1) : (0)"}
 },
   "size": 10,
   "stored_fields": ["a"]
}
maggienj commented 7 years ago

also, another generated query in the same log...


ERROR: Problem with call to http://localhost:9200/testing_000_x/test_result/_search
{"query": {"bool": {"filter": {"match_all": {}}}}, "stored_fields": ["a"], "from": 0, "script_fields": {"not_e": {"script": "(!((false)?(doc[\"a\"].values.size()==0):((doc[\"a\"].values).contains(\"e\")))) ? (1) : (0)"}, "is_e": {"script": "((false)?(doc[\"a\"].values.size()==0):((doc[\"a\"].values).contains(\"e\"))) ? (1) : (0)"}, "is_c": {"script": "((false)?(doc[\"a\"].values.size()==0):((doc[\"a\"].values).contains(\"c\"))) ? (1) : (0)"}}, "size": 10}
maggienj commented 7 years ago
"script_stack":[
"(!((false)?(doc["a"].values. ...",
"   ^---- HERE"],"script":"(!((false)?(doc[\"a\"].values.size()==0):((doc[\"a\"].values).contains(\"e\")))) ? (1) : (0)","lang":"painless"}}],"caused_by":{"type":"script_exception","reason":"compile error","caused_by":{"type":"illegal_argument_exception","reason":"Extraneous conditional statement."},"script_stack":["(!((false)?(doc[\"a\"].values. ...","   ^---- HERE"],"script":"(!((false)?(doc[\"a\"].values.size()==0):((doc[\"a\"].values).contains(\"e\")))) ? (1) : (0)","lang":"painless"}},"status":500}

"(!((false)?(doc[\"a\"].values.size()==0):((doc[\"a\"].values).contains(\"e\")))) ? (1) : (0)"}
"   ^---- HERE"]    

(no errs in the below line.... it shows err in the above line only.., the above line has a notOperator (!) .... may need to check the data types and data type matching in either sides of the ternary operator....
because, datatypes should match in either sides for ternary operator as per ES5. )

false is boolean datatype....so it is ok... contains function returns a boolean .... so, it is ok too... 1 and 0 are integers... but it should be ok... because in certain context we do use 0 or 1 to represent true or false.

((false)?(doc[\"a\"].values.size()==0):((doc[\"a\"].values).contains(\"e\"))) ? (1) : (0)"},
maggienj commented 7 years ago

prpbably it is just the not "!".... because it didnt raise errs in other lines.... so , it may not be related to boolean... so , may not need to touch anything other than the "!" not operator... will check es5 / painless syntax for "!" (not operator)...

maggienj commented 7 years ago

well, it doesn't seem to be related to "not operator " or !.

search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"testing_000_h20170713_213247","node":"OX5gNnXQRtW6gThYBtxJ4w","reason":{"type":"script_exception","reason":"compile error","caused_by":{"type":"illegal_argument_exception","reason":"Extraneous conditional statement."},"script_stack":["(!((false)?File "C:\Users\user\PycharmProjects\ActiveData\pyLibrary\env\elasticsearch.py", line 755, in post

klahnakoski commented 7 years ago

what does (doc[\"a\"].values.size()==0) ? (1) : (0) give? how about (false) ? (1) : (0) ?

It may be complaining about the (false)

does contains return a boolean?

maggienj commented 7 years ago
 "not_e": {
      "script": "(!((false)?(doc[\"a\"].values.size()==0):((doc[\"a\"].values).contains(\"e\")))) ? (1) : (0)"
    }

all datatypes should be same in this line.... but.... not sure, why size==0 exists here.....

Logically, if the value is not equal to "e" then set the scripted_field, "not_e" to 1...... otherwise , set it to "0"

if we have to check for size > 0 for field's value, so that we don't get stackoverflow or comparison with null values, then, there should be another nested ternary operator..... so i'm trying to simplify this to understand... to fix this...

pseudo_code for inner condition: (checking for greater than 0 , instead of == 0 as that is how the loop flows in this version of pseudo code... )

if (doc[\"a\"].values.size() > 0) then   
       if  ((doc[\"a\"].values).contains(\"e\"))   then "not_e" = 1
       else:
       "not_e" = 0
else 
    "not_e" = 0

if we have to rewrite the above with ternary operator , then ...

(doc[\"a\"].values.size() > 0) ?   
       (   doc[\"a\"].values).contains(\"e\")   ?  1 : 0 )  
: 0 

so, if we look at this there are two "?"s and two ":"s.

will try to run a similar modded one... in es-head....

klahnakoski commented 7 years ago

ESv1.4 allowed doc[\"a\"].value to pull the value, but it only pulled the first value. doc[\"a\"].values pulls a an array of values; an empty list if there are no values. size() is a function that returns the length of an array. doc[\"a\"].values.size()==0 is the same as doc[\"a\"].empty.

klahnakoski commented 7 years ago
if false:
    temp = doc["a"].values.size()==0
else:
    temp = doc["a"].values.contains("e")
if not temp:
    return 1
else:
    return 0
maggienj commented 7 years ago

So, modded this as per the above pseudo-code and ran it in es-head. the below code snippet works fine ....

** we are supposed to make as minimal changes as possible... but i jut tried it as per my understanding to figure out what code would pass in es-head for this to work logically with es5 painless syntax

"script_fields": {
    "not_e": {
      "script": " ( (doc[\"a\"].values.size()>0)  ? (  doc[\"a\"].values.contains(\"e\")  ? 1 : 0) : 0 )"
    }
  }
maggienj commented 7 years ago

just saw the questions posted....

here are the answers for them... (doc[\"a\"].values.size()==0) ? (1) : (0) returns 0

(false) ? (1) : (0) as expected, (false) raises err... ( in one of the previous tests, we removed (false) or (true) in the condition... i guess.... as it raised some errs... that's why i kinda removed the (false) in the above queries... )

yes, contains returns a boolean.

klahnakoski commented 7 years ago

I tried this, to see if I can get the class name

{
  "from": 0,
  "query": {
    "bool": {
      "filter": {
        "match_all": {}
      }
    }
  },
  "script_fields": {
    "is_c": {
      "script": "doc[\"a\"].getClass().getName()"
    }
  },
  "size": 10,
  "stored_fields": [
    "a"
  ]
}

that did not work, but I did get the class name:

"Unable to find dynamic method [getClass] with [0] arguments for class [org.elasticsearch.index.fielddata.ScriptDocValues.Strings]."

which I googled to get here

https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/index/fielddata/ScriptDocValues.java#L555

follow the parent classes to java.util.AbstractList; https://docs.oracle.com/javase/7/docs/api/java/util/AbstractList.html and to the parent https://docs.oracle.com/javase/7/docs/api/java/util/Collection.html

to verify that org.elasticsearch.index.fielddata.ScriptDocValues.Strings is a Collection which has a contains() method that returns boolean

klahnakoski commented 7 years ago

OK, searched for "elasticsearch "Extraneous conditional statement."" to get here https://github.com/elastic/elasticsearch/blob/master/modules/lang-painless/src/main/java/org/elasticsearch/painless/node/EConditional.java#L67 which means Elasticsearch is being smart and looking for constants; and complaining if they exist.

klahnakoski commented 7 years ago

This is a terrible thing. We must perform constant propagation and dead code elimination to get around this.

maggienj commented 7 years ago

why do we need this (false) in this code-snippet here?

if false:
    temp = doc["a"].values.size()==0
else:
    temp = doc["a"].values.contains("e")
if not temp:
    return 1
else:
    return 0

I think that (false) is extraneous and that's why the err is... that;s why avoided that (false) in the previous tries....

So, without the (false) the logic would look like this..

pseudo_code-1

if (doc[\"a\"].values.size() > 0) then   
       if  ((doc[\"a\"].values).contains(\"e\"))   then "not_e" = 1
       else:
       "not_e" = 0
else 
    "not_e" = 0
@extend(WhenOp)
def to_painless(self, not_null=False, boolean=False, many=False):
    return "(" + self.when.to_painless(boolean=True) + ") ? (" + self.then.to_painless(not_null=not_null) + ") : (" + self.els_.to_painless(not_null=not_null) + ")"

will have to check if self.when.to_painless(boolean=True) is the one that is resulting in (false) and may be remove this extraneous condition check...

klahnakoski commented 7 years ago

The code looks dumb to a human, but the machine is just writing out expressions. Consider

{"eq": {"a": "c"}}  

which is the same as

EqOp("eq", [Variable("a"), Literal("c")])

when running this code:

    return "(" + rhs_missing + ")?(" + lhs_list + ".size()==0):((" + lhs_list + ").contains(" + rhs + "))"

the rhs is Literal("c"), so Literal("c").missing() is always false, which is put as the first parameter of the conditional. The rhs_missing could also be True or None, so we have to check those cases too.

(1==2)?(doc[\"a\"].values.size()==0):((doc[\"a\"].values).contains(\"c\"))) ? (1) : (0)

Has the same problem: Elasticsearch knows that 1 is a constant, 2 is a constant, therefore 1==2 is a constant, so throw an error. Our code must be smarter than the Elasticsearch code if we are going to trick it into running our painless scripts. Our code must perform the comparision and simplify the code based on the conditional.

maggienj commented 7 years ago

this script runs fine for "is_c" and "is_e"....
but raises err for "not_e" ...

{
  "from": 0,
  "query": {
    "bool": {
      "filter": {
        "match_all": {}
      }
    }
  },
  "script_fields": {
    "is_c": {
      "script": "(doc[\"a\"].values.size()==0 ? ( (doc[\"a\"].values).contains(\"c\" )   ?  (1) : (0 ) ):0)"
    },
    "is_e": {
      "script": "(doc[\"a\"].values.size()==0 ? ( (doc[\"a\"].values).contains(\"e\" )   ?  (1) : (0 ) ):0)"
    }
  },
  "size": 10,
  "stored_fields": [
    "a"
  ]
}

"not_e" raises err.. because, need to figure out which adds the notOperator "!" and then balance the parentheses for notOperator... ( it balances the param for the above scripted fields and they run fine... but not the below one... )

    "not_e": {
      "script": "(!(doc[\"a\"].values.size()==0 ? ( (doc[\"a\"].values).contains(\"e\" )  ) ?  (1) : (0 ) ):0)"
    }
klahnakoski commented 7 years ago

We will now trick painless to do what we want: Replace all instances of false with System.currentTimeMillis() < 0 and true with System.currentTimeMillis()>0.

klahnakoski commented 7 years ago

I added partial evaluation code

Referring to https://github.com/klahnakoski/ActiveData/blob/ca913a7852340933f9c4f79b8d28efa625ba7938/jx_elasticsearch/es52/expressions.py#L373

if boolean:
    return WhenOp(
        "when",
        self.rhs.missing(),
        **{"then": self.lhs.missing(), "else": InOp("in", [self.rhs, self.lhs])}
    ).partial_eval().to_painless(boolean=True)
else:
    return WhenOp(
        "when",
        OrOp("or", [self.rhs.missing(), self.lhs.missing()]),
        **{"then": NullOp(), "else": InOp("in", [self.rhs, self.lhs])}
    ).partial_eval().to_painless(boolean=True)

All expressions now have a partial_eval method to simplify literal expressions. This is not completed for all expressions, but it is done enough to pass the test. All the non-trivial to_painless code must construct an expression and run partial_eval. String concatenation will no longer work in all cases.

https://github.com/klahnakoski/ActiveData/blob/ca913a7852340933f9c4f79b8d28efa625ba7938/jx_base/expressions.py#L1551

def partial_eval(self):
    when = self.when.partial_eval()
    if isinstance(when, Literal):
        if isinstance(when, TrueOp):
            return self.then.partial_eval()
        elif isinstance(when, (FalseOp, NullOp)):
            return self.els_.partial_eval()
        else:
            Log.error("Expecting `when` clause to return a Boolean, or `null`")
    else:
        self.simplified = True
        return self

The common patterns to notice here: Each expression will look for Literal parameters that will help simplify the logic; there is always the case that the expression can not be constant-folded, so we return self; we set the simplified parameter so we do not go into logic loops.