maggienj / ActiveData

Provide high speed filtering and aggregation over data
Mozilla Public License 2.0
0 stars 0 forks source link

fix - tests.test_jx.test_edge_1.TestEdge1.test_edge_limit_big #52

Closed maggienj closed 7 years ago

maggienj commented 7 years ago

fix unittest tests.test_jx.test_edge_1.TestEdge1.test_edge_limit_big

maggienj commented 7 years ago

For some reason, the current generated query doesn't fetch the numbers 11 and 12 as part of the query result.

For it to work, it has to fetch 11 and 12 along with the numbers it is fetching now, in the number list.

Trying to figure out... which part of the query resulted in not fetching 11 and 12... as it is already fetching other numbers successfully...

caused by

    ERROR: {"v": [
     1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
    13
]} does not match expected {"v": [
     1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
    11, 12, 13
]}

caused by

    ERROR: : (13 != 11 within 6 places)
    File "C:\Users\user\PycharmProjects\ActiveData\mo_testing\fuzzytestcase.py", line 218, in assertAlmostEqualValue
    File "C:\Users\user\PycharmProjects\ActiveData\mo_testing\fuzzytestcase.py", line 148, in assertAlmostEqual
    File "C:\Users\user\PycharmProjects\ActiveData\mo_testing\fuzzytestcase.py", line 146, in assertAlmostEqual
    File "C:\Users\user\PycharmProjects\ActiveData\mo_testing\fuzzytestcase.py", line 111, in assertAlmostEqual
    File "C:\Users\user\PycharmProjects\ActiveData\mo_testing\fuzzytestcase.py", line 111, in assertAlmostEqual
    File "C:\Users\user\PycharmProjects\ActiveData\tests\__init__.py", line 327, in compare_to_expected
    File "C:\Users\user\PycharmProjects\ActiveData\tests\__init__.py", line 219, in send_queries
    File "C:\Users\user\PycharmProjects\ActiveData\tests\__init__.py", line 150, in execute_tests
    File "C:\Users\user\PycharmProjects\ActiveData\tests\test_jx\test_edge_1.py", lin
maggienj commented 7 years ago

what does test_edge_limit_big mean? Trying to interpret this in simple logical terms...

data = [1,2,3,4,5,6,7,8,9,10,11,12,13] the max number as per the above set is 13. so, does it mean that it should list all the numbers up until 13?

why do we have extended_stats here... to show sum and count... isn't it sufficient to just get the max for this test? ( i do see that extended stats collects several statistics such as sum, count, min, max etc..... but in this instance just max is sufficient... i guess..... may be it is there just as a container to collect all possible stats.... though only max is being used )

_missing section appears to define .... "if the field k doesn't exist" or "if the field k doesn't contain a,b,c,d,e,f,g,h,i,j "

for some reason, it doesn't go beyond "j". ( need to figure this out.... )

k,l,m contains 11,12,13....

two things to figure out...

  1. why doesn't it go beyond j? because, if it went beyond j, it would have pulled 11 and 12.
  2. or is the expectation list specified incorrectly? (it can't be... because never seen a scenario where the expectation list was incorrect. on top of it, all these unit tests passed in es 1.7. ) so, have to assume that the expectation_list is correct,, given the above scenario.
maggienj commented 7 years ago

what does test_edge_limit_big mean? Trying to interpret this in simple logical terms...

maggienj commented 7 years ago

may be the query result is fetching only 10 values? as the default size is 10? because up until j , it looks like it is pulling 10 values..... as 11,12 and 13 goes beyond it.

but the generated query, has size set to 0.... so this can't be it too...

klahnakoski commented 7 years ago

what is the es query?

klahnakoski commented 7 years ago

The terms aggregation will take a size parameter.

klahnakoski commented 7 years ago

test_edge_limit_big is testing a high limit will return all the records

maggienj commented 7 years ago

es_query looks like this...

    {
        "aggs": {
            "_match": {
                "aggs": {"v": {"extended_stats": {"field": "v"}}},
                "terms": {
                    "field": "edg",
                    "include": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
                }
            },
            "_missing": {
                "aggs": {"v": {"extended_stats": {"field": "v"}}},
                "filter": {"bool": {"should": [
                    {"bool": {"must_not": {"exists": {"field": "edg"}}}},
                    {"bool": {"must_not": {"terms": {"edg": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]}}}}
                ]}}
            }
        },
        "size": 0
    }
klahnakoski commented 7 years ago

What code is responsible for building that list of include?

maggienj commented 7 years ago

in jx_elasticsearch\es14\decoders.py

#55 elif isinstance(e.value, Variable):
#56               schema = query.frum.schema
#57                col = schema[e.value.var][0]  ------- this line appears to be the source for the #include list

At this point(line#57), value shows only upto "j"..... the below lines appear to make use of this....

#161 include = [p[key] for p in domain.partitions] --------------- then, it is set in the include

maggienj commented 7 years ago

line#65

if col.partitions != None:
                    partitions = col.partitions[:limit:]
                    if e.domain.sort==-1:
                        partitions = list(reversed(sorted(partitions)))
                    else:

uses the above col.

klahnakoski commented 7 years ago

good. the col is coming from the metadata. you may recall that ActiveData queries ES for the metadata concerning the columns; the number of unique values, and the what those unique values are. Since we made changes to that logic to get it running, I suspect there is a problem with that code; specifically, it is not gathering all the unique values, but limiting itself to 10.

klahnakoski commented 7 years ago

From the docs at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html

By default, the terms aggregation will return the buckets for the top ten terms ordered by the doc_count. One can change this default behaviour by setting the size parameter.

maggienj commented 7 years ago

This line appears to overwrite that value in size property.
jx_elasticsearch\es14\aggs.py line #102

   limit = None
   output[max_depth].append(AggsDecoder(edge, query, limit))

Just created a new branch(off of es5) and simply added the size property with cardinality as value and it raised errs.

looks like limit has to be set to the value of cardinality... trying to figure out how to pass the value of cardinality(or some other variable/value to limit the value instead of None ) in this line....

     limit = 1000
     output[max_depth].append(AggsDecoder(edge, query, limit))
maggienj commented 7 years ago

jx_elasticsearch\es14\aggs.py

modded these lines to accept cardinality as limit.... as it appears to be used in the "size" property value in terms-size. line#42: cardinality = schema.columns[0].cardinality

line#102:

        limit = cardinality
        output[max_depth].append(AggsDecoder(edge, query, limit))
maggienj commented 7 years ago

test passed. ready to merge. pull_request created. https://github.com/klahnakoski/ActiveData/pull/41

maggienj commented 7 years ago

merged. closing this.