Closed maggienj closed 7 years ago
For some reason, the current generated query doesn't fetch the numbers 11 and 12 as part of the query result.
For it to work, it has to fetch 11 and 12 along with the numbers it is fetching now, in the number list.
Trying to figure out... which part of the query resulted in not fetching 11 and 12... as it is already fetching other numbers successfully...
caused by
ERROR: {"v": [
1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
13
]} does not match expected {"v": [
1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13
]}
caused by
ERROR: : (13 != 11 within 6 places)
File "C:\Users\user\PycharmProjects\ActiveData\mo_testing\fuzzytestcase.py", line 218, in assertAlmostEqualValue
File "C:\Users\user\PycharmProjects\ActiveData\mo_testing\fuzzytestcase.py", line 148, in assertAlmostEqual
File "C:\Users\user\PycharmProjects\ActiveData\mo_testing\fuzzytestcase.py", line 146, in assertAlmostEqual
File "C:\Users\user\PycharmProjects\ActiveData\mo_testing\fuzzytestcase.py", line 111, in assertAlmostEqual
File "C:\Users\user\PycharmProjects\ActiveData\mo_testing\fuzzytestcase.py", line 111, in assertAlmostEqual
File "C:\Users\user\PycharmProjects\ActiveData\tests\__init__.py", line 327, in compare_to_expected
File "C:\Users\user\PycharmProjects\ActiveData\tests\__init__.py", line 219, in send_queries
File "C:\Users\user\PycharmProjects\ActiveData\tests\__init__.py", line 150, in execute_tests
File "C:\Users\user\PycharmProjects\ActiveData\tests\test_jx\test_edge_1.py", lin
what does test_edge_limit_big mean? Trying to interpret this in simple logical terms...
data = [1,2,3,4,5,6,7,8,9,10,11,12,13] the max number as per the above set is 13. so, does it mean that it should list all the numbers up until 13?
why do we have extended_stats here... to show sum and count... isn't it sufficient to just get the max for this test? ( i do see that extended stats collects several statistics such as sum, count, min, max etc..... but in this instance just max is sufficient... i guess..... may be it is there just as a container to collect all possible stats.... though only max is being used )
_missing section appears to define .... "if the field k doesn't exist" or "if the field k doesn't contain a,b,c,d,e,f,g,h,i,j "
for some reason, it doesn't go beyond "j". ( need to figure this out.... )
k,l,m contains 11,12,13....
two things to figure out...
what does test_edge_limit_big mean? Trying to interpret this in simple logical terms...
may be the query result is fetching only 10 values? as the default size is 10? because up until j , it looks like it is pulling 10 values..... as 11,12 and 13 goes beyond it.
but the generated query, has size set to 0.... so this can't be it too...
what is the es query?
The terms aggregation will take a size
parameter.
test_edge_limit_big
is testing a high limit
will return all the records
es_query looks like this...
{
"aggs": {
"_match": {
"aggs": {"v": {"extended_stats": {"field": "v"}}},
"terms": {
"field": "edg",
"include": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
}
},
"_missing": {
"aggs": {"v": {"extended_stats": {"field": "v"}}},
"filter": {"bool": {"should": [
{"bool": {"must_not": {"exists": {"field": "edg"}}}},
{"bool": {"must_not": {"terms": {"edg": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]}}}}
]}}
}
},
"size": 0
}
What code is responsible for building that list of include
?
in jx_elasticsearch\es14\decoders.py
#55 elif isinstance(e.value, Variable):
#56 schema = query.frum.schema
#57 col = schema[e.value.var][0] ------- this line appears to be the source for the #include list
At this point(line#57), value shows only upto "j"..... the below lines appear to make use of this....
#161 include = [p[key] for p in domain.partitions] --------------- then, it is set in the include
line#65
if col.partitions != None:
partitions = col.partitions[:limit:]
if e.domain.sort==-1:
partitions = list(reversed(sorted(partitions)))
else:
uses the above col.
good. the col is coming from the metadata. you may recall that ActiveData queries ES for the metadata concerning the columns; the number of unique values, and the what those unique values are. Since we made changes to that logic to get it running, I suspect there is a problem with that code; specifically, it is not gathering all the unique values, but limiting itself to 10.
From the docs at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
By default, the terms aggregation will return the buckets for the top ten terms ordered by the doc_count. One can change this default behaviour by setting the size parameter.
This line appears to overwrite that value in size property.
jx_elasticsearch\es14\aggs.py
line #102
limit = None
output[max_depth].append(AggsDecoder(edge, query, limit))
Just created a new branch(off of es5) and simply added the size property with cardinality as value and it raised errs.
looks like limit has to be set to the value of cardinality... trying to figure out how to pass the value of cardinality(or some other variable/value to limit the value instead of None ) in this line....
limit = 1000
output[max_depth].append(AggsDecoder(edge, query, limit))
jx_elasticsearch\es14\aggs.py
modded these lines to accept cardinality as limit.... as it appears to be used in the "size" property value in terms-size.
line#42:
cardinality = schema.columns[0].cardinality
line#102:
limit = cardinality
output[max_depth].append(AggsDecoder(edge, query, limit))
test passed. ready to merge. pull_request created. https://github.com/klahnakoski/ActiveData/pull/41
merged. closing this.
fix unittest tests.test_jx.test_edge_1.TestEdge1.test_edge_limit_big