metatron-app / metatron-discovery

Powerful & Easy way for big data discovery
https://metatron.app
Apache License 2.0

Return value of 'COUNTD' function does not match actual cardinality. #967

Closed Taehui closed 5 years ago

Taehui commented 5 years ago

Describe the bug

To Reproduce

Expected behavior

Desktop (please complete the following information):

kyungtaak commented 5 years ago

@Taehui Count distinct is currently computed with the Cardinality aggregator (http://druid.io/docs/latest/querying/aggregations). While we were preparing the open-source release we used the more accurate DataSketches aggregator, but it has not been ported to 3.0 yet. For this issue we will first apply the DataSketches aggregator, and if its accuracy is still insufficient we will review further. Note that the DataSketches aggregator documentation (http://druid.io/docs/latest/development/extensions-core/datasketches-aggregators.html) itself states the following:

Note that sketch algorithms are approximate;
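To illustrate why a sketch-based count distinct can differ from the true cardinality, here is a minimal K-minimum-values (KMV) estimator in Python. KMV is the classic technique that theta sketches build on; this is a simplified sketch for illustration only, not the code Druid or the DataSketches library actually runs. The names `hash01` and `kmv_count_distinct` are hypothetical.

```python
import hashlib
import heapq


def hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1) (illustrative hash choice)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_count_distinct(values, k):
    """Return (estimate, estimation_mode), keeping only the k smallest hashes."""
    heap = []          # negated hashes, so heap[0] is the largest retained hash
    retained = set()   # hashes currently in the heap, used to skip duplicates
    overflowed = False # True once some distinct hash had to be dropped
    for value in values:
        h = hash01(value)
        if h in retained:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            retained.add(h)
            continue
        overflowed = True
        if h < -heap[0]:
            # New hash is smaller than the largest retained one: swap them.
            evicted = -heapq.heappushpop(heap, -h)
            retained.discard(evicted)
            retained.add(h)
    if not overflowed:
        # Fewer than (or exactly) k distinct hashes seen: the count is exact.
        return float(len(heap)), False
    # Otherwise estimate from the k-th smallest hash value.
    return (k - 1) / -heap[0], True


if __name__ == "__main__":
    cities = [f"city-{i % 48}" for i in range(1000)]
    print(kmv_count_distinct(cities, 65536))   # exact, since 48 < k
    users = [f"user-{i}" for i in range(10000)]
    print(kmv_count_distinct(users, 256))      # approximate, since 10000 > k
```

In this simplified model the result is exact while the number of distinct values stays below `k`, and becomes an estimate with some relative error once values have to be dropped. Whether a given aggregator can guarantee exact results therefore depends on its sketch size relative to the column's cardinality.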

Taehui commented 5 years ago

@kyungtaak Is the Cardinality aggregator an approximate method? Regardless of whether it is fast or slow, I think the values should match.

alchan-lee commented 5 years ago

Running a query with the DataSketches aggregator fails with the following NPE.

java.lang.NullPointerException
    at io.druid.query.aggregation.datasketches.theta.SketchEstimatePostProcessor$1$1.apply(SketchEstimatePostProcessor.java:94) ~[?:?]
    at io.druid.query.aggregation.datasketches.theta.SketchEstimatePostProcessor$1$1.apply(SketchEstimatePostProcessor.java:87) ~[?:?]
    at com.metamx.common.guava.MappingYieldingAccumulator.accumulate(MappingYieldingAccumulator.java:57) ~[java-util-1.3.3.jar:?]
    at com.metamx.common.guava.BaseSequence.makeYielder(BaseSequence.java:105) ~[java-util-1.3.3.jar:?]
    at com.metamx.common.guava.BaseSequence.toYielder(BaseSequence.java:82) ~[java-util-1.3.3.jar:?]
    at com.metamx.common.guava.MappedSequence.toYielder(MappedSequence.java:46) ~[java-util-1.3.3.jar:?]
    at com.metamx.common.guava.ResourceClosingSequence.toYielder(ResourceClosingSequence.java:52) ~[java-util-1.3.3.jar:?]
    at io.druid.server.QueryResource.doPost(QueryResource.java:300) [druid-server-0.9.1-SNAPSHOT.jar:0.9.1-SNAPSHOT]
    at sun.reflect.GeneratedMethodAccessor73.invoke(Unknown Source) ~[?:?]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_171]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_171]

The spec that was actually executed is shown below.

{
  "queryType": "groupBy",
  "dataSource": {
    "type": "table",
    "name": "sales"
  },
  "granularity": "all",
  "intervals": [
    "1970-01-01T00:00:00.0Z/2051-01-01T00:00:00.0Z"
  ],
  "virtualColumns": [],
  "dimensions": [
    {
      "type": "default",
      "dimension": "Country",
      "outputName": "Country"
    }
  ],
  "groupingSets": {
    "type": "names",
    "names": []
  },
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "expr",
        "expression": "in(time_format(__time,out.format='MMM yyyy',out.timezone='UTC',out.locale='en'), 'Jan 2011', 'Feb 2011')"
      }
    ]
  },
  "aggregations": [
    {
      "type": "thetaSketch",
      "name": "MEASURE_1",
      "fieldName": "City",
      "size": 65536,
      "shouldFinalize": false
    }
  ],
  "postAggregations": [],
  "limitSpec": {
    "type": "default",
    "windowingSpecs": [
      {
        "partitionColumns": [],
        "pivotSpec": {
          "separator": "―",
          "tabularFormat": true,
          "appendValueColumn": true,
          "valueColumns": [
            "MEASURE_1"
          ],
          "pivotColumns": [
            {
              "dimension": "Country",
              "direction": "ascending",
              "dimensionOrder": "alphanumeric"
            }
          ]
        }
      }
    ],
    "limit": 1000,
    "columns": []
  },
  "context": {
    "postProcessing": {
      "type": "sketch.estimate"
    }
  }
}

@kyungtaak @metatron-app/engine I confirmed that the query runs normally when limitSpec is removed. Could I get some advice on which part needs to be fixed? 😅

[ {
  "version" : "v1",
  "timestamp" : "1970-01-01T00:00:00.000Z",
  "event" : {
    "Country" : "United States",
    "MEASURE_1" : 48.0,
    "MEASURE_1.estimation" : false
  }
} ]
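
The diagnostic step reported in this comment (re-running the same spec with limitSpec removed) can be scripted. The following is a minimal sketch in Python, assuming the query spec has been loaded as a dict. `without_limit_spec` is a hypothetical helper and not part of metatron or Druid. It only reproduces the workaround used to isolate the NPE, not the actual fix.

```python
import copy
import json


def without_limit_spec(query):
    """Return a deep copy of a query spec with the top-level limitSpec removed."""
    stripped = copy.deepcopy(query)
    stripped.pop("limitSpec", None)  # no error if the key is absent
    return stripped


if __name__ == "__main__":
    # Abbreviated version of the spec from this issue (most fields omitted).
    query = {
        "queryType": "groupBy",
        "dataSource": {"type": "table", "name": "sales"},
        "aggregations": [
            {"type": "thetaSketch", "name": "MEASURE_1", "fieldName": "City", "size": 65536}
        ],
        "limitSpec": {"type": "default", "limit": 1000, "columns": []},
        "context": {"postProcessing": {"type": "sketch.estimate"}},
    }
    print(json.dumps(without_limit_spec(query), indent=2))
```

The helper copies the spec rather than mutating it, so the original and the stripped variant can both be submitted and their responses compared.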

navis commented 5 years ago

Fixed (#1020).