metatron-app / metatron-discovery

Powerful & Easy way for big data discovery
https://metatron.app
Apache License 2.0
440 stars 110 forks

Real-time Data ingestion; check ingested data #2024

Open wm9947 opened 5 years ago

wm9947 commented 5 years ago

https://metatron.app/2018/08/02/visualize-real-time-data-with-metatron-discovery/ I followed this post to set up real-time ingestion.

My Python code is slightly different from the one you provide:

import sys
import json
import math
from datetime import datetime
from time import sleep
from kafka import KafkaProducer

# producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer = KafkaProducer(value_serializer=lambda v: json.dumps(v).encode('utf-8'))

list_category = ['10', '20', '30']
for i in range(100000):
    for cur_cate in list_category:
        cur_result = {}
        cur_result['category'] = cur_cate
        cur_result['value_01'] = int(math.sin(i / float(cur_cate)) * 100) * 1
        cur_result['value_02'] = int(math.sin(i / float(cur_cate)) * 100) * 2
        cur_result['value_03'] = int(math.sin(i / float(cur_cate)) * 100) * 3
        cur_result['timestamp'] = datetime.strftime(datetime.utcnow(), "%Y-%m-%dT%H:%M:%SZ")

        print(json.dumps(cur_result))
        producer.send('realtime_sample', cur_result)
        sys.stdout.flush()
    producer.flush()
    sleep(1)

I successfully ingested the data into metatron-discovery and was able to build a real-time dashboard as you describe (screenshot attached). But when I try to view the ingested data on the datasource page, an error message is shown.

Could you check it?

Many thanks,

kyungtaak commented 5 years ago

@wm9947 This is an internal formatting issue. Could you change the time format as a workaround, as shown below?

kyungtaak commented 5 years ago

@wm9947 Alternatively, you can change the time format in the API request as follows: yyyy-MM-ddTHH:mm:ssZ -> yyyy-MM-dd'T'HH:mm:ssZ

※ The post was modified.
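For reference, here is a quick check (plain Python, nothing Metatron-specific) showing that the generator's strftime pattern emits a literal T between date and time; this is why the Joda-style pattern on the ingestion side needs that character quoted as 'T', so it is not read as a pattern letter:

```python
from datetime import datetime, timezone

# The generator's format string emits a literal 'T' and a literal 'Z'.
# On the ingestion side, a Joda-style pattern must quote the 'T': yyyy-MM-dd'T'HH:mm:ssZ
ts = datetime(2019, 5, 8, 5, 34, 35, tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
print(ts)  # 2019-05-08T05:34:35Z
```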

wm9947 commented 5 years ago

Your solutions work well, and I can now see the datasource detail. But...

As shown in the screenshot, the timestamp is not displayed, and the timestamp column in the downloaded CSV file contains "undefined" (as text).

Many thanks,

kyungtaak commented 5 years ago

@wm9947 Hmm.. Can I see the data from the generator?

wm9947 commented 5 years ago

@kyungtaak

{"category": "30", "timestamp": "2019-05-08T05:34:35Z", "value_02": 150, "value_03": 225, "value_01": 75}
{"category": "10", "timestamp": "2019-05-08T05:34:36Z", "value_02": 94, "value_03": 141, "value_01": 47}
{"category": "20", "timestamp": "2019-05-08T05:34:36Z", "value_02": -192, "value_03": -288, "value_01": -96}
{"category": "30", "timestamp": "2019-05-08T05:34:36Z", "value_02": 154, "value_03": 231, "value_01": 77}
{"category": "10", "timestamp": "2019-05-08T05:34:37Z", "value_02": 76, "value_03": 114, "value_01": 38}
{"category": "20", "timestamp": "2019-05-08T05:34:37Z", "value_02": -196, "value_03": -294, "value_01": -98}
{"category": "30", "timestamp": "2019-05-08T05:34:37Z", "value_02": 158, "value_03": 237, "value_01": 79}
{"category": "10", "timestamp": "2019-05-08T05:34:38Z", "value_02": 56, "value_03": 84, "value_01": 28}
{"category": "20", "timestamp": "2019-05-08T05:34:38Z", "value_02": -196, "value_03": -294, "value_01": -98}
{"category": "30", "timestamp": "2019-05-08T05:34:38Z", "value_02": 162, "value_03": 243, "value_01": 81}
{"category": "10", "timestamp": "2019-05-08T05:34:39Z", "value_02": 36, "value_03": 54, "value_01": 18}
{"category": "20", "timestamp": "2019-05-08T05:34:39Z", "value_02": -198, "value_03": -297, "value_01": -99}
{"category": "30", "timestamp": "2019-05-08T05:34:39Z", "value_02": 166, "value_03": 249, "value_01": 83}

This is what my Python code generates.

kyungtaak commented 5 years ago

@wm9947 The column name "timestamp" is internally reserved, so there seems to be a problem with the handling of reserved words. :( Can you change the name "timestamp" to something else in the generator and in the API request body?

The reserved-word handling will be tracked as a separate issue.

※ The post was modified again.
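As a sketch of this workaround in the generator (the replacement name event_time is just an example, not something the API requires):

```python
record = {"category": "10", "value_01": 47, "timestamp": "2019-05-08T05:34:36Z"}

# Rename the reserved "timestamp" key to a non-reserved one before sending.
record["event_time"] = record.pop("timestamp")
print(record)
```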

wm9947 commented 5 years ago

I tried ingesting other generated data, such as:

string, timestamp, double (Measure), string, string

When I ingest 5 columns where only one column is of Measure type and the others are Dimensions, and each row is mostly identical, the data can be ingested again.

But in this case, the Measure values are automatically merged. e.g. the Measure column contains the row number (1, 2, 3, 4, 5, ...) while the other columns hold the same data -> the ingested result in Metatron Discovery shows the Measure column summed.

If I change the column from Measure to Dimension, the rows stay separate even when I ingest the same data.

Sorry for not providing a screenshot. If you cannot understand what I mean, please let me know.

Many thanks,

kyungtaak commented 5 years ago

If you cannot understand what I mean, please let me know.

@wm9947 This is an issue with an option called rollup.

The concept of "rollup" comes from Druid. Druid can summarize raw data at ingestion time using the roll-up option. A roll-up is a first-level aggregation over a selected set of columns that reduces the size of the stored segments. We also use the roll-up option to improve the performance of some query operations. However, if each individual row is meaningful, you can set the rollup option to false and ingest. In fact, this is the most common case, so we changed the default to false (see screenshot).

In the API, you can set the option as follows:

...
"ingestion": {
        "type": "realtime",
        ...,
        "rollup": true
    }
...
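For illustration only, here is a minimal Python sketch of the roll-up idea (a conceptual model, not Druid's actual implementation): rows sharing the same dimension values and truncated timestamp collapse into one row, with the measure column aggregated:

```python
from collections import defaultdict

# Three raw rows; the first two share the same dimensions and timestamp bucket.
rows = [
    {"category": "10", "ts": "2019-05-08T05:34", "value_01": 1},
    {"category": "10", "ts": "2019-05-08T05:34", "value_01": 2},
    {"category": "20", "ts": "2019-05-08T05:34", "value_01": 5},
]

# Roll-up: group by the dimension columns and sum the measure column.
rolled = defaultdict(int)
for r in rows:
    rolled[(r["category"], r["ts"])] += r["value_01"]

print(dict(rolled))  # the two category-10 rows collapse into one row with value 3
```

With rollup set to false, all three rows would instead be stored individually.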

Regarding "rollup", it would be better to check the contents of the link below.

wm9947 commented 5 years ago

@kyungtaak Thanks for your kindness :) I just missed setting the rollup option.

Additionally, I got some information about types when I create a datasource via JSON POST, such as:

DataType: WKT, BOOLEAN, NUMBER, UNKNOWN, TEXT, DECIMAL, STRUCT, TIMESTAMP, ARRAY, FLOAT, INTEGER, STRING, MAP, DOUBLE, LONG
logicalType: POSTAL_CODE, GEO_POINT, HTTP_CODE, LNG, SEX, BOOLEAN, NUMBER, GEO_POLYGON, DISTRICT, URL, DOUBLE, LNT, NIN, STRUCT, TIMESTAMP, ARRAY, PHONE_NUMBER, MAP_KEY, INTEGER, GEO_LINE, MAP_VALUE, STRING, IP_V4, EMAIL, CREDIT_CARD

Do you have any documentation for these formats?

I tried to ingest WKT point data, but I don't know which format is accepted, e.g. 10.111,20.111 or POINT(10.111 20.111) or (10.111 20.111) or (10.111, 20.111)

Many thanks,

Many Thanks,

kyungtaak commented 5 years ago

@wm9947 First, a description of the WKT representation can be found here: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry That is, "POINT(10.111 20.111)" is the correct WKT.
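A minimal sketch of building such a string in Python (to_wkt_point is a hypothetical helper, not part of any library); note that WKT separates the two coordinates with a space and uses a period, not a comma, as the decimal separator:

```python
def to_wkt_point(x, y):
    # WKT POINT: coordinates inside parentheses, separated by a single space.
    return f"POINT ({x} {y})"

print(to_wkt_point(10.111, 20.111))  # POINT (10.111 20.111)
```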

You can set the data type and logical type for geometry-related columns as below.

{
    ...,
    "fields": [
        ...,
        {
            "name": "point_location",
            "type": "STRING",
            "logicalType": "GEO_POINT",  // WKT : POINT
            "role": "DIMENSION",
            "seq": 1
        },
        // or
        {
            "name": "line_location",
            "type": "STRING",
            "logicalType": "GEO_LINE",   // WKT : LINESTRING, MULTILINESTRING
            "role": "DIMENSION",
            "seq": 1
        },
        // or
        {
            "name": "polygon_location",
            "type": "STRING",
            "logicalType": "GEO_POLYGON",  // WKT : MULTILINESTRING, POLYGON, MULTIPOLYGON
            "role": "DIMENSION",
            "seq": 1
        }
    ]
}

※ More detailed data type related information will be described in the api document.

wm9947 commented 5 years ago

@kyungtaak I use

{
            "name": "Poin",
            "type": "STRING",
            "logicalType": "GEO_POINT", 
            "role": "DIMENSION",
            "seq": 4
        },

My code generates the string as below:

'Poin': 'POINT (33.1000000000000001 100.2020000000000001)'

It is generated by cur_result['Poin'] = wkt.dumps({'type': 'Point', 'coordinates': [la, lo]})

But the column data is missing (see screenshot).

Many thanks,

kyungtaak commented 5 years ago

@wm9947 I'm sorry for the late reply. There is one thing I missed: you must add the "format" property as shown below.

    {
            "name": "GeoPoint",
            "type": "STRING",
            "logicalType": "GEO_POINT",
            "role": "DIMENSION",
            "seq": 1,
            "format": {
                "type": "geo_point"
            }
     }
ximik3 commented 3 years ago

> There is one thing I missed. You must add the "format" property ("format": { "type": "geo_point" }) as shown in the comment above.

@kyungtaak I have the same issue. Data format example:

{"code":"LA","coords":"POINT (51.42205894062455 13.747762979057024)","country":"Lao People’s Democratic Republic","lat":51.42205894062455,"lng":13.747762979057024,"point":"POINT (18.0 105.0)"}
{"code":"MD","coords":"POINT (47.666765962573955 28.766420312548796)","country":"Republic of Moldova","lat":47.666765962573955,"lng":28.766420312548796,"point":"POINT (47.25 28.58333)"}
{"code":"KH","coords":"POINT (48.5273681970357 10.11163636642112)","country":"Kingdom of Cambodia","lat":48.5273681970357,"lng":10.11163636642112,"point":"POINT (13.0 105.0)"}
{"code":"CZ","coords":"POINT (49.52874343867825 15.736932294356984)","country":"Czechia","lat":49.52874343867825,"lng":15.736932294356984,"point":"POINT (49.75 15.0)"}
{"code":"PL","coords":"POINT (50.583994227408205 22.263452104234823)","country":"Republic of Poland","lat":50.583994227408205,"lng":22.263452104234823,"point":"POINT (52.0 20.0)"}

Ingestion without "format": { "type": "geo_point" } gives:

"event_time","lat","lng","coords","country","code","point"
"2020-10-19T09:53:31+0000","47.35356695410582","22.172929617427158","undefined","Hungary","HU","POINT ( )"
"2020-10-19T09:53:31+0000","46.277530657010686","20.106731262627182","undefined","Hungary","HU","POINT ( )"
"2020-10-19T09:53:31+0000","47.6556328533538","11.093962729708942","undefined","Principality of Liechtenstein","LI","POINT ( )"
"2020-10-19T09:53:31+0000","48.477095480336445","13.16176565938561","undefined","Republic of Austria","AT","POINT ( )"
"2020-10-19T09:53:31+0000","48.24618466276453","14.400907426432358","undefined","Republic of Austria","AT","POINT ( )"
"2020-10-19T09:53:31+0000","51.31381653688641","26.551735998073084","undefined","Republic of Belarus","BY","POINT ( )"
"2020-10-19T09:53:31+0000","49.85165022604589","27.74231184335249","undefined","Republic of Moldova","MD","POINT ( )"
"2020-10-19T09:53:31+0000","47.91721363600633","27.319068781645235","undefined","Republic of Moldova","MD","POINT ( )"
"2020-10-19T09:53:31+0000","50.55432969445604","22.70753745067759","undefined","Republic of Poland","PL","POINT ( )"
"2020-10-19T09:53:31+0000","50.91003915720814","21.90844270496809","undefined","Republic of Poland","PL","POINT ( )"

Ingestion with "format": { "type": "geo_point" } fails with:

2020-10-19 12:36:41.306 ERROR [127.0.0.1-admin] [http-nio-8180-exec-6] a.m.d.c.exception.RestExceptionHandler   : [API:/api/datasources] GB0001 null: NullPointerException: 
app.metatron.discovery.common.exception.UnknownServerException
    at app.metatron.discovery.common.exception.RestExceptionHandler.handleMiscFailures(RestExceptionHandler.java:96)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Any additional ideas?