headrun / SWIFT


Loading live stream data into Druid #123

Open · jaffrinkirthiga96 opened this issue 4 years ago

jaffrinkirthiga96 commented 4 years ago

To load live data from the ECOMMERCEDB MySQL database into Druid.

jaffrinkirthiga96 commented 4 years ago

Doing R&D on Apache Kafka to stream the live data into Druid.

  1. Installed Apache Kafka.
  2. Loaded the sample data into Kafka and streamed the same into Druid (screenshot attached). Need some more time to analyze how to live-stream the MySQL data into the Kafka pipeline directly; a sketch of this sample pipeline is below.
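For reference, a minimal sketch of the two steps above; the topic name, column names, and localhost ports are illustrative assumptions, not the actual setup:

```sh
# 1. Produce a sample JSON event into a Kafka topic
#    (assumes Kafka is running locally on the default port).
echo '{"created_at":"2021-06-20T10:00:00Z","product_id":1,"views":10}' | \
  bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic sample-events

# 2. Submit a Kafka ingestion supervisor spec to Druid (via the
#    Router on its default port) so it continuously consumes the
#    topic into a datasource.
curl -X POST -H "Content-Type: application/json" \
  http://localhost:8888/druid/indexer/v1/supervisor \
  --data '{
    "type": "kafka",
    "spec": {
      "dataSchema": {
        "dataSource": "sample_events",
        "timestampSpec": { "column": "created_at", "format": "iso" },
        "dimensionsSpec": { "dimensions": ["product_id", "views"] },
        "granularitySpec": { "segmentGranularity": "DAY", "queryGranularity": "NONE" }
      },
      "ioConfig": {
        "topic": "sample-events",
        "inputFormat": { "type": "json" },
        "consumerProperties": { "bootstrap.servers": "localhost:9092" },
        "useEarliestOffset": true
      },
      "tuningConfig": { "type": "kafka" }
    }
  }'
```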
jaffrinkirthiga96 commented 4 years ago

Have connected the MySQL data through Kafka into Druid, but the data format is coming through wrong. Will check and resolve it. (screenshot attached)
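To debug the wrong format, one option is to inspect the raw records on the topic itself (topic name `druid` assumed from the `topic.prefix` used in the connector config further below):

```sh
# Dump a few raw records from the topic to see exactly what the
# connector is producing (field names, envelope, encoding). If the
# connector writes Avro, use kafka-avro-console-consumer instead.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic druid --from-beginning --max-messages 5
```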

jaffrinkirthiga96 commented 4 years ago

After syncing the data between MySQL and Druid through Kafka, I observed that the data is duplicated in Druid. After referring to the docs, removing the duplicates from Druid through the schema is not achievable.

So I am trying to achieve this through Confluent. I am facing an issue with it right now, which I will try to resolve. Also, I have loaded the data from June 20 to June 28 into the Druid DB.
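For context on why schema-level dedup falls short: the closest knob in Druid is ingestion-time rollup, which merges rows that share a timestamp and identical dimension values. A sketch of the relevant spec fragment (the metric names here are assumptions):

```json
{
  "granularitySpec": { "segmentGranularity": "DAY", "queryGranularity": "HOUR", "rollup": true },
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "longSum", "name": "views_sum", "fieldName": "views" }
  ]
}
```

Rollup aggregates duplicates rather than discarding them: a row ingested twice is folded into the same bucket with its metrics summed twice, so the numbers stay wrong either way. Hence the move to fixing the duplication upstream via Confluent.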

jaffrinkirthiga96 commented 4 years ago

I have pointed the machines to the below domains:

  * Production - cc.mie.one
  * Dev - ccd.mie.one

jaffrinkirthiga96 commented 4 years ago

I have achieved live-streaming the data into a Kafka topic pipeline through the Confluent JDBC source on my local machine. With the same config and settings, I am not able to get the stream data of ECOMMERCEDB into Kafka on the Hetzner machines.
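A first diagnostic step when an identical config works locally but not remotely is to ask the Kafka Connect REST API on the Hetzner worker what state the connector is in (connector name taken from the config posted further below):

```sh
# List the connectors registered on the remote Connect worker.
curl http://78.47.148.83:8083/connectors

# Show connector/task state; a FAILED task carries the stack trace
# (e.g. auth or network errors while reaching the remote MySQL).
curl http://78.47.148.83:8083/connectors/jdbc-source-connector_druid/status
```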

jaffrinkirthiga96 commented 4 years ago

I have resolved the issue and am now able to connect the MySQL data from the ecomm remote machine to commandcenter through Kafka. But I am facing a challenge while reading a big table like products_insights (it has 6C+ records). In the backend I am getting an error like "No space left on device" while reading the complete data. The Kafka streams are querying the DB as shown in the attached screenshot.
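Two things possibly worth checking here (suggestions under assumptions, not verified on these machines): whether /tmp is simply full on the Connect worker, and whether the MySQL driver is buffering the whole table. MySQL Connector/J materializes the entire result set in memory by default; the `useCursorFetch` and `defaultFetchSize` URL properties make it stream rows in batches instead:

```sh
# Is /tmp actually out of space on the Connect worker?
df -h /tmp

# Cursor-based fetching in the JDBC URL, so products_insights is read
# in 10k-row batches rather than materialized all at once:
# jdbc:mysql://116.203.124.171:3306/ECOMMERCEDB?useUnicode=true&serverTimezone=UTC&useCursorFetch=true&defaultFetchSize=10000
```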

jaffrinkirthiga96 commented 4 years ago

I am facing the below challenges in live-streaming the data into Druid.

JDBC Confluent connector:

```sh
curl -X POST -H "Content-Type: application/json" --data '{
  "name": "jdbc-source-connector_druid",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "2",
    "connection.url": "jdbc:mysql://116.203.124.171:3306/ECOMMERCEDB?useUnicode=true&useJDBCCompliantTimezoneShift=true&useLegacyDatetimeCode=false&serverTimezone=UTC",
    "connection.user": "ecomm",
    "connection.password": "P@ssWD^713$",
    "mode": "timestamp",
    "query": "select * from ECOMMERCEDB.products_insights",
    "topic.prefix": "druid",
    "timestamp.column.name": "created_at",
    "poll.interval.ms": 5000
  }
}' http://78.47.148.83:8083/connectors
```

Case 1: Here the mode "timestamp" queries the DB fully at every 5-second interval. Hence it is not retrieving the data from the DB, and sometimes I am getting a "No space left" error in /tmp while writing into the streams.

Case 2: When I change the mode to "bulk" and the query to `select * from ECOMMERCEDB.products_insights where created >= CURDATE()`, I am able to retrieve the current date's data from the DB, but the data gets duplicated and stored in Druid at every 5-second interval. This in turn results in wrong data in the Superset charts.
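For what it's worth, one direction that might address both cases, sketched under assumptions (that products_insights has an auto-increment column, assumed here to be `id`; the connector name is a placeholder): the JDBC source's `timestamp+incrementing` mode commits an offset per row, so each poll fetches only rows that are new or updated since the last committed offset, instead of re-reading the whole table (Case 1) or re-publishing the day's data (Case 2).

```sh
# Hypothetical variant of the connector above: incremental mode keyed
# on (created_at, id), using table.whitelist instead of a custom query.
curl -X POST -H "Content-Type: application/json" --data '{
  "name": "jdbc-source-connector_druid_v2",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "1",
    "connection.url": "jdbc:mysql://116.203.124.171:3306/ECOMMERCEDB?useUnicode=true&serverTimezone=UTC",
    "connection.user": "ecomm",
    "connection.password": "P@ssWD^713$",
    "mode": "timestamp+incrementing",
    "timestamp.column.name": "created_at",
    "incrementing.column.name": "id",
    "table.whitelist": "products_insights",
    "topic.prefix": "druid_",
    "poll.interval.ms": 5000
  }
}' http://78.47.148.83:8083/connectors
```

Each source row would then appear exactly once on the `druid_products_insights` topic, which should remove the duplication seen in Druid and the Superset charts.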

jaffrinkirthiga96 commented 4 years ago

As suggested by Karthik, I am putting this ticket on hold for now and moving it back to To Do.