jaffrinkirthiga96 opened 4 years ago
Doing R&D on Apache Kafka to stream live data into Druid.
Have connected the MySQL data through Kafka into Druid, but the data is coming through in the wrong format. Will check and resolve it.
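One likely culprit I will check is the Connect converter settings: if the JSON converter is emitting schema envelopes, Druid sees records wrapped as {"schema": ..., "payload": ...} instead of plain rows. A minimal sketch of a connector-level override using the standard Kafka Connect JsonConverter (whether this is the actual cause here is my assumption, not yet verified):

    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false"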
After syncing the data between MySQL and Druid through Kafka, I observed that the data is duplicated in Druid. After referring to the docs, removing the duplicates from Druid through the schema is not achievable.
So I am trying to achieve this through Confluent. I am currently facing an issue with this and will try to resolve it. Also, I have loaded the data from June 20 to June 28 into the Druid DB.
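For context, the schema-level approach I ruled out is Druid's rollup, which only merges rows whose dimensions and truncated timestamps match, rather than doing true deduplication. A sketch of a Kafka supervisor spec with rollup enabled, assuming a recent Druid with a quickstart-style Overlord at localhost:8081, the topic name ("druid") produced by the JDBC connector, and hypothetical dimension columns product_id and category:

    curl -X POST -H "Content-Type: application/json" --data '{
      "type": "kafka",
      "spec": {
        "dataSchema": {
          "dataSource": "products_insights",
          "timestampSpec": { "column": "created_at", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["product_id", "category"] },
          "metricsSpec": [ { "type": "count", "name": "count" } ],
          "granularitySpec": { "type": "uniform", "segmentGranularity": "DAY", "queryGranularity": "MINUTE", "rollup": true }
        },
        "ioConfig": {
          "topic": "druid",
          "inputFormat": { "type": "json" },
          "consumerProperties": { "bootstrap.servers": "localhost:9092" }
        },
        "tuningConfig": { "type": "kafka" }
      }
    }' http://localhost:8081/druid/indexer/v1/supervisor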
I have pointed the machines to the below domains:
Production - cc.mie.one
Dev - ccd.mie.one
I have achieved live streaming of data into a Kafka topic pipeline through the Confluent JDBC source in my local environment. With the same config and settings, I am not able to get the stream data of ECOMMERCEDB into Kafka on the Hetzner machines.
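To debug this, I am checking the connector state on the Hetzner Connect worker through the Kafka Connect REST API (host and connector name as in the config later in this thread):

    curl http://78.47.148.83:8083/connectors
    curl http://78.47.148.83:8083/connectors/jdbc-source-connector_druid/status

The status output includes the per-task state and, for a failed task, the stack trace.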
I have resolved the issue and am now able to connect the MySQL data from the ecomm remote machine to commandcenter through Kafka. But I am facing a challenge while reading big tables like products_insights (6 crore+ rows). In the backend I am getting an error like "No space left on device" while reading the complete data. The Kafka streams are querying the DB like below:
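If the worker is writing its temp files under /tmp, pointing the JVM temp dir at a larger volume may help; a sketch, assuming a bigger volume mounted at /data and the default Confluent package layout (both assumptions):

    df -h /tmp                                        # confirm /tmp is the partition that fills up
    export KAFKA_OPTS="-Djava.io.tmpdir=/data/tmp"    # picked up by kafka-run-class.sh for the worker JVM
    connect-distributed /etc/kafka/connect-distributed.properties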
I am facing the below challenges in live streaming the data into Druid.

JDBC Confluent connector:

    curl -X POST -H "Content-Type: application/json" --data '{
      "name": "jdbc-source-connector_druid",
      "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "tasks.max": "2",
        "connection.url": "jdbc:mysql://116.203.124.171:3306/ECOMMERCEDB?useUnicode=true&useJDBCCompliantTimezoneShift=true&useLegacyDatetimeCode=false&serverTimezone=UTC",
        "connection.user": "ecomm",
        "connection.password": "P@ssWD^713$",
        "mode": "timestamp",
        "query": "select * from ECOMMERCEDB.products_insights",
        "topic.prefix": "druid",
        "timestamp.column.name": "created_at",
        "poll.interval.ms": 5000
      }
    }' http://78.47.148.83:8083/connectors

Case 1: With the mode "timestamp", the connector queries the DB fully at every 5-second poll interval. Hence it is not retrieving the data from the DB, and sometimes I am getting a "No space left on device" error in /tmp while writing into the streams.

Case 2: When I change the mode to "bulk" and the query to "select * from ECOMMERCEDB.products_insights where created_at >= CURDATE()", I am able to retrieve the current date's data from the DB, but the data is duplicated in Druid at every 5-second interval. This in turn results in wrong data in the Superset charts.
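A possible fix I plan to test for both cases: stay on an incremental mode instead of "bulk" and let the connector append its own filtering clause. Per the Confluent JDBC source docs, in "timestamp" or "timestamp+incrementing" mode with a custom query, the connector suffixes a WHERE clause on the tracked columns, so the query must not contain its own WHERE. A sketch of the changed config keys (the incrementing column "id" is an assumption about the table's primary key):

    "mode": "timestamp+incrementing",
    "timestamp.column.name": "created_at",
    "incrementing.column.name": "id",
    "query": "select * from ECOMMERCEDB.products_insights"

With an index on (created_at, id), each poll should touch only new or updated rows instead of scanning the full table, which should also stop the /tmp blow-ups and the 5-second duplicates.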
As suggested by Karthik, I am putting this ticket on hold for now and moving it back to To Do.
To load live data from ECOMMERCEDB MySQL to Druid.