mardambey / mypipe

MySQL binary log consumer with the ability to act on changed rows and publish changes to different systems with emphasis on Apache Kafka.
http://mardambey.github.io/mypipe
Apache License 2.0

Snapshotting a large table #52

Open · mbittmann opened this issue 8 years ago

mbittmann commented 8 years ago

Hi,

I'm trying to snapshot a large table (~100 million rows) to Kafka in order to bootstrap a replica of a MySQL table on HDFS. I'm using the --no-transaction flag because I don't have FLUSH permissions on the database. First, I had to extend the timeout in the handleEvent method. Now I'm running into the following garbage collection error:

Exception in thread "metrics-meter-tick-thread-1" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "metrics-meter-tick-thread-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "metrics-meter-tick-thread-4" Exception in thread "shutdownHook1" java.lang.OutOfMemoryError: GC overhead limit exceeded

From what I can tell, it appears the entire table snapshot is contained within a single SelectEvent. This error occurs after a few minutes during the SelectConsumer.handleEvents() loop. Do you have any recommendations on how to get around the garbage collection issue? Thanks for all your work on this project!
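For context, here is a minimal sketch (not mypipe code) of why holding the whole snapshot in one event exhausts the heap, and how plain JDBC can stream rows from MySQL one at a time instead of buffering the entire result set. The connection settings and table name (big_table) are hypothetical, and it assumes MySQL Connector/J is on the classpath.

```scala
import java.sql.{DriverManager, ResultSet}

object StreamingSelectSketch extends App {
  // Hypothetical connection settings; requires MySQL Connector/J on the classpath.
  val conn = DriverManager.getConnection("jdbc:mysql://dbhost:3306/mydb", "user", "pass")

  // Connector/J only streams rows one at a time when the statement is
  // forward-only, read-only, and the fetch size is Integer.MIN_VALUE;
  // otherwise it buffers the entire result set in memory, which is what
  // exhausts the heap on a ~100 million row table.
  val stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
  stmt.setFetchSize(Integer.MIN_VALUE)

  val rs = stmt.executeQuery("SELECT * FROM big_table")
  while (rs.next()) {
    // Convert and publish each row (e.g. to Kafka) as it arrives,
    // rather than accumulating all rows into one in-memory event.
  }

  rs.close(); stmt.close(); conn.close()
}
```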

mardambey commented 8 years ago

@mbittmann thanks for the feedback.

The current implementation is very naive in terms of handling large tables. I've been looking at how similar projects handle this, and I like the way Sqoop can split a table into multiple parts based on a split-by column. I'm going to implement similar functionality for mypipe soon unless someone else gets to it first (=
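A rough sketch of the Sqoop-style split-by idea (not the mypipe implementation): find the bounds of a numeric split column, carve the range into chunks, and run one bounded SELECT per chunk so no single query ever materializes the whole table. The table name, split column, chunk count, and connection settings below are hypothetical.

```scala
import java.sql.DriverManager

object SplitBySketch extends App {
  // Hypothetical connection settings; adjust for your environment.
  val conn      = DriverManager.getConnection("jdbc:mysql://dbhost:3306/mydb", "user", "pass")
  val table     = "big_table"
  val splitCol  = "id"   // numeric, ideally indexed, column to split on
  val numSplits = 16

  // 1. Find the bounds of the split column.
  val bounds = conn.createStatement()
    .executeQuery(s"SELECT MIN($splitCol), MAX($splitCol) FROM $table")
  bounds.next()
  val (lo, hi) = (bounds.getLong(1), bounds.getLong(2))

  // 2. Carve [lo, hi] into numSplits inclusive ranges.
  val step   = math.max(1L, (hi - lo + 1) / numSplits)
  val ranges = (lo to hi by step).map(start => (start, math.min(start + step - 1, hi)))

  // 3. Query each range independently; each result set stays small enough
  //    to publish as its own batch instead of one giant in-memory event.
  ranges.foreach { case (start, end) =>
    val rs = conn.createStatement().executeQuery(
      s"SELECT * FROM $table WHERE $splitCol BETWEEN $start AND $end")
    while (rs.next()) {
      // ... build and publish events for this chunk ...
    }
    rs.close()
  }

  conn.close()
}
```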

In the meantime, you can try giving the JVM more memory (e.g. by raising -Xmx) and see if that helps, although this is really a terrible and very temporary solution at best.

mbittmann commented 8 years ago

Thanks for the reply! That makes sense. I ended up going with Sqoop to bootstrap the tables, which also has the advantage of bypassing Kafka. There were a few serialization issues to tackle, since SQL column types get mapped to different Avro types, such as timestamps and certain flavors of TINYINT.
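For anyone hitting the same mapping issues: Sqoop lets you override how individual columns are mapped with --map-column-java, and the Connector/J URL parameter tinyInt1isBit=false keeps TINYINT(1) columns from being reported as booleans. A hypothetical invocation (host, database, table, and column names are placeholders):

```
sqoop import \
  --connect 'jdbc:mysql://dbhost/mydb?tinyInt1isBit=false' \
  --table big_table \
  --split-by id \
  --as-avrodatafile \
  --map-column-java updated_at=String \
  --target-dir /data/big_table
```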

mardambey commented 8 years ago

Starting to make progress here, @mbittmann. See commits 63d1f43e0f1d025d511052e27a7e5b03e165a3bc and 6aff568244026bea87e438f526dd3969a9a81536.