stepwise incremental data import from database into ES

jprante / elasticsearch-jdbc

JDBC importer for Elasticsearch

Apache License 2.0

2.84k stars 709 forks source link

stepwise incremental data import from database into ES #520

Open topcoder1 opened 9 years ago

topcoder1 commented 9 years ago

does ES river/feeder have a mechanism to feed data from database into ES via jdbc incrementally in a stepwise function? I have billions of records to be imported into ES from mysql and a select * from my table would take a long time and be interrupted. I need to do select * from my_table where my_id>start_id and my_id<end_id and iterate through somehow. feeder/river doesn't seem to provides such mechanism, or am I missing something?

jprante commented 9 years ago

Of course you can import billions of records from MySQL.

Streaming mode is enabled by default, but you can also execute incremental select statements, even in parallel.

Can you describe what you mean by "would take a long time and be interrupted"?

topcoder1 commented 9 years ago

so the current syntax for feeding from sql is: "sql" : [ { "statement": "select * from test" } ],

I want to naturally divide it up say 100,000 records at a time based on increasing record id, how can this be done? I'm thinking of something like the following where start and end would be adjusted accordingly by the river. This is something that's supported by sphinx search out of box. "sql" : [ { "statement": "select * from test where my_id>=$start and my_id<$end order by my_id asc", "increment": "100000" } ],

is this possible? thanks!

4/4/2015 3:22 PM J�rg Prante (notifications@github.com)

Of course you can import billions of records from MySQL.

Streaming mode is enabled by default, but you can also execute incremental select statements, even in parallel.

Can you describe what you mean by "would take a long time and be interrupted"?

Reply to this email directly or view it on GitHub: https://github.com/jprante/elasticsearch-river-jdbc/issues/520#issuecomment-89672514

jprante commented 9 years ago

The JDBC MySQL driver works different from Sphinx search MySQL C driver.

JDBC MySQL offers a high performance streaming mode: _The combination of a forward-only, read-only result set, with a fetch size of Integer.MINVALUE serves as a signal to the driver to stream result sets row-by-row. See http://dev.mysql.com/doc/connector-j/en/connector-j-reference-implementation-notes.html for further details.

This is the default setting for MySQL driver in JDBC plugin, for best out-of-the-box experience.

If you have trouble to get your MySQL data into Elasticsearch, I'm available for help.

I have received questions about Postgresql, Oracle, and MS SQL which do not offer a streaming mode. Because of this, JDBC plugin disables auto commit and switches to read-only connection by default. Nevertheless I think an incremental mode is generally a good idea, in particular, for scheduling regular runs, to automate the portion of data to be fetched. There are already some variables available for SQL binding parameters which can be extended.