jprante / elasticsearch-jdbc

JDBC importer for Elasticsearch
Apache License 2.0
2.84k stars 709 forks source link

"ignore_null_values" doesn't work with the "column" strategy #370

Open pib opened 10 years ago

pib commented 10 years ago

When the "column" strategy is chosen, null values get inserted. This wouldn't be too much of a problem if it weren't for the fact that I am indexing geo points and the geo point parser flips out when it gets null for latitude or longitude.

I could force the values to be non-null with IFNULL(lat, 0), but then I will have a bunch of locations in my map in the middle of the Atlantic ocean, which is worse than having things not on the map at all :|

The test script below generates this output:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "mytest",
      "_type" : "jdbc",
      "_id" : "0",
      "_score" : 1.0,
      "_source":{"movie":{"title":null,"overview":null,"test":null},"source":"tv"}
    }, {
      "_index" : "mytest",
      "_type" : "jdbc",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{"movie":{"title":"Krieg der Welten","overview":"Eines windigen herbstlichen Nachmittags wird der","test":1212},"source":"tv"}
    } ]
  }
}
#!/bin/sh

/usr/local/bin/mysql -u root test <<EOT
DROP TABLE IF EXISTS test;
create table test (
    id integer,
    title varchar(32),
    overview varchar(64),
    test integer,
    source varchar(32),
    created_at timestamp,
    updated_at timestamp
);
insert into test values (0,NULL,NULL,NULL,"tv", NOW(), NOW());
insert into test values (1,"Krieg der Welten","Eines windigen herbstlichen Nachmittags wird der",1212,"tv", NOW(), NOW());
EOT

echo "deleting river ..."

curl -XDELETE 'localhost:9200/_river/my_test_river/'

echo "deleting index ..."

curl -XDELETE 'localhost:9200/mytest'

echo "river request ..."

curl -XPOST 'localhost:9200/_river/my_test_river/_meta' -d '
{
    "type" : "jdbc",
    "strategy": "column",
    "jdbc" : {
        "url" : "jdbc:mysql://localhost:3306/test",
        "user" : "root",
        "password" : "",
        "sql" : "select id as _id, title as \"movie.title\", overview as \"movie.overview\", test as \"movie.test\", source from test",
        "ignore_null_values" : true,
        "index" : "mytest"
    },
    "last_run_timestamp_overlap": "30s"
}
'

echo "sleeping while river should run..."

sleep 10

curl -XPOST 'localhost:9200/mytest/_search?pretty' -d '
{
       "query": {
          "match_all": {
           }
       }
}'

curl -XDELETE 'localhost:9200/_river/my_test_river/'
pib commented 10 years ago

I think I've tracked down why this doesn't work.

In the simple strategy, it creates a stream listener and sets the shouldIgnoreNull setting on it based on the settings: SimpleRiverSource.java#L385

In the column strategy it doesn't set that setting: ColumnRiverSource.java#L131

So I think that a fix should just be a matter of adding .shouldIgnoreNull(context.shouldIgnoreNull() to the column strategy code at that point.

pib commented 10 years ago

Yep, the fix was indeed that simple.