strapdata / elassandra

Elassandra = Elasticsearch + Apache Cassandra
http://www.elassandra.io
Apache License 2.0

TTL deletions still reflected in elasticsearch documents #225

Closed zallan114 closed 6 years ago

zallan114 commented 6 years ago

elassandra: 6.2.3.4/5.5.0.13

STEPS TO REPRODUCE:

CREATE KEYSPACE test_tl WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '1'} AND durable_writes = true;

CREATE TABLE test_tl.clicks (
  userid uuid,
  url text,
  date timestamp,
  name text,
  PRIMARY KEY (userid, url)
) WITH default_time_to_live = 30 AND gc_grace_seconds = 0;

$ curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/test_tl -d '{"settings": { "keyspace":"test_tl" }, "mappings": { "clicks" : { "discover" : ".*" } }}'
{"acknowledged":true,"shards_acknowledged":true,"index":"test_tl"}

cqlsh> INSERT INTO test_tl.clicks ( userid, url, date, name) VALUES (6715e600-2eb0-11e2-81c1-0800200c9a66,'http://topm.org','2015-10-09', 'Jack') USING TTL 86400;

cqlsh> INSERT INTO test_tl.clicks ( userid, url, date, name) VALUES (9715e600-2eb0-11e2-81c1-0800200c9a66,'http://topm.org','2015-10-09', 'Mary');

After 30 seconds, Cassandra automatically deletes the last record (it falls under the table's default TTL), but the record is not removed from the Elasticsearch index:

CHECK ELASTICSEARCH SIDE:

"hits": { "total": 1, "max_score": 1, "hits": [

  {
    "_index": "test_tl",
    "_type": "clicks",
    "_id": """["9715e600-2eb0-11e2-81c1-0800200c9a66","url"]""",
    "_score": 1
  },
  {
    "_index": "test_tl",
    "_type": "clicks",
    "_id": """["6715e600-2eb0-11e2-81c1-0800200c9a66","http://topm.org"]""",
    "_score": 1,
    "_source": {
      "date": "2015-10-09T00:00:00.000Z",
      "name": "Jack",
      "userid": "6715e600-2eb0-11e2-81c1-0800200c9a66",
      "url": "http://topm.org"
    }
  }

How can I remove the records with an empty _source, or filter them out of the results? For example:

"_id": """["9715e600-2eb0-11e2-81c1-0800200c9a66","url"]""",

vroyer commented 6 years ago

Hi,

Yes, this is normal: Elasticsearch removes expired documents on compaction when index_on_compaction=true (the default is false). You should instead use time-frame indices and delete obsolete indices. Thanks.

zallan114 commented 6 years ago

thanks

zallan114 commented 6 years ago

Hi, @vroyer ,

"elasticsearch remove expired documents on compaction": but before compaction there are lots of records with an empty _source among the documents. If we compute a count, it could give a wrong result. Could you please let me know how to filter out those empty _source records, instead of removing them manually or waiting for compaction? Many thanks.

zallan114 commented 6 years ago

"You should rather use time-frame indices and delete obsolete indices."

==> Does this mean that TTL is not a good solution for clearing historical data? Or is this a TTL bug, in that the empty records returned on the Elasticsearch side cannot be cleared?

vroyer commented 6 years ago

TTL is good for clearing data from Cassandra. The idea is to use time-frame indices with a partition function and a retention shorter than your Cassandra TTL. For example: 1 index per day, retention 7d, Cassandra TTL=8d.
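
The retention scheme described above can be sketched in a few lines. This is only an illustration, assuming daily indices named with a hypothetical `clicks_YYYY.MM.DD` pattern; the prefix, the helper names, and the `RETENTION_DAYS` constant are assumptions made for the example, not part of Elassandra. It computes the index a given day belongs to and which existing indices fall outside the retention window.

```python
from datetime import date, datetime, timedelta

INDEX_PREFIX = "clicks_"  # hypothetical naming convention, one index per day
RETENTION_DAYS = 7        # kept shorter than the Cassandra TTL (8d in the example)


def index_for(day):
    """Return the name of the daily index that holds documents for `day`."""
    return f"{INDEX_PREFIX}{day:%Y.%m.%d}"


def expired_indices(existing, today):
    """Return the daily indices in `existing` that are outside the retention window."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    expired = []
    for name in existing:
        if not name.startswith(INDEX_PREFIX):
            continue  # not one of our time-frame indices
        try:
            day = datetime.strptime(name[len(INDEX_PREFIX):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a date, leave the index alone
        if day <= cutoff:
            expired.append(name)
    return sorted(expired)
```

Each name returned by `expired_indices` could then be dropped with an ordinary Elasticsearch delete-index request (`DELETE /<index>`), which removes the whole index at once instead of waiting for compaction.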

zallan114 commented 6 years ago

Yes, TTL is good for Cassandra, but why is a record with an empty _source still returned on the Elasticsearch side? Should this be considered a bug? At the very least I need to find out how to filter out those empty _source records, or they may make count calculations wrong.

DBarthe commented 6 years ago

Hello @zallan114,

The _source document is constructed during the fetch phase, when the data is pulled from the Cassandra table. At that point, all Elasticsearch filters and aggregations have already been applied using the values found in the index. Therefore you cannot filter out those empty records.

In any case, deletions by TTL at the Cassandra level are never propagated to the Elasticsearch indices. Even if you managed to filter out those "deleted by TTL" documents, your index would grow indefinitely because they would never be properly deleted.

If you want to rely on Cassandra TTL, that's fine, but you have to do what @vroyer said: use time-frame indices with a retention shorter than your Cassandra TTL, and delete the obsolete indices.

Hope this helps
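
Since the index itself cannot filter these hits, one possible stopgap is to drop them on the client after the response arrives. The sketch below assumes the search response has already been parsed into a Python dict shaped like the output in the original report; `drop_empty_hits` is a hypothetical helper, not an Elassandra or Elasticsearch API.

```python
def drop_empty_hits(response):
    """Return only the hits that carry a non-empty `_source`."""
    hits = response.get("hits", {}).get("hits", [])
    return [hit for hit in hits if hit.get("_source")]


# Sample shaped like the response in the report: one expired row, one live row.
response = {
    "hits": {
        "hits": [
            {"_index": "test_tl", "_type": "clicks",
             "_id": '["9715e600-2eb0-11e2-81c1-0800200c9a66","url"]',
             "_score": 1},
            {"_index": "test_tl", "_type": "clicks",
             "_id": '["6715e600-2eb0-11e2-81c1-0800200c9a66","http://topm.org"]',
             "_score": 1,
             "_source": {"date": "2015-10-09T00:00:00.000Z", "name": "Jack",
                         "userid": "6715e600-2eb0-11e2-81c1-0800200c9a66",
                         "url": "http://topm.org"}},
        ]
    }
}

live_hits = drop_empty_hits(response)
```

This only cleans up the page of hits that was fetched. `hits.total` and any aggregations are computed by Elasticsearch before the fetch phase, so they still include the expired documents.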

zallan114 commented 6 years ago

Thanks a lot @DBarthe , this helps.

silviucpp commented 6 months ago

@DBarthe if you use a partitioned index, how do you know which index to search when you query the data?