scylladb / scylla-migrator

Migrate data extract using Spark to Scylla, normally from Cassandra
Apache License 2.0

migrating tables with cdc enabled ends with Failed to execute #84

Open carlo4002 opened 1 year ago

carlo4002 commented 1 year ago

Hello guys

I am testing a migration of a table with CDC enabled on both the source (Cassandra) and the target (ScyllaDB). The job failed with the following message:

22/07/13 14:09:49 ERROR QueryExecutor: Failed to execute: com.datastax.spark.connector.writer.RichBoundStatementWrapper@6512ab5f
com.datastax.oss.driver.api.core.servererrors.InvalidQueryException: cdc: attempted to get a stream from an earlier generation than the currently used one. With CDC you cannot send writes with timestamps too far into the past, because that would break consistency properties (write timestamp: 2018/11/21 23:56:33, current generation started at: 2022/07/11 10:15:34)

I cannot change the current generation because its start is the date the cluster was created. Disabling CDC on the target fixes the problem, but we need CDC enabled during our dual writes (the migration has to happen without downtime).

Is there a way to force these writes?

tarzanek commented 1 year ago

This looks like a CDC bug. There is a way to fix the streams,

see https://github.com/scylladb/scylla/issues/7127

tarzanek commented 1 year ago

The API for that was in https://github.com/scylladb/scylla/issues/6498

But all of this assumes the error comes from Scylla; I am not sure about Cassandra, @carlo4002.

Looking closer, it is more about the migrator preserving timestamps and therefore writing with old timestamps.
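To illustrate why such a write is rejected, here is a minimal Python sketch of the comparison the error message describes. It is a simplified model and not Scylla's actual code: the function name `is_before_generation` is hypothetical, and any tolerance window Scylla may apply around the generation start is ignored. The two timestamps are the ones from the log above.

```python
from datetime import datetime

# Timestamp format used in the error message from the log.
FMT = "%Y/%m/%d %H:%M:%S"


def is_before_generation(write_ts: str, generation_start: str) -> bool:
    """Simplified model (assumption): a write is refused when its
    timestamp is earlier than the start of the current CDC generation."""
    return datetime.strptime(write_ts, FMT) < datetime.strptime(generation_start, FMT)


# Values copied from the error message.
write_ts = "2018/11/21 23:56:33"
generation_start = "2022/07/11 10:15:34"

if is_before_generation(write_ts, generation_start):
    print("write would be refused: timestamp predates the current CDC generation")
else:
    print("write would be accepted")
```

With timestamps preserved, every migrated row carries its original write time, so rows written before the target cluster was created fall on the wrong side of this comparison.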

tarzanek commented 1 year ago

I am also confused by your implementation of dual writes. You only need CDC on the source of the dual writes: you consume its CDC log and write the changes to the target. There is a Kafka CDC consumer to help with this. The target won't need CDC at all, or rather, I don't see why you would need it there.

Also note that another option is to do the dual writes from the application, in which case you won't need CDC anywhere (but a small code change would be needed in the client, of course).
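A rough Python sketch of what application-level dual writes could look like follows. Everything in it is an assumption for illustration: `DualWriter`, `SessionLike` and `RecordingSession` are hypothetical names, and the sessions are stand-ins for whatever driver session objects the application already has. It only handles INSERT statements, since `USING TIMESTAMP` is appended at the end of the statement.

```python
import time
from typing import Any, List, Protocol, Sequence, Tuple


class SessionLike(Protocol):
    """Hypothetical minimal interface: anything with execute(query, parameters)."""

    def execute(self, query: str, parameters: Sequence[Any]) -> Any: ...


class DualWriter:
    """Sends each INSERT to both clusters with the same explicit timestamp."""

    def __init__(self, source: SessionLike, target: SessionLike) -> None:
        self.source = source
        self.target = target
        # Writes that failed on the target, kept so they can be retried later.
        self.failed: List[Tuple[str, Tuple[Any, ...]]] = []

    def insert(self, query: str, parameters: Sequence[Any]) -> int:
        # Client-side timestamp in microseconds, shared by both writes.
        ts = time.time_ns() // 1000
        stamped = f"{query} USING TIMESTAMP {ts}"
        # The source stays the system of record: let its errors propagate.
        self.source.execute(stamped, parameters)
        try:
            self.target.execute(stamped, parameters)
        except Exception:
            self.failed.append((stamped, tuple(parameters)))
        return ts


class RecordingSession:
    """In-memory stand-in for a real session, used only for this example."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Tuple[Any, ...]]] = []

    def execute(self, query: str, parameters: Sequence[Any]) -> None:
        self.calls.append((query, tuple(parameters)))


cassandra, scylla = RecordingSession(), RecordingSession()
writer = DualWriter(cassandra, scylla)
writer.insert("INSERT INTO ks.tbl (id, val) VALUES (%s, %s)", (1, "a"))
print(cassandra.calls == scylla.calls)
```

Using one client-side timestamp for both writes keeps last-write-wins resolution identical on the two clusters, which matters if the migrator later copies older rows with their original timestamps.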

tarzanek commented 1 year ago

So I would just migrate with CDC disabled on the target.

The alternative is, of course, disabling preserveTimestamp in the migrator, but that way you risk overwriting dual-written data!
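The overwrite risk comes from last-write-wins conflict resolution, which the small Python sketch below models. It is an illustration only: the `Cell` class, the `apply_write` helper and the microsecond timestamp values are hypothetical, and ties between equal timestamps are not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    value: str
    timestamp: int  # write timestamp in microseconds


def apply_write(current: Optional[Cell], incoming: Cell) -> Cell:
    """Last-write-wins: keep whichever write has the higher timestamp."""
    if current is None or incoming.timestamp > current.timestamp:
        return incoming
    return current


# A fresh value already written to the target by the dual-writing application.
dual_write = Cell("new value from application", 1_657_700_000_000_000)

# The same row as copied by the migrator, with and without its original timestamp.
migrated_preserved = Cell("old value from source", 1_542_844_593_000_000)
migrated_restamped = Cell("old value from source", 1_657_721_389_000_000)


def final_value(migrated: Cell) -> str:
    """Value left in the target after the migrated copy arrives."""
    return apply_write(dual_write, migrated).value


print(final_value(migrated_preserved))
print(final_value(migrated_restamped))
```

With the original timestamp preserved, the migrated copy loses to the newer dual write. Without it, the migrated copy is stamped at migration time and replaces the newer value with stale data.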

carlo4002 commented 1 year ago

Hello @tarzanek, sorry it took so long to give you feedback on this. I am still working on it, and yes, my workaround for the moment was to disable CDC in Scylla.

The CDC in Scylla isn't for the migration (dual writes) but for some applications that use this database, so not all tables have CDC.

So when I said I need a migration without downtime, I meant that CDC must be enabled on the target for the tables that need it. However, we are going to switch those services over after the first load, so there is no need to have CDC on during that load.