Closed vponomaryov closed 5 months ago
Since it is flaky failure.
Here is another passed
test run with completely the same setting, but without bug reproduction:
Kernel Version: 5.15.0-1040-aws
Scylla version (or git commit hash): 5.2.6-20230730.58acf071bf28
with build-id 17961be569f8503b27ff284a8de1e00a9d83811e
Cluster size: 3 nodes (i3.large)
Scylla Nodes used in this run:
OS / Image: ami-0f1f419420235978b
(aws: us-east-1)
Test: vp-gemini-3h-with-nemesis-test
Test id: 04811d3b-6563-4cb7-9336-e454663457ce
Test name: scylla-staging/valerii/vp-gemini-3h-with-nemesis-test
Test config file(s):
The master
scylla branch is also affected
:
> Executing CQL '<SimpleStatement query="SELECT * FROM ks1.table1", consistency=Not Set>' ...
> Error decoding response from Cassandra. ver(3); flags(0000); stream(2); op(8); offset(9); len(151051); buffer: b'\x83 ... <huge piece of data> ... \x04:J\x96R'
> Traceback (most recent call last):
> File "cassandra/obj_parser.pyx", line 72, in cassandra.obj_parser.TupleRowParser.unpack_row
> File "cassandra/deserializers.pxd", line 41, in cassandra.deserializers.from_binary
> File "cassandra/deserializers.pyx", line 143, in cassandra.deserializers.DesDateType.deserialize
> File "cassandra/cython_utils.pyx", line 62, in cassandra.cython_utils.datetime_from_timestamp
> File ".eggs/Cython-0.29.32-py3.10-linux-x86_64.egg/Cython/Includes/cpython/datetime.pxd", line 125, in cpython.datetime.timedelta_new
> OverflowError: days=2147458792; must have magnitude <= 999999999
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
> File "cassandra/row_parser.pyx", line 38, in cassandra.row_parser.make_recv_results_rows.recv_results_rows
> File "cassandra/obj_parser.pyx", line 31, in cassandra.obj_parser.ListParser.parse_rows
> File "cassandra/obj_parser.pyx", line 74, in cassandra.obj_parser.TupleRowParser.unpack_row
> cassandra.DriverException: Failed decoding result column "ck1" of type timestamp: days=2147458792; must have magnitude <= 999999999
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
> File "cassandra/obj_parser.pyx", line 72, in cassandra.obj_parser.TupleRowParser.unpack_row
> File "cassandra/deserializers.pxd", line 41, in cassandra.deserializers.from_binary
> File "cassandra/deserializers.pyx", line 143, in cassandra.deserializers.DesDateType.deserialize
> File "cassandra/cython_utils.pyx", line 62, in cassandra.cython_utils.datetime_from_timestamp
> File ".eggs/Cython-0.29.32-py3.10-linux-x86_64.egg/Cython/Includes/cpython/datetime.pxd", line 125, in cpython.datetime.timedelta_new
> OverflowError: days=2147458792; must have magnitude <= 999999999
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
> File "cassandra/connection.py", line 1266, in cassandra.connection.Connection.process_msg
> File "cassandra/protocol.py", line 1196, in cassandra.protocol._ProtocolHandler.decode_message
> File "cassandra/protocol.py", line 744, in cassandra.protocol.ResultMessage.recv_body
> File "cassandra/protocol.py", line 730, in cassandra.protocol.ResultMessage.recv
> File "cassandra/row_parser.pyx", line 46, in cassandra.row_parser.make_recv_results_rows.recv_results_rows
> File "cassandra/obj_parser.pyx", line 57, in cassandra.obj_parser.TupleRowParser.unpack_row
> File "cassandra/obj_parser.pyx", line 74, in cassandra.obj_parser.TupleRowParser.unpack_row
> cassandra.DriverException: Failed decoding result column "ck1" of type timestamp: days=2147458792; must have magnitude <= 999999999
Kernel Version: 5.15.0-1039-aws
Scylla version (or git commit hash): 5.4.0~dev-20230702.1ab2bb69b8a6
with build-id 748a09783d18c97d903f62dcf4d96a6d9db4b527
Cluster size: 3 nodes (i3.large)
Scylla Nodes used in this run:
OS / Image: ami-027d7b085be72f13c
(aws: us-east-1)
Test: vp-gemini-3h-with-nemesis-test
Test id: bf42fd21-7f82-45e9-9424-ec362a75e473
Test name: scylla-staging/valerii/vp-gemini-3h-with-nemesis-test
Test config file(s):
Looks very similar to the following closed bug: https://github.com/scylladb/scylladb/issues/12239
Looks like a duplicate of https://github.com/scylladb/scylladb/issues/14645
It was fixed in Gemini, and not sure if it's in 1.8.6 version
It seems this issue is about the "timestamp", the other was about "time".
@eliransin please assign it, adding it as a blocker until we will have some diagnostic from your team.
@vponomaryov - can you try with gemini pre 1.8.4, to see if it happens? I'm quite sure it's a change in Gemini. It may or may not be a bug that we need to fix in Scylla side, but I'd like to ensure it's not a regression, first of all. Alternatively, test with an older release of Scylla, to prove it's not a recent regression.
@vponomaryov - can you try with gemini pre 1.8.4, to see if it happens? I'm quite sure it's a change in Gemini. It may or may not be a bug that we need to fix in Scylla side, but I'd like to ensure it's not a regression, first of all. Alternatively, test with an older release of Scylla, to prove it's not a recent regression.
Yes, this bug was caught using the v1.8.6
gemini version only.
Before, we were using 1.7.8
version of the gemini. It became unstable in a different way, probably, due to new versions of other tools including docker
.
Either Scylla or python driver have bug: Scylla allows to write values which driver complains about. Gemini should have failed with impossibility to write data, but it passed.
From the try 1.8.4- gemini version
point of view I can say that I had successful run using 1.7.8
and lots of failures with another kind of an error.
Also, tried 1.8.2
, failed with one another error, not this one.
To summarize:
not blocker
from my point of view, but it is not only gemini
s bug, gemini only highlights it.@cvybhu is this in your plans for this Q?
@cvybhu is this in your plans for this Q?
No, I haven't seen this before.
To me this looks like a python driver bug.
The rust driver allows to insert and select a timestamp equal to i64::MAX
(2^63 - 1
) and i64::MIN
.
use scylla::{Session, SessionBuilder};
#[tokio::main]
async fn main() {
let session: Session = SessionBuilder::new()
.known_node("127.0.0.1:9042")
.build()
.await
.unwrap();
session.query("CREATE KEYSPACE IF NOT EXISTS ks WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 1}", &[]).await.unwrap();
session
.query(
"CREATE TABLE IF NOT EXISTS ks.t (p timestamp PRIMARY KEY)",
&[],
)
.await
.unwrap();
session
.query("INSERT INTO ks.t (p) VALUES (?)", (i64::MAX,))
.await
.unwrap();
session
.query("INSERT INTO ks.t (p) VALUES (?)", (i64::MIN,))
.await
.unwrap();
let ps: Vec<i64> = session
.query("SELECT p FROM ks.t", &[])
.await
.unwrap()
.rows_typed::<(i64,)>()
.unwrap()
.map(|parse_res| parse_res.unwrap())
.map(|p: (i64,)| p.0)
.collect();
for p in ps {
println!("p: {}", p)
}
}
# Output:
p: 9223372036854775807
p: -9223372036854775808
OTOH cqlsh
(which uses the python driver underneath) reports an overflow:
[cqlsh 6.0.9 | Scylla 5.2.3-0.20230608.ea08d409f155 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> SELECT p FROM ks.t;
/home/jancio/.local/lib/python3.11/site-packages/cqlsh/cqlsh.py:403: DateOverFlowWarning: Some timestamps are larger than Python datetime can represent. Timestamps are displayed in milliseconds from epoch.
p
----------------------
9223372036854775807
-9223372036854775808
(2 rows)
The CQL spec defines timestamp
as an 8 byte signed integer, so IMO both of these values should be legal:
https://github.com/apache/cassandra/blob/7750a9430bf8cec73c25e302642a3b0b77a78703/doc/native_protocol_v4.spec#L946-L950
To me it doesn't look like a Scylla bug. Scylla correctly writes and reads timestamp values, it's the driver that complains about "out of range values", so the bug is in the driver.
/cc @avelanarius
To me it doesn't look like a Scylla bug. Scylla correctly writes and reads timestamp values, it's the driver that complains about "out of range values", so the bug is in the driver.
/cc @avelanarius
It's not the first time we are running into such a warning, since in some cases it surfaced a real scylla issue (like using the wrong clock), I would argue we shouldn't remove the warning from cqlsh/python-driver
In this case I do agree it's not scylla core fault and caused by a change in scylla-bench, one if those were already fixed in scylla-bench,
if a user puts a timestamp of 292471209 years into the future, I think he would like to know it...
The same problem was reported in Cassandra eight years ago - see https://issues.apache.org/jira/browse/CASSANDRA-10625. They fixed cqlsh (see https://github.com/apache/cassandra/compare/trunk...pauloricardomg:trunk-10625) to print the warning that @cvybhu saw above instead of failing, but didn't fix the driver. Maybe there are additional cases (not just cqlsh...) that need to be fixed?
But this issue should probably be reopened in the driver repository, not here. I think Scylla's handling of the "timestamp" column type isn't the problem here.
The same problem was reported in Cassandra eight years ago - see https://issues.apache.org/jira/browse/CASSANDRA-10625. They fixed cqlsh (see apache/cassandra@trunk...pauloricardomg:trunk-10625) to print the warning that @cvybhu saw above instead of failing, but didn't fix the driver. Maybe there are additional cases (not just cqlsh...) that need to be fixed?
But this issue should probably be reopened in the driver repository, not here. I think Scylla's handling of the "timestamp" column type isn't the problem here.
I agree it's should be move to python-driver (and might be a problem also on other drivers that convert to date objects) also we probably would need to implement same workaround as in https://github.com/apache/cassandra/commit/70c8a53de4881a3ccbcf5df7a68f44a57b103f12, in SCT if not on the driver level, cause this is where we see the issue.
This is a real bug in the Python driver, that has been known for nine years (since https://issues.apache.org/jira/browse/CASSANDRA-10625) and unfortunately instead of fixing it we ended up hacking cqlsh and now gemini to work around it. But it just needs to be fixed.
Basically, if d
is a date
column and I do
INSERT INTO {table1} (p, d) VALUES ({p}, {2**30})
SELECT totimestamp(d) FROM {table1} WHERE p={p}
The timestamp returned by Scylla for totimestamp(d) is valid, but the Scylla driver doesn't agree to accept it from the server (!), saying:
cassandra.DriverException: Failed decoding result column "system.totimestamp(d)" of type timestamp: days=-1073741824; must have magnitude <= 999999999
(note that the date 2**30
means 2^30
days before the epoch (since the epoch is taken to be day 2^31
) , and this is exactly that number -1073741824 which the driver doesn't like, although it is a perfectly valid (albeit non-useful) timestamp.
Unfortunately, it a timestamp of more than 999999999 days isn't the only problem, even lower-magnitude timestamps cause problems. I tried in the above example to set d to 2**31 - 2**29
, i.e, "just" 2**29 days before the epoch, -536870912 days. So it's not more than 999999999 and we don't reach that error, but the driver still fails:
cassandra/cluster.py:2726: in cassandra.cluster.Session.execute
???
cassandra/cluster.py:5085: in cassandra.cluster.ResponseFuture.result
???
cassandra/connection.py:1267: in cassandra.connection.Connection.process_msg
???
cassandra/protocol.py:1212: in cassandra.protocol._ProtocolHandler.decode_messag
e
???
cassandra/protocol.py:760: in cassandra.protocol.ResultMessage.recv_body
???
cassandra/protocol.py:746: in cassandra.protocol.ResultMessage.recv
???
cassandra/row_parser.pyx:46: in cassandra.row_parser.make_recv_results_rows.recv_results_rows
???
cassandra/obj_parser.pyx:57: in cassandra.obj_parser.TupleRowParser.unpack_row
???
> ???
E cassandra.DriverException: Failed decoding result column "system.totimestamp(d)" of type timestamp: date value out of range
from:
> ???
cassandra/obj_parser.pyx:72:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
cassandra/deserializers.pxd:41:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
cassandra/deserializers.pyx:143:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E OverflowError: date value out of range
cassandra/cython_utils.pyx:62: OverflowError
I believe these errors are present because we overflow python datetime.[MIN|MAX]
. For datetime.date
we have a wrapper to address that https://github.com/scylladb/python-driver/commit/4f3c77c970e10774243e7ad7baed2d41c06c4790 now we need something similar for datetime.datetime
. I am actively working on it.
The fix proposed in https://github.com/scylladb/python-driver/pull/310 is a breaking change that may break user code - which we should not do. We could do a major-version bump but that would make syncing with upstream even more problematic. If we were to do a breaking release at some point I think we should include more changes there, not only this one. PR was proposed to upstream, we'll see what they have to say about it - if they merge it then we'll receive it at some point when merging upstream commits.
Given that the issue is not a big problem, unlikely to be an issue for any client (as it is just an unlikely edge-case) and fixing it creates other problems I think we can let upstream deal with it and close the PR and issue (as wontfix) unless someone has any objections.
We are not merging the fix in #310 to not break backwards compatibility, but if someone needs datetimes outside datetime.[MIN|MAX]YEAR feel free to cherry-pick.
Issue description
Running the
rebuild_streaming_err
nemesis in thegemini
CI job usinggemini-v1.8.6
, where we walk through the test tables and check for data presence, we get following error:Impact
Describe the impact this issue causes to the user.
How frequently does it reproduce?
Happens in about
50% cases
usinggemini-v1.8.6
in the gemini CI job if no data mismatches get caught.Installation details
Kernel Version: 5.15.0-1040-aws Scylla version (or git commit hash):
5.2.6-20230730.58acf071bf28
with build-id17961be569f8503b27ff284a8de1e00a9d83811e
Cluster size: 3 nodes (i3.large)
Scylla Nodes used in this run:
OS / Image:
ami-0f1f419420235978b
(aws: us-east-1)Test:
vp-gemini-3h-with-nemesis-test
Test id:985fe5dc-f634-47d6-b517-814916df7a5e
Test name:scylla-staging/valerii/vp-gemini-3h-with-nemesis-test
Test config file(s):Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor 985fe5dc-f634-47d6-b517-814916df7a5e` - Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=985fe5dc-f634-47d6-b517-814916df7a5e) - Show all stored logs command: `$ hydra investigate show-logs 985fe5dc-f634-47d6-b517-814916df7a5e` ## Logs: - **db-cluster-985fe5dc.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/985fe5dc-f634-47d6-b517-814916df7a5e/20230804_014205/db-cluster-985fe5dc.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/985fe5dc-f634-47d6-b517-814916df7a5e/20230804_014205/db-cluster-985fe5dc.tar.gz) - **sct-runner-events-985fe5dc.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/985fe5dc-f634-47d6-b517-814916df7a5e/20230804_014205/sct-runner-events-985fe5dc.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/985fe5dc-f634-47d6-b517-814916df7a5e/20230804_014205/sct-runner-events-985fe5dc.tar.gz) - **sct-985fe5dc.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/985fe5dc-f634-47d6-b517-814916df7a5e/20230804_014205/sct-985fe5dc.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/985fe5dc-f634-47d6-b517-814916df7a5e/20230804_014205/sct-985fe5dc.log.tar.gz) - **monitor-set-985fe5dc.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/985fe5dc-f634-47d6-b517-814916df7a5e/20230804_014205/monitor-set-985fe5dc.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/985fe5dc-f634-47d6-b517-814916df7a5e/20230804_014205/monitor-set-985fe5dc.tar.gz) - **loader-set-985fe5dc.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/985fe5dc-f634-47d6-b517-814916df7a5e/20230804_014205/loader-set-985fe5dc.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/985fe5dc-f634-47d6-b517-814916df7a5e/20230804_014205/loader-set-985fe5dc.tar.gz) - **parallel-timelines-report-985fe5dc.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/985fe5dc-f634-47d6-b517-814916df7a5e/20230804_014205/parallel-timelines-report-985fe5dc.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/985fe5dc-f634-47d6-b517-814916df7a5e/20230804_014205/parallel-timelines-report-985fe5dc.tar.gz) [Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/valerii/job/vp-gemini-3h-with-nemesis-test/15/) [Argus](https://argus.scylladb.com/test/def77ee5-c8b1-41b4-b912-17ce86aca11e/runs?additionalRuns[]=985fe5dc-f634-47d6-b517-814916df7a5e)