@syuu1228 - can you check what changed?
@fruch - why is it related to offline? If we update the OS, things work?
> @syuu1228 - can you check what changed?

@mykaul we know what changed, and it's mentioned in the description of the issue.

> @fruch - why is it related to offline? If we update the OS, things work?
It's not about updates. Some dynamic libraries are not patched with patchelf to point at the reloc directory used for scylla-python3, hence it works on newer distro versions (i.e. newer glibc) but not on older glibc.

As for why it's seen in reloc only, I don't know yet (maybe we just didn't get to that phase, i.e. no artifacts were built yet).
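A quick way to see the glibc coupling (a diagnostic sketch; the module path here is illustrative) is to list the versioned glibc symbols the wheel's native module requires; if the highest version listed is newer than the target distro's glibc, the module won't load there:

```
$ objdump -T site-packages/cassandra/cluster.cpython-312-x86_64-linux-gnu.so | grep -o 'GLIBC_[0-9.]*' | sort -uV
```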
> > @syuu1228 - can you check what changed?
>
> @mykaul we know what changed, and it's mentioned in the description of the issue.

I understand it's due to the usage of wheels. OK - so what has 'escaped' the relocatable process is the ask. I thought it should not rely on glibc.
The cqlsh package never had code to patch anything, so now its artifacts have binaries they didn't have before, and those are unpatched and not part of any relocatable process.
Someone was in a rush to get https://github.com/scylladb/scylladb/pull/19205 in, and started pushing for it before the change was completely reviewed or fully tested.
It all started with https://github.com/scylladb/scylla-cqlsh/issues/90 , no?
@fruch Who is handling this?
I see you are working on it :-)

> I see you are working on it :-)
I might. I'm not gonna rush into this just because others were rushing....
This is breaking master (Unified-deb and Centos-rpm), preventing us from building cloud images and running testing. It seems that https://github.com/scylladb/scylla-cqlsh/pull/91 was merged although it wasn't ready and wasn't tested at all (see @fruch's comment - https://github.com/scylladb/scylladb/pull/19205#issuecomment-2184920604).

So we can either try to fix it (not sure how much time that will take; @fruch can probably estimate), or we can (and maybe should) revert https://github.com/scylladb/scylladb/pull/19473 and the following PRs, which probably depend on it:

- https://github.com/scylladb/scylladb/pull/19205
- https://github.com/scylladb/scylladb/pull/19528
- https://github.com/scylladb/scylladb/pull/19531
- https://github.com/scylladb/scylla-pkg/pull/4162
@mykaul @avikivity @roydahan please advise.
Until @fruch is able to prioritize it, I suggest reverting everything related to it.
@avikivity @mykaul now this is also affecting enterprise
since we merged those changes yesterday from OSS
> @avikivity @mykaul now this is also affecting enterprise
> since we merged those changes yesterday from OSS
We should have reverted yesterday.
@mykaul we can't do the revert; it involves multiple commits in Scylla core. @avikivity Can you take care of it? Or should I ask a maintainer to do it?
@avikivity - please add your thoughts here.
@yaronkaikov @mykaul @fruch I guess the cqlsh version of the relocatable package script does not work correctly for native binaries, maybe because it does not call patchelf. I'm looking into it now.
Ah, no it's not; it occurs in Python code for some reason.
Okay, I think I understand the problem. In a relocatable package, every native binary should be patched with patchelf to set its "rpath", changing the library path to point under /opt/scylladb. But in scylla-cqlsh we use "shiv" to create a self-contained, single-artifact executable, and it contains native binaries which are not patched.
I found that when cqlsh runs, it extracts its dependencies to "/home/vagrant/.shiv/cqlsh_49bb05031b5eaaa3dafd8fc6c2fe1eeee2a9a37c21356bfd43e63523f688ba3e/site-packages", and then loads the native binary "/home/vagrant/.shiv/cqlsh_49bb05031b5eaaa3dafd8fc6c2fe1eeee2a9a37c21356bfd43e63523f688ba3e/site-packages/cassandra/cluster.cpython-312-x86_64-linux-gnu.so", which depends on /lib64/libpthread.so.0, so it causes the error.
```
newfstatat(AT_FDCWD, "/home/vagrant/.shiv/cqlsh_49bb05031b5eaaa3dafd8fc6c2fe1eeee2a9a37c21356bfd43e63523f688ba3e/site-packages", {st_mode=S_IFDIR|0775, st_size=4096, ...}, 0) = 0
...
openat(AT_FDCWD, "/home/vagrant/.shiv/cqlsh_49bb05031b5eaaa3dafd8fc6c2fe1eeee2a9a37c21356bfd43e63523f688ba3e/site-packages/__pycache__/six.cpython-312.pyc", O_RDONLY|O_CLOEXEC) = 3
...
newfstatat(AT_FDCWD, "/home/vagrant/.shiv/cqlsh_49bb05031b5eaaa3dafd8fc6c2fe1eeee2a9a37c21356bfd43e63523f688ba3e/site-packages/cassandra/cluster.cpython-312-x86_64-linux-gnu.so", {st_mode=S_IFREG|0755, st_size=2282560, ...}, 0) = 0
openat(AT_FDCWD, "/home/vagrant/.shiv/cqlsh_49bb05031b5eaaa3dafd8fc6c2fe1eeee2a9a37c21356bfd43e63523f688ba3e/site-packages/cassandra/cluster.cpython-312-x86_64-linux-gnu.so", O_RDONLY|O_CLOEXEC) = 3
...
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=23563, ...}) = 0
mmap(NULL, 23563, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f6fea445000
close(3) = 0
openat(AT_FDCWD, "/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
```
- ldd result on cluster.cpython-312-x86_64-linux-gnu.so:

```
$ ldd /home/vagrant/.shiv/cqlsh_49bb05031b5eaaa3dafd8fc6c2fe1eeee2a9a37c21356bfd43e63523f688ba3e/site-packages/cassandra/cluster.cpython-312-x86_64-linux-gnu.so
	linux-vdso.so.1 => (0x00007ffc1a158000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f78c4fca000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f78c4bfc000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f78c51e6000)
```
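To confirm that the module carries no rpath at all, one can ask patchelf directly; empty output means nothing was set, whereas a properly relocated binary would print a path under the relocatable tree:

```
$ patchelf --print-rpath /home/vagrant/.shiv/cqlsh_49bb05031b5eaaa3dafd8fc6c2fe1eeee2a9a37c21356bfd43e63523f688ba3e/site-packages/cassandra/cluster.cpython-312-x86_64-linux-gnu.so
```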
To fix this, maybe we need to create the modified modules before packing them with shiv, and pack the modified ones with shiv.

Or unpack the generated cqlsh binary, modify the binaries, then pack it again, if that's possible (it will be a bit complicated since it is a "zipapp").
Note that the patchelf command is something like this: https://github.com/scylladb/scylla-python3/blob/master/scripts/create-relocatable-package.py#L97
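A minimal sketch of the first approach, assuming a hypothetical staging directory and rpath (the real values should follow the linked script):

```
# Hypothetical staging dir and rpath; the actual values belong to the build scripts.
find build/site-packages -name '*.so*' -type f \
    -exec patchelf --set-rpath /opt/scylladb/python3/lib64 {} \;
# Then let shiv pack the already-patched tree.
```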
Won't it be enough to run cqlsh under the relocatable python3? Which in turn is linked against the same glibc that was used when building the wheels.
> Won't it be enough to run cqlsh under the relocatable python3? Which in turn is linked against the same glibc that was used when building the wheels.
It's already running under scylla-python3, but the files we have in the shiv package weren't patched to point to scylla-python3/lib64, so they fail to load.
In https://github.com/scylladb/scylla-cqlsh/pull/96 I'm trying to remove the driver from the shiv package and use the driver installed in scylla-python3 instead (together with https://github.com/scylladb/scylla-python3/pull/40); I think it might work.

I'm using this PR to confirm that it builds and runs dtest OK: https://github.com/scylladb/scylladb/pull/19558
I did confirm it works with the offline installer on Rocky 8.
## Packages

Scylla version: `6.1.0~dev-20240626.3c7af287253e` with build-id `a788466880c3bc01475b564b8aad1ac69fa6e43`

Kernel Version: `4.18.0-553.5.1.el8_10.x86_64`
## Issue description

Seems like after the merge of #91 into scylla master, lots of the artifacts started failing:

which is caused by:

so something escaped what we assumed was a relocatable package.
## Impact

cqlsh isn't working in offline installers on older distros.
## How frequently does it reproduce?

Seems it reproduces on multiple jobs.
## Installation details

Cluster size: 1 node (n1-standard-2)
Scylla Nodes used in this run:
OS / Image: https://www.googleapis.com/compute/v1/projects/rocky-linux-cloud/global/images/family/rocky-linux-8 (gce: undefined_region)
Test: artifacts-rocky8-test
Test id: 13af6568-efe4-4101-9e68-7b21e47eb68b
Test name: scylla-master/artifacts-offline-install/artifacts-rocky8-test
Test config file(s):

## Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor 13af6568-efe4-4101-9e68-7b21e47eb68b`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=13af6568-efe4-4101-9e68-7b21e47eb68b)
- Show all stored logs command: `$ hydra investigate show-logs 13af6568-efe4-4101-9e68-7b21e47eb68b`

## Logs:

- **db-cluster-13af6568.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/13af6568-efe4-4101-9e68-7b21e47eb68b/20240626_224415/db-cluster-13af6568.tar.gz
- **sct-runner-events-13af6568.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/13af6568-efe4-4101-9e68-7b21e47eb68b/20240626_224415/sct-runner-events-13af6568.tar.gz
- **sct-13af6568.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/13af6568-efe4-4101-9e68-7b21e47eb68b/20240626_224415/sct-13af6568.log.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/artifacts-offline-install/job/artifacts-rocky8-test/424/)
[Argus](https://argus.scylladb.com/test/f55a98d0-4b3f-41f4-bdea-30462f078e31/runs?additionalRuns[]=13af6568-efe4-4101-9e68-7b21e47eb68b)