scylladb / scylla-cqlsh


cqlsh fails to work on old distros (undefined symbol: __libc_siglongjmp, version GLIBC_PRIVATE) #95

Closed fruch closed 13 hours ago

fruch commented 1 week ago

Packages

Scylla version: 6.1.0~dev-20240626.3c7af287253e with build-id ea788466880c3bc01475b564b8aad1ac69fa6e43

Kernel Version: 4.18.0-553.5.1.el8_10.x86_64

Issue description

It seems that after #91 was merged into scylla master, many of the artifact tests started failing:

sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!
Command: '/usr/bin/cqlsh --no-color   --request-timeout=120 --connect-timeout=60  -e "desc keyspaces" 10.142.0.121'
Exit code: 1
Stdout:
Stderr:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/opt/scylladb/share/cassandra/bin/../libexec/cqlsh/__main__.py", line 3, in <module>
File "/opt/scylladb/share/cassandra/bin/../libexec/cqlsh/_bootstrap/__init__.py", line 253, in bootstrap
File "/opt/scylladb/share/cassandra/bin/../libexec/cqlsh/_bootstrap/__init__.py", line 81, in import_string
File "/opt/scylladb/share/cassandra/bin/../libexec/cqlsh/_bootstrap/__init__.py", line 87, in import_string
ImportError: module 'cqlsh' has no attribute '__main__'

That is caused by:

< t:2024-06-26 22:38:02,593 f:base.py         l:152  c:RemoteLibSSH2CmdRunner p:DEBUG >   File "/opt/scylladb/share/cassandra/bin/../libexec/cqlsh/_bootstrap/__init__.py", line 76, in import_string
< t:2024-06-26 22:38:02,593 f:base.py         l:152  c:RemoteLibSSH2CmdRunner p:DEBUG >   File "/home/scylla-test/.shiv/cqlsh_142fa711ec80198d9794271f89723bdc5719a9a533a65a105b396960a222f9b1/site-packages/cqlsh/__main__.py", line 3, in <module>
< t:2024-06-26 22:38:02,593 f:base.py         l:152  c:RemoteLibSSH2CmdRunner p:DEBUG >     from cqlsh.cqlsh import main as cqlsh_main
< t:2024-06-26 22:38:02,593 f:base.py         l:152  c:RemoteLibSSH2CmdRunner p:DEBUG >   File "/home/scylla-test/.shiv/cqlsh_142fa711ec80198d9794271f89723bdc5719a9a533a65a105b396960a222f9b1/site-packages/cqlsh/cqlsh.py", line 132, in <module>
< t:2024-06-26 22:38:02,593 f:base.py         l:152  c:RemoteLibSSH2CmdRunner p:DEBUG >     from cassandra.cluster import Cluster, EXEC_PROFILE_DEFAULT, ExecutionProfile
< t:2024-06-26 22:38:02,593 f:base.py         l:152  c:RemoteLibSSH2CmdRunner p:DEBUG > ImportError: /lib64/libpthread.so.0: undefined symbol: __libc_siglongjmp, version GLIBC_PRIVATE

So something escaped what we assumed was a relocatable package.

Impact

cqlsh isn't working in offline installers on older distros

How frequently does it reproduce?

seems it reproduces on multiple jobs

Installation details

Cluster size: 1 nodes (n1-standard-2)

Scylla Nodes used in this run:

OS / Image: https://www.googleapis.com/compute/v1/projects/rocky-linux-cloud/global/images/family/rocky-linux-8 (gce: undefined_region)

Test: artifacts-rocky8-test
Test id: 13af6568-efe4-4101-9e68-7b21e47eb68b
Test name: scylla-master/artifacts-offline-install/artifacts-rocky8-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 13af6568-efe4-4101-9e68-7b21e47eb68b`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=13af6568-efe4-4101-9e68-7b21e47eb68b)
- Show all stored logs command: `$ hydra investigate show-logs 13af6568-efe4-4101-9e68-7b21e47eb68b`

## Logs:

- **db-cluster-13af6568.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/13af6568-efe4-4101-9e68-7b21e47eb68b/20240626_224415/db-cluster-13af6568.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/13af6568-efe4-4101-9e68-7b21e47eb68b/20240626_224415/db-cluster-13af6568.tar.gz)
- **sct-runner-events-13af6568.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/13af6568-efe4-4101-9e68-7b21e47eb68b/20240626_224415/sct-runner-events-13af6568.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/13af6568-efe4-4101-9e68-7b21e47eb68b/20240626_224415/sct-runner-events-13af6568.tar.gz)
- **sct-13af6568.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/13af6568-efe4-4101-9e68-7b21e47eb68b/20240626_224415/sct-13af6568.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/13af6568-efe4-4101-9e68-7b21e47eb68b/20240626_224415/sct-13af6568.log.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/artifacts-offline-install/job/artifacts-rocky8-test/424/)
[Argus](https://argus.scylladb.com/test/f55a98d0-4b3f-41f4-bdea-30462f078e31/runs?additionalRuns[]=13af6568-efe4-4101-9e68-7b21e47eb68b)

mykaul commented 1 week ago

@syuu1228 - can you check what changed?

mykaul commented 1 week ago

@fruch - why is it related to offline? If we update the OS, things work?

fruch commented 1 week ago

> @syuu1228 - can you check what changed?

@mykaul we know what changed, and it's mentioned in the description of the issue.

fruch commented 1 week ago

> @fruch - why is it related to offline? If we update the OS, things work?

It's not about updates. Some dynamic libraries are not patched with patchelf to point to the reloc directory that is used for scylla-python3.

Hence it works on newer distro versions, which ship a newer glibc, and fails on older ones with an older glibc.

As for why it's seen in reloc only, I don't know yet (maybe we didn't get to that phase yet, i.e. no artifacts were built yet).

mykaul commented 1 week ago

> > @syuu1228 - can you check what changed?
>
> @mykaul we know what changed, and it's mentioned in the description of the issue.

I understand it's due to the usage of wheels, OK. So the question is what has 'escaped' the relocatable process. I thought it should not rely on glibc.

fruch commented 1 week ago

> > > @syuu1228 - can you check what changed?
> >
> > @mykaul we know what changed, and it's mentioned in the description of the issue.
>
> I understand it's due to the usage of wheels, OK. So the question is what has 'escaped' the relocatable process. I thought it should not rely on glibc.

The cqlsh package never had code to patch anything. Now its artifacts contain binaries that it didn't have before, and they are unpatched and not part of any relocatable process.

Someone was in a rush to get https://github.com/scylladb/scylladb/pull/19205 in and started pushing for it while the change was not yet completely reviewed or tested.

mykaul commented 1 week ago

It all started with https://github.com/scylladb/scylla-cqlsh/issues/90 , no?

yaronkaikov commented 1 week ago

@fruch Who is handling this?

yaronkaikov commented 1 week ago

I see you are working on it :-),

fruch commented 1 week ago

> I see you are working on it :-),

I might, I'm not gonna rush into this, just cause others were rushing....

yaronkaikov commented 1 week ago

This is breaking master (Unified-deb and Centos-rpm), preventing us from building cloud images and running tests. It seems that https://github.com/scylladb/scylla-cqlsh/pull/91 was merged although it wasn't ready and wasn't tested at all (see @fruch's comment: https://github.com/scylladb/scylladb/pull/19205#issuecomment-2184920604). So we can either try to fix it (not sure how much time it will take, @fruch can probably estimate), or we can (and maybe should) revert https://github.com/scylladb/scylladb/pull/19473 and the follow-up PRs that probably depend on it: https://github.com/scylladb/scylladb/pull/19205, https://github.com/scylladb/scylladb/pull/19528, https://github.com/scylladb/scylladb/pull/19531 and https://github.com/scylladb/scylla-pkg/pull/4162

@mykaul @avikivity @roydahan please advise

roydahan commented 1 week ago

Until @fruch is able to prioritize it, I suggest reverting everything related to it.

yaronkaikov commented 1 week ago

@avikivity @mykaul now this is also affecting enterprise since we merged those changes yesterday from OSS

mykaul commented 1 week ago

> @avikivity @mykaul now this is also affecting enterprise since we merged those changes yesterday from OSS

We should have reverted yesterday.

yaronkaikov commented 6 days ago

@mykaul we can't do the revert, it involves multiple commits in Scylla core. @avikivity can you take care of it? Or should I ask any maintainer to do it?

mykaul commented 6 days ago

@avikivity - please add your thoughts here.

syuu1228 commented 6 days ago

@yaronkaikov @mykaul @fruch I guess the cqlsh version of the relocatable package script does not work correctly for native binaries, maybe because it does not call patchelf. I'm looking into it now.

syuu1228 commented 6 days ago

> @yaronkaikov @mykaul @fruch I guess the cqlsh version of the relocatable package script does not work correctly for native binaries, maybe because it does not call patchelf. I'm looking into it now.

Ah, no, it's not that. It occurs in Python code for some reason.

syuu1228 commented 6 days ago

Okay, I think I understand the problem. In a relocatable package, every native binary should be patched with patchelf to set its "rpath", so that the library path points under /opt/scylladb. But in scylla-cqlsh we use "shiv" to create a self-contained, single-artifact executable, and it contains native binaries that are not patched.

I found that when cqlsh runs, it extracts its dependencies to "/home/vagrant/.shiv/cqlsh_49bb05031b5eaaa3dafd8fc6c2fe1eeee2a9a37c21356bfd43e63523f688ba3e/site-packages" and then loads the native binary "/home/vagrant/.shiv/cqlsh_49bb05031b5eaaa3dafd8fc6c2fe1eeee2a9a37c21356bfd43e63523f688ba3e/site-packages/cassandra/cluster.cpython-312-x86_64-linux-gnu.so". That binary depends on /lib64/libpthread.so.0, which causes the error.

...
openat(AT_FDCWD, "/home/vagrant/.shiv/cqlsh_49bb05031b5eaaa3dafd8fc6c2fe1eeee2a9a37c21356bfd43e63523f688ba3e/site-packages/__pycache__/six.cpython-312.pyc", O_RDONLY|O_CLOEXEC) = 3
...
newfstatat(AT_FDCWD, "/home/vagrant/.shiv/cqlsh_49bb05031b5eaaa3dafd8fc6c2fe1eeee2a9a37c21356bfd43e63523f688ba3e/site-packages/cassandra/cluster.cpython-312-x86_64-linux-gnu.so", {st_mode=S_IFREG|0755, st_size=2282560, ...}, 0) = 0
openat(AT_FDCWD, "/home/vagrant/.shiv/cqlsh_49bb05031b5eaaa3dafd8fc6c2fe1eeee2a9a37c21356bfd43e63523f688ba3e/site-packages/cassandra/cluster.cpython-312-x86_64-linux-gnu.so", O_RDONLY|O_CLOEXEC) = 3
...
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=23563, ...}) = 0
mmap(NULL, 23563, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f6fea445000
close(3) = 0
openat(AT_FDCWD, "/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3


- ldd result on cluster.cpython-312-x86_64-linux-gnu.so:

$ ldd /home/vagrant/.shiv/cqlsh_49bb05031b5eaaa3dafd8fc6c2fe1eeee2a9a37c21356bfd43e63523f688ba3e/site-packages/cassandra/cluster.cpython-312-x86_64-linux-gnu.so
linux-vdso.so.1 => (0x00007ffc1a158000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f78c4fca000)
libc.so.6 => /lib64/libc.so.6 (0x00007f78c4bfc000)
/lib64/ld-linux-x86-64.so.2 (0x00007f78c51e6000)
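
The ldd result can also be checked mechanically. The following is only a rough sketch, not part of any existing script: it parses ldd-style text and reports libraries that resolve outside a given prefix. The function name `libs_outside_prefix` and the default prefix are assumptions made for illustration; the sample input is the ldd output quoted in this issue.

```python
import re

# Sample input: the ldd output quoted in this issue.
LDD_OUTPUT = """\
linux-vdso.so.1 => (0x00007ffc1a158000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f78c4fca000)
libc.so.6 => /lib64/libc.so.6 (0x00007f78c4bfc000)
/lib64/ld-linux-x86-64.so.2 (0x00007f78c51e6000)
"""

# Matches only lines of the form "soname => /absolute/path (0xaddress)".
_RESOLVED = re.compile(r"^\s*(\S+)\s+=>\s+(/\S+)\s+\(0x[0-9a-f]+\)\s*$")


def libs_outside_prefix(ldd_output, prefix="/opt/scylladb"):
    """Return {soname: path} for libraries resolved outside `prefix`.

    `prefix` is an assumed relocatable root, taken from the discussion above.
    """
    found = {}
    for line in ldd_output.splitlines():
        match = _RESOLVED.match(line)
        if match and not match.group(2).startswith(prefix + "/"):
            found[match.group(1)] = match.group(2)
    return found


if __name__ == "__main__":
    for soname, path in libs_outside_prefix(LDD_OUTPUT).items():
        print(f"{soname} resolved from the host: {path}")
```

Lines without a resolved path, such as the vdso and the dynamic loader entries, are skipped on purpose, so only libpthread and libc are reported for the output above.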



To fix this, maybe we need to create patched modules before packing them with shiv, and then pack the patched ones.
Or unpack the generated cqlsh binary, patch the native binaries, then pack it again if that is possible (it will be a bit complicated since it is a "zipapp").
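
A minimal sketch of the first option (patching the modules in a staging directory before shiv packs them) might look like the following. This is not the project's packaging code: the helper names and the rpath value are hypothetical, and the only external tool assumed is `patchelf` with its `--set-rpath` option. It defaults to a dry run that only prints the commands.

```python
import os
import subprocess
import tempfile


def find_shared_objects(root):
    """Yield paths of files under `root` whose names look like shared objects."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(".so") or ".so." in name:
                yield os.path.join(dirpath, name)


def patchelf_commands(root, rpath):
    """Build one `patchelf --set-rpath` command per shared object under `root`."""
    return [["patchelf", "--set-rpath", rpath, path]
            for path in find_shared_objects(root)]


def patch_tree(root, rpath, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in patchelf_commands(root, rpath):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.check_call(cmd)


if __name__ == "__main__":
    # Demo on a throwaway tree holding an empty placeholder file (dry run only).
    with tempfile.TemporaryDirectory() as staging:
        os.makedirs(os.path.join(staging, "cassandra"))
        placeholder = os.path.join(
            staging, "cassandra", "cluster.cpython-312-x86_64-linux-gnu.so")
        open(placeholder, "w").close()
        # HYPOTHETICAL rpath; the real value depends on the relocatable layout.
        patch_tree(staging, "/opt/scylladb/python3/lib64")
```

The patched staging directory would then be handed to shiv. shiv documents a `--site-packages` option for packing an existing directory, which is assumed here to be the natural next step; whether it fits the scylla-cqlsh build is untested.
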
syuu1228 commented 6 days ago

Note that the patchelf command is something like this: https://github.com/scylladb/scylla-python3/blob/master/scripts/create-relocatable-package.py#L97

avikivity commented 6 days ago

Won't it be enough to run cqlsh under the relocatable python3? Which in turn is linked against the same glibc that was used when building the wheels.

fruch commented 6 days ago

> Won't it be enough to run cqlsh under the relocatable python3? Which in turn is linked against the same glibc that was used when building the wheels.

It's already running under scylla-python3, but the files we have in the shiv package weren't patched to point to scylla-python3/lib64, so they fail to load.

In https://github.com/scylladb/scylla-cqlsh/pull/96 I'm trying to remove the driver from the shiv package and use the driver installed in scylla-python3 instead. Together with https://github.com/scylladb/scylla-python3/pull/40, I think it might work.

I'm using this PR to try to confirm that it builds and runs dtest o.k.: https://github.com/scylladb/scylladb/pull/19558

I did confirm it to be working with offline installer on rocky8
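
One way to check which copy of the driver an interpreter would pick up, the one extracted by shiv or the one installed alongside scylla-python3, is sketched below. It is an illustration only: the helper names are made up, and the `.shiv` directory name is taken from the paths in the traceback above.

```python
import importlib.util
from pathlib import Path


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin in (None, "built-in", "frozen"):
        return None
    return Path(spec.origin)


def comes_from_shiv_cache(path):
    """True if `path` lies inside a `.shiv` extraction directory."""
    return ".shiv" in Path(path).parts


if __name__ == "__main__":
    origin = module_origin("cassandra")
    if origin is None:
        print("cassandra driver is not importable from this interpreter")
    elif comes_from_shiv_cache(origin):
        print(f"driver would load from the shiv cache: {origin}")
    else:
        print(f"driver would load from the interpreter's own packages: {origin}")
```

Run under the interpreter that launches cqlsh, this reports where `cassandra` resolves without actually importing the native extension.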