gluster / glusterfs

Gluster Filesystem: Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Reg recent CI failures on s390x builder #3655

Closed kun-lu20 closed 1 year ago

kun-lu20 commented 1 year ago

Description of problem:

Hi,

We noticed that recent CI builds are failing on the s390x builder (https://build.gluster.org/job/centos8-s390-regression/6/consoleFull). It took more than 7 hours to build and run the regression test suite, but the job aborted after only 68/832 test cases had been run.

We also found the following errors in the build log:

10:19:59 losetup: /d/dev/loop*: failed to use device: No such device
17:41:09 + rm -f /var/run/glusterd.socket
17:41:09 + rm -rf /var/run/gluster/
17:41:09 + pkill /opt/qa/regression.sh
17:41:09 FATAL: command execution failed
17:41:09 Exception when executing the batch command : no workspace from node hudson.slaves.DumbSlave[builder-el8-s390x-1.ibm-l1.gluster.org] which is computer hudson.slaves.SlaveComputer@adfae62 and has channel null
17:41:09 Build step 'Post build task' marked build as failure
17:41:09 ERROR: Step ‘Archive the artifacts’ failed: no workspace for centos8-s390-regression #6
17:41:09 ERROR: builder-el8-s390x-1.ibm-l1.gluster.org is offline; cannot locate java-1.6.0-openjdk-1.6.0.0.x86_64

Can any developers or admins from the community take a look at this issue? Thank you very much!

kun-lu20 commented 1 year ago

@mscherer

kun-lu20 commented 1 year ago

Hi @amarts , could you please take a look at this issue? Thank you very much!

amarts commented 1 year ago

@Shwetha-Acharya @rakshithakamath94 @mscherer is this something you can help with? A resolution would help a lot. Thanks.

mscherer commented 1 year ago

I do not know the code well enough for that; I was hoping to have a dev look at it, at least to give a hint on where to look.

amarts commented 1 year ago

I thought the issue happened because of the error below?

ERROR: builder-el8-s390x-1.ibm-l1.gluster.org is offline; cannot locate java-1.6.0-openjdk-1.6.0.0.x86_64

The error in the previous snippet,

10:19:59 losetup: /d/dev/loop*: failed to use device: No such device

is not an issue to be bothered about, as this happens in local setups too.

mscherer commented 1 year ago

My understanding is that this step happens after the build failed, and it fails because Jenkins reboots the builder as part of the build (since when tests fail, they tend to also mess with the builder, so we usually reboot them).

See https://github.com/gluster/build-jobs/blob/main/build-gluster-org/scripts/regression-abort.sh#L13

So what happens is that the regression test fails, so we clean up the server and reboot. The error message about the JDK means that Jenkins lost contact with the builder, and is expected.
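
In other words, a sketch based on the commands visible in the console excerpt above (not the exact contents of regression-abort.sh):

# clean up after a failed/aborted regression run
rm -f /var/run/glusterd.socket
rm -rf /var/run/gluster/
pkill -f regression.sh   # stop the test harness
reboot                   # Jenkins loses the agent connection here, hence the JDK/offline errors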

mscherer commented 1 year ago

Looking at the code, it fails because it reaches a timeout:

23:41:02 Build timed out (after 450 minutes). Marking the build as aborted.

But given that it only ran 62 tests out of 832, there is a massive slowdown somewhere, and it can come either from the code (e.g., something slow on s390x specifically) or from the infra (e.g., we are running that on a cluster of floppy disks, or more likely, on some oversubscribed storage).

I can try to extend the timeout to see how long it takes to run, but if it takes 3 to 4 days, it needs to be investigated.

Do we have some benchmark or something to look at (e.g., some perf test for gluster that would help see the issue)?

I can run some CPU ones and some I/O ones, but that's all.
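
For reference, a rough sketch of the kind of quick checks meant here (the /d path comes from the test logs above; sizes are arbitrary, and oflag=direct assumes the filesystem supports direct I/O):

# CPU: single-core hashing throughput
openssl speed -seconds 5 sha256
# I/O: sequential write and read on the test filesystem, bypassing the page cache
dd if=/dev/zero of=/d/ddtest bs=1M count=1024 oflag=direct conv=fsync
dd if=/d/ddtest of=/dev/null bs=1M iflag=direct
rm -f /d/ddtest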

amarts commented 1 year ago

Good observation. Let's increase the timeout to 3-4 days for sure. And starting with CPU and I/O observations is good.

@kun-lu20 Any suggestions on this? Is it possible to check for the reasons for the delay here? None of us has an s390x system to run tests on.

Shwetha-Acharya commented 1 year ago

I can see that volume start was failing at the commit phase in every test.

19:51:46 ok 5 [ 34/ 848] < 33> 'gluster --mode=script --wignore --glusterd-sock=/d/backends/1/glusterd/gd.sock --log-file=/var/log/glusterfs/basic_distribute_rebal-all-nodes-migrate.t_cli1.log volume create patchy 127.1.1.1:/d/backends/1/dist1 127.1.1.1:/d/backends/1/dist2 127.1.1.2:/d/backends/2/dist3 127.1.1.2:/d/backends/2/dist4'
19:51:46 volume start: patchy: failed: Commit failed on localhost. Please check log file for details.

Also, no logs or build artifacts were found in https://build.gluster.org/job/centos8-s390-regression/6/ to help understand more about the failure. IMO glusterd logs and console output should give us some clue.

My suggestion is to execute a simple gluster volume create and start command manually on the regression machine and find out what is actually happening. If we ever have other runs where logs are present, that can also help us. @mscherer @amarts @rakshithakamath94
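
As a minimal sketch of such a manual check (volume name and brick path are placeholders; force is needed because the brick sits on an existing test partition, and glusterd is assumed to be running):

mkdir -p /d/backends/testbrick
gluster --mode=script volume create testvol $(hostname):/d/backends/testbrick force
gluster --mode=script volume start testvol
gluster volume status testvol
# on a commit failure, the details usually end up in glusterd's log
grep " E " /var/log/glusterfs/glusterd.log | tail -20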

amarts commented 1 year ago

My suggestion is to execute a simple gluster volume create and start command manually on the regression machine and find out what is actually happening. If we ever have other runs where logs are present, that can also help us.

It would be great if at least one test were run locally on those machines (if possible), so we can get the reason for this.

Shwetha-Acharya commented 1 year ago

@mscherer or @rakshithakamath94 might be able to run the test locally

mscherer commented 1 year ago

For now, I will increase the timeout to 4 days, and see how it goes.

kun-lu20 commented 1 year ago

Hi @amarts ,

Thank you so much for looking into this issue.

We also tried to run the regression tests from devel branch on local s390x VMs (8 vCPU, 16g vMEM, 250g Storage). All 831 test cases could be completed within 10 hours and only 10 test cases failed.

Currently there are 3 s390x builders in the Gluster builder pool:

I think choosing the one with a larger quota to run the regression CI might be helpful.

@Shwetha-Acharya gluster volume create and start commands worked on our local s390x VMs.

kun-lu20 commented 1 year ago

Hi @amarts @mscherer ,

I just checked the CI status and found that the whole regression test suite could complete after increasing the timeout, but it took nearly 2 days and many tests failed.

I also checked the job scripts centos8-s390.yml and regression-test-burn-in.sh, but it looks like the invoked scripts /opt/qa/build.sh and /opt/qa/regression.sh are not in this repo.

In our local setup for running the GlusterFS regression tests on s390x, we need to install certain deps via yum and build DBENCH and thin-provisioning-tools from source on RHEL (see step 2.1 in Building GlusterFS on s390x). I wonder if this is part of the reason for the test failures in CI.
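
As an illustration only (package names and build steps are assumptions based on a typical EL8 setup, not taken from the step 2.1 doc), the kind of preparation meant here looks like:

# test-suite dependencies typically pulled in via yum/dnf
dnf install -y git autoconf automake libtool make gcc python3 attr rsync nfs-utils
# dbench built from source when no s390x package is available
git clone https://github.com/sahlberg/dbench.git && cd dbench
./autogen.sh && ./configure && make && make install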

I think running the same CI scripts in our local environment might help to locate the root cause. Could you please show me where I can find the above-mentioned missing scripts so that I can run them on my end?

Thank you very much!

mscherer commented 1 year ago

They come from https://github.com/gluster/glusterfs-patch-acceptance-tests

They're installed by our Ansible playbooks; just a checkout is enough.
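
So, assuming the /opt/qa layout seen in the console logs above (the path comes from the logs, not from the playbooks), something like this should reproduce the CI setup:

git clone https://github.com/gluster/glusterfs-patch-acceptance-tests.git /opt/qa
# the CI job scripts then invoke these two (exact arguments, if any, come from the build-jobs repo)
/opt/qa/build.sh
/opt/qa/regression.sh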

kun-lu20 commented 1 year ago

Cool. I'll try it. Thanks @mscherer !

mscherer commented 1 year ago

And the rest of the Ansible playbooks are on https://github.com/gluster/gluster.org_ansible_configuration (if you need the rpms to install, that's somewhere in roles/jenkins_builder, along with a bunch of conditionals for distros, etc.).

kun-lu20 commented 1 year ago

Hi all,

I've followed the steps in the build.sh and regression.sh scripts from https://github.com/gluster/glusterfs-patch-acceptance-tests to run the regression tests on a local s390x VM (similar quota to our s390x builder), but I haven't observed the volume start failure on our end.

Below is the result of one of the test cases:

======================================== (2 / 833) ========================================
[21:40:29] Running tests in file ./tests/000-flaky/basic_changelog_changelog-snapshot.t
./tests/000-flaky/basic_changelog_changelog-snapshot.t ..
1..15
losetup: /d/dev/loop*: failed to use device: No such device
losetup: /d/dev/loop*: failed to use device: No such device
ok   1 [   2638/     51] <   9> 'verify_lvm_version'
ok   2 [     36/   2050] <  10> 'glusterd'
ok   3 [     17/      8] <  11> 'pidof glusterd'
ok   4 [     30/   2372] <  13> 'setup_lvm 1'
ok   5 [     20/    232] <  15> 'gluster --mode=script --wignore volume create patchy 10.11.58.155:/d/backends/patchy_snap_drafting1_mnt'
ok   6 [     44/    388] <  17> 'gluster --mode=script --wignore volume start patchy'
ok   7 [     24/    229] <  20> 'gluster --mode=script --wignore volume set patchy changelog.changelog on'
ok   8 [     33/    253] <  21> 'gluster --mode=script --wignore volume set patchy changelog.rollover-time 3'
ok   9 [     40/    145] <  22> '_GFS --attribute-timeout=0 --entry-timeout=0 --volfile-id=patchy --volfile-server=10.11.58.155 /mnt/glusterfs/0'
ok  10 [  14850/   2281] <  50> 'gluster --mode=script --wignore snapshot create patchy-snap1 patchy no-timestamp'
ok  11 [     34/    136] <  51> 'snapshot_exists 0 patchy-snap1'
ok  12 [     55/      8] <  53> 'grep "Enabled changelog barrier" /var/log/glusterfs/bricks/d-backends-patchy_snap_drafting1_mnt.log'
ok  13 [     32/      3] <  54> 'grep "Disabled changelog barrier" /var/log/glusterfs/bricks/d-backends-patchy_snap_drafting1_mnt.log'
ok  14 [     35/     49] <  56> 'glusterfs -s 10.11.58.155 --volfile-id=/snaps/patchy-snap1/patchy /mnt/glusterfs/1'
ok  15 [     35/   1324] <  59> 'gluster --mode=script --wignore volume stop patchy force'
  Logical volume "brick_lvm" successfully removed.
  Logical volume "e834be6d11114e0bb915a7ffba7c10f6_0" successfully removed.
  Logical volume "thinpool" successfully removed.
  Volume group "patchy_snap_drafting1_vg_1" successfully removed
losetup: /d/dev/loop*: failed to use device: No such device
losetup: /d/dev/loop*: failed to use device: No such device
ok
All tests successful.
Files=1, Tests=15, 35 wallclock secs ( 0.02 usr  0.01 sys +  2.25 cusr 10.06 csys = 12.34 CPU)
Result: PASS
Logs preserved in tarball basic_changelog_changelog-snapshot-iteration-1.tar.gz
End of test ./tests/000-flaky/basic_changelog_changelog-snapshot.t
================================================================================

When comparing it to the same test case in the CI test log (https://build.gluster.org/job/centos8-s390-regression/59/consoleFull), I noticed some odd failure info in the CI log:

10:44:10 ok   3 [    137/    105] <  11> 'pidof glusterd'
10:44:10 fallocate: fallocate failed: No space left on device
10:44:10 losetup: /d/backends/patchy_snap_builder_el8_s390x_1_vhd: Warning: file is smaller than 512 bytes; the loop device may be useless or invisible for system tools.
10:44:10   Cannot use /d/backends/patchy_snap_builder_el8_s390x_1_loop: device is too small (pv_min_size)
10:44:10   Cannot use /d/backends/patchy_snap_builder_el8_s390x_1_loop: device is too small (pv_min_size)
10:44:10   Command requires all devices to be found.
10:44:10   Volume group "patchy_snap_builder_el8_s390x_1_vg_1" not found
10:44:10   Cannot process volume group patchy_snap_builder_el8_s390x_1_vg_1
10:44:10   Volume group "patchy_snap_builder_el8_s390x_1_vg_1" not found
10:44:10   Cannot process volume group patchy_snap_builder_el8_s390x_1_vg_1
10:44:10 Error accessing specified device /dev/patchy_snap_builder_el8_s390x_1_vg_1/brick_lvm: No such file or directory
... ...
10:44:10 mount: /d/backends/patchy_snap_builder_el8_s390x_1_mnt: special device /dev/patchy_snap_builder_el8_s390x_1_vg_1/brick_lvm does not exist.

It looks like certain device space setup doesn't work properly. Also, volume start still failed:

10:44:10 volume start: patchy: failed: Commit failed on localhost. Please check log file for details.
10:44:10 not ok   6 [    232/   1577] <  17> 'gluster --mode=script --wignore volume start patchy' -> ''

Could you please run some simple tests on the regression machine, as mentioned by @Shwetha-Acharya and @amarts, to see what is happening?

Thank you very much! @mscherer @rakshithakamath94

kun-lu20 commented 1 year ago

Hi @mscherer @rakshithakamath94 , could you please take a look when you have some time? Thank you!

mscherer commented 1 year ago

I do not really know where to start. And there are already logs for the build; is there a specific part that I would see that can't be seen in either the logs in the tarball or the test output?

kun-lu20 commented 1 year ago

Hi @mscherer , Thank you very much!

A lot of volume start failures can be seen in the CI test log, as shown below, while the same tests pass on our local s390x VMs.

10:44:10 volume start: patchy: failed: Commit failed on localhost. Please check log file for details.
10:44:10 not ok   6 [    232/   1577] <  17> 'gluster --mode=script --wignore volume start patchy' -> ''

Could you please run a single test case (such as ./tests/000-flaky/basic_distribute_rebal-all-nodes-migrate.t) on the s390x builder to check why this happens?
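
As a rough sketch (assuming a built glusterfs checkout on the builder and prove from perl-Test-Harness), a single test case can usually be run with either of:

# via the suite's own runner
./run-tests.sh ./tests/000-flaky/basic_distribute_rebal-all-nodes-migrate.t
# or directly, for verbose per-step output
prove -vf ./tests/000-flaky/basic_distribute_rebal-all-nodes-migrate.t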

Also, I found the following info in today's build log:

09:47:19 libtool: install: /usr/bin/install -c .libs/gluster-setgfid2path /build/install/sbin/gluster-setgfid2path
09:47:19 /usr/bin/install: cannot create regular file '/build/install/sbin/gluster-setgfid2path': No space left on device

I think we might need to free some space on the s390x builders.
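
A couple of quick checks along those lines (paths taken from the job logs; a sketch, assuming root access on the builder):

df -h /build /d                             # the 'No space left on device' errors point at these mounts
du -sh /d/* /build/* 2>/dev/null | sort -h
losetup -a                                  # stale loop devices left over from aborted snapshot tests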

Thanks again!

mscherer commented 1 year ago

So one builder was in a weird state (the 2nd one); I rebooted it.

mscherer commented 1 year ago

The 3rd was also in the same state. /d was full, but du showed only 300 MB used. And several gluster processes were lingering.
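
For what it's worth, when df reports a filesystem as full but du only accounts for a fraction of it, space held by deleted-but-still-open files is the usual suspect, which would fit the lingering gluster processes; a quick check could be (assuming lsof is installed):

lsof +L1 | grep '/d/'                       # open files with link count 0, i.e. deleted but still held open
pkill -f gluster && sleep 5 && df -h /d     # the space should come back once the holders exit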

kun-lu20 commented 1 year ago

Hi @mscherer ,

I checked the logs of the 3rd s390x builder after you rebooted it. The volume could be created successfully but still failed to start. There were also other failures, such as peer probe: failed, in the log.

10:17:27 ======================================== (3 / 833) ========================================
10:17:27 [14:17:27] Running tests in file ./tests/000-flaky/basic_distribute_rebal-all-nodes-migrate.t
10:18:43 ./tests/000-flaky/basic_distribute_rebal-all-nodes-migrate.t .. 
10:18:43 1..38
10:18:43 losetup: /d/dev/loop*: failed to use device: No such device
10:18:43 ok   1 [    219/   3766] <  26> 'launch_cluster 3'
10:18:43 peer probe: failed: Probe returned with Transport endpoint is not connected
10:18:43 not ok   2 [     13/   1148] <  27> 'gluster --mode=script --wignore --glusterd-sock=/d/backends/1/glusterd/gd.sock --log-file=/var/log/glusterfs/basic_distribute_rebal-all-nodes-migrate.t_cli1.log peer probe 127.1.1.2' -> ''
10:18:43 peer probe: failed: Probe returned with Transport endpoint is not connected
10:18:43 not ok   3 [     12/   1107] <  28> 'gluster --mode=script --wignore --glusterd-sock=/d/backends/1/glusterd/gd.sock --log-file=/var/log/glusterfs/basic_distribute_rebal-all-nodes-migrate.t_cli1.log peer probe 127.1.1.3' -> ''
10:18:43 not ok   4 [     16/  60091] <  29> '2 peer_count' -> 'Got "0" instead of "2"'
10:18:43 ok   5 [     15/    103] <  33> 'gluster --mode=script --wignore --glusterd-sock=/d/backends/1/glusterd/gd.sock --log-file=/var/log/glusterfs/basic_distribute_rebal-all-nodes-migrate.t_cli1.log volume create patchy 127.1.1.1:/d/backends/1/dist1 127.1.1.1:/d/backends/1/dist2 127.1.1.2:/d/backends/2/dist3 127.1.1.2:/d/backends/2/dist4'
10:18:43 volume start: patchy: failed: Commit failed on localhost. Please check log file for details.
10:18:43 not ok   6 [     13/   1119] <  35> 'gluster --mode=script --wignore --glusterd-sock=/d/backends/1/glusterd/gd.sock --log-file=/var/log/glusterfs/basic_distribute_rebal-all-nodes-migrate.t_cli1.log volume start patchy' -> ''

I suspect it was caused by something in the environment setup, since these failures didn't occur on our local s390x VMs. Could you please take a look? Thanks very much for your time and effort.

kun-lu20 commented 1 year ago

Hi @amarts @mscherer ,

could you please check this issue again? Thank you so much!

mscherer commented 1 year ago

I think that's outside of my competency with the Gluster code base :/

kun-lu20 commented 1 year ago

Thanks for your time and efforts @mscherer .

Hi @amarts , could you please take a look when you have some time? Thank you.

black-dragon74 commented 1 year ago

On all of the s390x regression runs, I see the following error at the very beginning.

/opt/qa/regression.sh: line 47: python: command not found

I can see all the usual locations present in the PATH variable. Is python installed on the system? If yes (in case it's python3), maybe it is not symlinked as python?
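
A quick way to confirm, and, on EL8, one way to fix it (using alternatives to provide an unversioned python is an assumption about how the builders are set up):

command -v python python3                     # see which interpreters are actually on PATH
alternatives --set python /usr/bin/python3    # EL8's supported way to provide /usr/bin/python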

/cc: @mscherer

Regards

rakshithakamath94 commented 1 year ago

I have applied a temporary local change to see if resolving this error makes the job work (https://build.gluster.org/job/centos8-s390-regression/73/console), and the error is no longer showing in the logs.

Also, I noticed the same kind of error on centos8-regression, but that runs fine.

black-dragon74 commented 1 year ago

Also, I noticed the same kind of error on centos8-regression, but that runs fine.

Agreed.

I also see many losetup-related errors. Is the loop module loaded into the kernel? You can verify with:

grep 'loop' /lib/modules/$(uname -r)/modules.builtin

I think this module is absent or not loaded; otherwise we would have loop devices under /dev (the logs show /dev/loop*: no such device).
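
If it is not builtin, it may still be available as a loadable module; a follow-up check could be (assuming root on the builder):

lsmod | grep -w loop              # is it currently loaded?
modprobe loop && ls /dev/loop*    # load it and confirm the device nodes appear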

P.S.: SSH access to the machine would help and would spare me from asking such questions :P

Regards

kun-lu20 commented 1 year ago

@black-dragon74 @rakshithakamath94 Thanks very much for looking into this issue!

black-dragon74 commented 1 year ago

Can someone help me with the SSH access to the VMs? If yes, I can dig further into this.

Regards

kun-lu20 commented 1 year ago

Hi @mscherer ,

Could you please grant @black-dragon74 the SSH access to the s390x builders so that he can dig further into this issue? Thank you very much!

kun-lu20 commented 1 year ago

Hi @black-dragon74 ,

Hope all is well.

Just want to check if you have any updates or insights reg the volume start failure on the s390x CI.

Thank you!

xhernandez commented 1 year ago

@kun-lu20 I've been running the regression test suite on a private s390x machine and fixing all the issues I detected (some of them were not deficiencies in the tests themselves, but real bugs in Gluster). Currently I'm able to pass the full suite without any failure in a bit less than 5 hours, which is very similar to what the regular builders take.

Most of the patches are already sent and merged, but not all. Once everything is ready, could you try it again?

kun-lu20 commented 1 year ago

Hi @xhernandez ,

That's great to hear. Thank you so much!

Sure, I'll try it on s390x VMs. Please let me know once it's ready.

Since most of the patches are merged, can we re-enable the CI build on s390x builders?

Thanks again.

kun-lu20 commented 1 year ago

Hi @mscherer ,

Hope all is well.

@xhernandez mentioned most of the patches have been merged, and now it takes less than 5 hrs to run the whole regression test suite on s390x.

Once everything is ok, could you please re-enable the s390x builders on the community CI?

Thank you in advance!

xhernandez commented 1 year ago

Hi @xhernandez ,

That's great to hear. Thank you so much!

Sure, I'll try it on s390x VMs. Please let me know once it's ready.

The current devel branch should pass all the tests (sometimes a test could fail, as also happens on the regular machines, but there's no test failing systematically now).

The existing s390x builders needed a couple of tweaks to make them work, but they should be running again very soon.

kun-lu20 commented 1 year ago

Hi @xhernandez ,

Thank you very much! I'll also try the current devel branch on s390x VMs.

kun-lu20 commented 1 year ago

Hi @xhernandez ,

I've run the whole test suite on devel branch, all tests passed on s390x.

Thanks again for the amazing work!

xhernandez commented 1 year ago

Hi @xhernandez ,

I've run the whole test suite on devel branch, all tests passed on s390x.

That's great :)

However, I can't guarantee that it works 100% correctly. More testing should be done with real workloads before using it in production.

kun-lu20 commented 1 year ago

Hi @xhernandez ,

Hope all is well with you.

We observed that the s390x builders were back online for several days the week before last, but it looks like there were still some environment issues, such as No space left on device (https://build.gluster.org/job/centos8-s390-regression/90/console), so the build and regression tests couldn't succeed.

Could you please take a look when you have a chance? Thank you very much!

xhernandez commented 1 year ago

@mscherer can you take a look?

mscherer commented 1 year ago

[root@builder-el8-s390x-3 /]# ls glfs_sigwai* | wc -l
730

Something created a ton of files at the root. So I cleaned that, and I also rebooted the server, as it had a lot of runaway processes.

xhernandez commented 1 year ago

@mscherer it seems that a centos8-s390-regression job is pending but none of the s390x builders comes online to process it.

mscherer commented 1 year ago

Indeed, the 1st builder is not coming back online and Jenkins is just waiting. I stopped it and manually started the 3rd one. It is scheduled.

kun-lu20 commented 1 year ago

Hi @xhernandez @mscherer ,

Thanks so much for checking the regression jobs on the s390x builders. Now the build can succeed and only 13 of 837 test cases failed on s390x (https://build.gluster.org/job/centos8-s390-regression/93/console).

01:47:07 13 test(s) failed 
01:47:07 ./tests/00-geo-rep/00-georep-verify-non-root-setup.t
01:47:07 ./tests/00-geo-rep/00-georep-verify-setup.t
01:47:07 ./tests/00-geo-rep/01-georep-glusterd-tests.t
01:47:07 ./tests/00-geo-rep/bug-1600145.t
01:47:07 ./tests/00-geo-rep/georep-basic-dr-rsync-arbiter.t
01:47:07 ./tests/00-geo-rep/georep-basic-dr-rsync.t
01:47:07 ./tests/00-geo-rep/georep-basic-dr-tarssh-arbiter.t
01:47:07 ./tests/00-geo-rep/georep-basic-dr-tarssh.t
01:47:07 ./tests/00-geo-rep/georep-basic-rsync-ec.t
01:47:07 ./tests/00-geo-rep/georep-basic-tarssh-ec.t
01:47:07 ./tests/00-geo-rep/georep-config-upgrade.t
01:47:07 ./tests/00-geo-rep/georep-stderr-hang.t
01:47:07 ./tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t

@xhernandez Could you please take a look at those remaining failures when you have a chance? I remember that last time, when we tested on the devel branch, these failures didn't occur. Thanks again!

xhernandez commented 1 year ago

I already checked that. Almost all the failing tests are related to geo-replication. There is a problem in the environment that prevents gluster processes from locating the gluster libraries when accessed through ssh (which is used to connect to the secondary volume in geo-replication). I fixed that, but I don't have access to the builders or the Jenkins system to apply the changes and trigger jobs, so we'll need to wait for @mscherer, who will be back on Monday, to check whether this fixes all the issues (besides some spurious failures that we also have even on the "good" builders).
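
For context, one common shape of such a fix (a general illustration only, not necessarily the exact change applied to the builders) is making the gluster library directory known to the dynamic linker on the machine the ssh session lands on:

echo /build/install/lib > /etc/ld.so.conf.d/glusterfs.conf   # install prefix taken from the build log above; adjust as needed
ldconfig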

kun-lu20 commented 1 year ago

Thanks @xhernandez !

kun-lu20 commented 1 year ago

The latest job (https://build.gluster.org/job/centos8-s390-regression/95/) has succeeded. It took only 3.5 hours and all tests passed.

Thanks so much for your time and effort! @xhernandez @mscherer