Centos-8 regression failure

msaju commented 3 years ago

Centos-8 regression is failing with the error below: java.nio.file.FileSystemException: /home/jenkins/root/workspace/centos8-regression/tests/utils/__pycache__/libcxattr.cpython-36.pyc: Operation not permitted.

deepshikhaaa commented 3 years ago

This issue is actually fixed in https://github.com/gluster/project-infrastructure/issues/93

Root cause: for the past few days, centos8-regression has been taking more than 450 minutes. I have no idea what introduced this. After 450 minutes the build usually gets aborted, so no cleanup happens, and hence we see the leftovers again.

I have cleaned up the workspace and triggered another build. We need to investigate why it is taking so long. I do not see it stuck anywhere, though.
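For reference, the kind of workspace cleanup described above could be sketched roughly as follows. This is only an illustration, not the cleanup that was actually run or the one in the PR mentioned later in the thread. The function name `remove_stale_pycache` is hypothetical, and the sketch assumes the leftovers are Python bytecode cache directories (`__pycache__`) inside the Jenkins workspace. Directories that cannot be deleted (for example because of an "Operation not permitted" error) are reported rather than hidden.

```python
import shutil
from pathlib import Path


def remove_stale_pycache(workspace):
    """Delete every __pycache__ directory under `workspace`.

    Returns (removed, failed): `removed` is a list of deleted paths and
    `failed` is a list of (path, exception) pairs for directories that
    could not be deleted, e.g. because of ownership or permission problems.
    Hypothetical helper, not part of any Gluster tooling.
    """
    removed, failed = [], []
    # Build the full list first so the tree is not modified while walking it.
    for cache_dir in sorted(Path(workspace).rglob("__pycache__")):
        if not cache_dir.is_dir():
            # Already gone (removed together with a parent) or not a directory.
            continue
        try:
            shutil.rmtree(cache_dir)
            removed.append(cache_dir)
        except OSError as exc:
            failed.append((cache_dir, exc))
    return removed, failed
```

Running something like this at the start of a job, rather than only at the end, would also cover the case where the previous build was aborted before its own cleanup step.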

deepshikhaaa commented 3 years ago

https://build.gluster.org/job/centos8-regression/116/console

mscherer commented 3 years ago

See also https://github.com/gluster/project-infrastructure/issues/102 about the long-running job. Not sure what the cause is.

I submitted a PR for the cleanup.

xhernandez commented 3 years ago

A new failure happened: https://build.gluster.org/job/centos8-regression/119/console

The root cause is insufficient space on bricks. The test creates a 4 GiB file that is then migrated by rebalance.

The logs from rebalance show these errors:

[2020-11-09 17:51:06.502608 +0000] I [dht-rebalance.c:1537:dht_migrate_file] 0-patchy-dht: /dir1/bar: attempting to move from patchy-client-1 to patchy-client-0
[2020-11-09 17:51:06.504356 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:1719:client4_0_fallocate_cbk] 0-patchy-client-0: remote operation failed. [{errno=28}, {error=No space left on device}] 
[2020-11-09 17:51:06.504388 +0000] E [MSGID: 109023] [dht-rebalance.c:732:__dht_rebalance_create_dst_file] 0-patchy-dht: fallocate failed for /dir1/bar on patchy-client-0 [No space left on device]
[2020-11-09 17:51:06.504652 +0000] E [MSGID: 0] [dht-rebalance.c:1693:dht_migrate_file] 0-patchy-dht: Create dst failed on - patchy-client-0 for file - /dir1/bar 
[2020-11-09 17:51:06.505302 +0000] E [MSGID: 109023] [dht-rebalance.c:2862:gf_defrag_migrate_single_file] 0-patchy-dht: migrate-data failed for /dir1/bar [No space left on device]
[2020-11-09 17:51:06.507397 +0000] I [MSGID: 109028] [dht-rebalance.c:4690:gf_defrag_status_get] 0-patchy-dht: Rebalance is completed. Time taken is 0.00 secs

This error already happened in the past and it was caused by small bricks, but I thought brick size had been increased since then.

Any idea what's happening?
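As a side note, "No space left on device" entries like the ones quoted above can be picked out of a rebalance log with a small script. The following is a minimal sketch: the regular expression is inferred only from the six log lines in this comment, not from any documented Gluster log format, and the names `LOG_LINE` and `enospc_errors` are hypothetical.

```python
import re

# Layout inferred from the quoted lines:
# [timestamp] LEVEL [MSGID: n] [file:line:function] domain: message
# The MSGID part is optional (the first quoted line has none).
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"(?:\[MSGID: (?P<msgid>\d+)\] )?"
    r"\[(?P<src>[^\]]+)\] (?P<domain>\S+): (?P<msg>.*)$"
)

ENOSPC_TEXT = "No space left on device"


def enospc_errors(lines):
    """Return (timestamp, source, message) for each warning or error
    line whose message mentions ENOSPC. Lines that do not match the
    assumed layout are skipped."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match["level"] in ("W", "E") and ENOSPC_TEXT in match["msg"]:
            hits.append((match["ts"], match["src"], match["msg"]))
    return hits
```

It can be fed an open log file directly, for example `enospc_errors(open(path))`, where `path` is wherever the rebalance log is stored on the builder.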

deepshikhaaa commented 3 years ago

@xhernandez This happened because the job ran on a builder that has 5 GB of space on /d. The builder is builder212.int.aws.gluster.org and has the label 'centos8-testing'. The job config used this label, which is fixed by this commit: https://github.com/gluster/build-jobs/commit/329b97b43a732fcf53a186331ea981676ae8b609. The new centos8 builders with the label 'centos8' have sufficient space, and the job will now pick these new ones.

We might need to manually increase the size of the brick on 212. We will take care of that.
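To catch an undersized builder before the test itself fails, a job could check free space on the brick filesystem up front. The sketch below is only an assumption about how such a guard might look. The `/d` path and the 4 GiB file size come from this thread, while the function name `has_room` and the headroom parameter are hypothetical.

```python
import shutil

GIB = 1024 ** 3


def has_room(path, needed_bytes, headroom_bytes=0):
    """Return True if the filesystem containing `path` has at least
    `needed_bytes + headroom_bytes` free.

    Hypothetical pre-flight check; the headroom is meant to cover a file
    that temporarily needs space on more than one brick of the same
    filesystem, e.g. while rebalance migrates it.
    """
    free = shutil.disk_usage(path).free
    return free >= needed_bytes + headroom_bytes
```

For example, `has_room("/d", 4 * GIB, headroom_bytes=4 * GIB)` would ask whether /d can hold the 4 GiB test file twice, once on the source brick and once on the destination, assuming both bricks live on that filesystem.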

msaju commented 3 years ago

Thanks Deepshikha. Is it possible to manually trigger the job? Let's see if anything else fails.

deepshikhaaa commented 3 years ago

https://build.gluster.org/job/centos8-regression/121/console

msaju commented 3 years ago

Thanks a lot, Xavi and Deepshikha, for the support. The Centos 8 regression is passing now.

mscherer commented 3 years ago

So while we had some issues (likely related to resolv.conf and AWS on reboot), the test suite works. We are still looking at the DNS issue (I guess that's just "new NM + new cloud init"), but in the meantime, I am closing this one.

gluster / project-infrastructure

Centos-8 regression failure #101