metal3-io / baremetal-operator

Bare metal host provisioning integration for Kubernetes
Apache License 2.0

[Flake] E2E Fixture flaky with multiple parallel tests #1554

Open mquhuy opened 5 months ago

mquhuy commented 5 months ago

What steps did you take and what happened: In recent weeks, the GitHub-based fixture tests have been failing randomly. This seems to have started after we introduced parallel tests in BMO E2E, and it can be reproduced by running:

export GINKGO_NODES=4
make test-e2e

(With GINKGO_NODES=2, the tests do not always fail.)

We have disabled parallel execution for the fixture tests in https://github.com/metal3-io/baremetal-operator/pull/1543, but we believe this points to an issue with the BMO test mode (which the fixture tests use).

In BMO normal ("ironic") tests, a similar failure was observed when too many tests run in parallel for the local machine to handle (e.g. 3 threads with 3 VMs of 2 vCPUs each, in an environment with 8 CPUs).
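
To make the sizing in that example concrete, here is a small Go sketch of the vCPU arithmetic. The wording above can be read two ways, so both readings are computed; which one applies is an assumption, and `vcpuDemand` is a hypothetical helper, not something from the repository.

```go
package main

import "fmt"

// vcpuDemand returns the total number of vCPUs requested by the test VMs.
func vcpuDemand(threads, vmsPerThread, vcpusPerVM int) int {
	return threads * vmsPerThread * vcpusPerVM
}

func main() {
	const hostCPUs = 8 // from the example above

	// Reading 1 (assumption): 3 VMs in total, one per thread.
	onePerThread := vcpuDemand(3, 1, 2)
	// Reading 2 (assumption): 3 VMs for each of the 3 threads.
	threePerThread := vcpuDemand(3, 3, 2)

	fmt.Printf("one VM per thread:    %d vCPUs on %d CPUs\n", onePerThread, hostCPUs)
	fmt.Printf("three VMs per thread: %d vCPUs on %d CPUs\n", threePerThread, hostCPUs)
}
```

Under the first reading the VMs alone ask for 6 of the 8 CPUs, leaving little for whatever else runs on the host; under the second they ask for 18, more than twice what is available.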

[SynchronizedBeforeSuite] PASSED [79.610 seconds]                                                                                                                                                                                                                                                                                                                                                                                         
[SynchronizedBeforeSuite]                                                                                                                                                                                                                                                                                                                                                                                                                 
/home/ubuntu/baremetal-operator/test/e2e/e2e_suite_test.go:78                                                                                                                                                                                                                                                                                                                                                                             
------------------------------                                                                                                                                                                                                                                                                                                                                                                                                            
• [FAILED] [311.529 seconds]                                                                                                                                                                                                                                                                                                                                                                                                              
Live-ISO [It] should provision a BMH with live ISO and then deprovision it                                                                                                                                                                                                                                                                                                                                                                
/home/ubuntu/baremetal-operator/test/e2e/live_iso_test.go:43                                                                                                                                                                                                                                                                                                                                                                              

  Timeline >>                                                                                                                                                                                                                                                                                                                                                                                                                             
  INFO: Creating namespace live-iso-ops-el5ucm                                                                                                                                                                                                                                                                                                                                                                                            
  INFO: Creating event watcher for namespace "live-iso-ops-el5ucm"                                                                                                                                                                                                                                                                                                                                                                        
  STEP: Creating a secret with BMH credentials @ 02/13/24 08:57:17.972                                                                                                                                                                                                                                                                                                                                                                    
  STEP: Creating a BMH with inspection disabled and hardware details added @ 02/13/24 08:57:17.98                                                                                                                                                                                                                                                                                                                                         
  STEP: Waiting for the BMH to be in provisioning state @ 02/13/24 08:57:17.999                                                                                                                                                                                                                                                                                                                                                           
  STEP: Waiting for the BMH to become provisioned @ 02/13/24 08:57:18.407                                                                                                                                                                                                                                                                                                                                                                 
  [FAILED] in [It] - /home/ubuntu/baremetal-operator/test/e2e/common.go:195 @ 02/13/24 09:02:18.409                                                                                                                                                                                                                                                                                                                                       
  INFO: Deleting namespace live-iso-ops-el5ucm                                                                                                                                                                                                                                                                                                                                                                                            
  << Timeline                                                                                                                                                                                                                                                                                                                                                                                                                             

  [FAILED] Timed out after 300.001s.                                                                                                                                                                                                                                                                                                                                                                                                      
  The function passed to Eventually failed at /home/ubuntu/baremetal-operator/test/e2e/common.go:194 with:                                                                                                                                                                                                                                                                                                                                
  Expected                                                                                                                                                                                                                                                                                                                                                                                                                                
      <v1alpha1.ProvisioningState>: provisioning                                                                                                                                                                                                                                                                                                                                                                                          
  to equal                                                                                                                                                                                                                                                                                                                                                                                                                                
      <v1alpha1.ProvisioningState>: provisioned                                                                                                                                                                                                                                                                                                                                                                                           
  In [It] at: /home/ubuntu/baremetal-operator/test/e2e/common.go:195 @ 02/13/24 09:02:18.409                                                                                                                                                                                                                                                                                                                                              

  Full Stack Trace                                                                                                                                                                                                                                                                                                                                                                                                                        
    github.com/metal3-io/baremetal-operator/test/e2e.WaitForBmhInProvisioningState({_, _}, {{0x388cfc0, 0xc0018822d0}, {{{0x0, 0x0}, {0x0, 0x0}}, {{0xc000951c00, 0xc}, ...}, ...}, ...}, ...)                                                                                                                                                                                                                                            
        /home/ubuntu/baremetal-operator/test/e2e/common.go:195 +0x12f                                                                                                                                                                                                                                                                                                                                                                     
    github.com/metal3-io/baremetal-operator/test/e2e.glob..func8.2()                                                                                                                                                                                                                                                                                                                                                                      
        /home/ubuntu/baremetal-operator/test/e2e/live_iso_test.go:86 +0x771                                                                                                                                                                                                                                                                                                                                                               
------------------------------                                                                                                                                                                                                                                                                                                                                                                                                            
• [FAILED] [316.492 seconds]                                                                                                                                                                                                                                                                                                                                                                                                              
BMH Provisioning and Annotation Management [It] provisions a BMH, applies detached and status annotations, then deprovisions                                                                                                                                                                                                                                                                                                              
/home/ubuntu/baremetal-operator/test/e2e/provisioning_and_annotation_test.go:44                                                                                                                                                                                                                                                                                                                                                           

  Timeline >>                                                                                                                                                                                                                                                                                                                                                                                                                             
  INFO: Creating namespace provisioning-ops-dpjxle                                                                                                                                                                                                                                                                                                                                                                                        
  INFO: Creating event watcher for namespace "provisioning-ops-dpjxle"                                                                                                                                                                                                                                                                                                                                                                    
  STEP: Creating a secret with BMH credentials @ 02/13/24 08:57:17.99                                                                                                                                                                                                                                                                                                                                                                     
  STEP: Creating a BMH with inspection disabled and hardware details added @ 02/13/24 08:57:17.999                                                                                                                                                                                                                                                                                                                                        
  STEP: Waiting for the BMH to become available @ 02/13/24 08:57:18.017                                                                                                                                                                                                                                                                                                                                                                   
  STEP: Patching the BMH to test provisioning @ 02/13/24 08:57:19.031                                                                                                                                                                                                                                                                                                                                                                     
  STEP: Waiting for the BMH to be in provisioning state @ 02/13/24 08:57:19.06                                                                                                                                                                                                                                                                                                                                                            
  STEP: Waiting for the BMH to become provisioned @ 02/13/24 08:57:19.102                                                                                                                                                                                                                                                                                                                                                                 
  INFO: WARNING: Skipping SSH check since SSH_CHECK_PROVISIONED != true                                                                                                                                                                                                                                                                                                                                                                   
  STEP: Retrieving the latest BMH object @ 02/13/24 08:57:20.117                                                                                                                                                                                                                                                                                                                                                                          
  STEP: Adding the detached annotation @ 02/13/24 08:57:20.129                                                                                                                                                                                                                                                                                                                                                                            
  STEP: Saving the status to a JSON string @ 02/13/24 08:57:20.169                                                                                                                                                                                                                                                                                                                                                                        
  STEP: Deleting the BMH @ 02/13/24 08:57:20.169                                                                                                                                                                                                                                                                                                                                                                                          
  STEP: Waiting for the BMH to be deleted @ 02/13/24 08:57:22.186                                                                                                                                                                                                                                                                                                                                                                         
  STEP: Waiting for the secret to be deleted @ 02/13/24 08:57:22.277                                                                                                                                                                                                                                                                                                                                                                      
  STEP: Creating a secret with BMH credentials @ 02/13/24 08:57:22.286                                                                                                                                                                                                                                                                                                                                                                    
  STEP: Recreating the BMH with the previously saved status in the status annotation @ 02/13/24 08:57:22.295                                                                                                                                                                                                                                                                                                                              
  STEP: Checking that the BMH goes directly to 'provisioned' state @ 02/13/24 08:57:22.311                                                                                                                                                                                                                                                                                                                                                
  STEP: Triggering the deprovisioning of the BMH @ 02/13/24 08:57:23.327                                                                                                                                                                                                                                                                                                                                                                  
  STEP: Waiting for the BMH to be in deprovisioning state @ 02/13/24 08:57:23.362                                                                                                                                                                                                                                                                                                                                                         
  STEP: Waiting for the BMH to become available again @ 02/13/24 08:57:23.389                                                                                                                                                                                                                                                                                                                                                             
  [FAILED] in [It] - /home/ubuntu/baremetal-operator/test/e2e/common.go:195 @ 02/13/24 09:02:23.391                                                                                                                                                                                                                                                                                                                                       
  INFO: Deleting namespace provisioning-ops-dpjxle                                                                                                                                                                                                                                                                                                                                                                                        
  << Timeline                                                                                                                                                                                                                                                                                                                                                                                                                             

  [FAILED] Timed out after 300.001s.                                                                                                                                                                                                                                                                                                                                                                                                      
  The function passed to Eventually failed at /home/ubuntu/baremetal-operator/test/e2e/common.go:194 with:                                                                                                                                                                                                                                                                                                                                
  Expected                                                                                                                                                                                                                                                                                                                                                                                                                                
      <v1alpha1.ProvisioningState>: deprovisioning                                                                                                                                                                                                                                                                                                                                                                                        
  to equal                                                                                                                                                                                                                                                                                                                                                                                                                                
      <v1alpha1.ProvisioningState>: available                                                                                                                                                                                                                                                                                                                                                                                             
  In [It] at: /home/ubuntu/baremetal-operator/test/e2e/common.go:195 @ 02/13/24 09:02:23.391                                                                                                                                                                                                                                                                                                                                              

  Full Stack Trace                                                                                                                                                                                                                                                                                                                                                                                                                        
    github.com/metal3-io/baremetal-operator/test/e2e.WaitForBmhInProvisioningState({_, _}, {{0x388cfc0, 0xc000644990}, {{{0x0, 0x0}, {0x0, 0x0}}, {{0xc000badb40, 0x10}, ...}, ...}, ...}, ...)                                                                                                                                                                                                                                           
        /home/ubuntu/baremetal-operator/test/e2e/common.go:195 +0x12f                                                                                                                                                                                                                                                                                                                                                                     
    github.com/metal3-io/baremetal-operator/test/e2e.glob..func9.2()                                                                                                                                                                                                                                                                                                                                                                      
        /home/ubuntu/baremetal-operator/test/e2e/provisioning_and_annotation_test.go:251 +0x1fc5

What did you expect to happen: The fixture tests should pass as long as the number of parallel threads is within what the machine running the tests can handle.

/kind bug

lentzi90 commented 5 months ago

/retitle E2E Fixture flaky with multiple parallel tests

lentzi90 commented 5 months ago

> BMO normal ("ironic") tests do not suffer from this issue.

Are you sure about this? I hope it is just an issue with the test-mode but I would not rule out some other concurrency issue in BMO. It may just be less frequent or harder to spot in other tests. Every now and then we have unexplained timeouts while deprovisioning also in CAPM3 e2e tests so who knows 🤷

mquhuy commented 5 months ago

> > BMO normal ("ironic") tests do not suffer from this issue.
>
> Are you sure about this? I hope it is just an issue with the test-mode but I would not rule out some other concurrency issue in BMO. It may just be less frequent or harder to spot in other tests. Every now and then we have unexplained timeouts while deprovisioning also in CAPM3 e2e tests so who knows 🤷

At least I have not seen this happen in the ironic tests, but I guess I should change the wording. Thank you for pointing it out xD

Rozzii commented 4 months ago

/triage accepted

mboukhalfa commented 1 month ago

This is not visible in CI because we currently set GINKGO_NODES to 1.