X server problems when running "terminus drush" prevent CI from running properly

jcnventura commented 4 years ago

A problem we encounter often during our CI processes for deploying Drupal is the following:

(EE)
00:02
Fatal server error:
(EE) Server is already active for display 99
    If this server is no longer running, remove /tmp/.X99-lock
    and start again.
(EE)

or

_XSERVTransmkdir: Owner of /tmp/.X11-unix should be set to root

This usually occurs when running drush commands via terminus: terminus -n drush pantheon_project.env -- -y cr

The CI command uses the following base image: image: quay.io/pantheon-public/build-tools-ci:6.x

Re-running the deployment a few times eventually succeeds, but this can be quite a problem. While it was a nuisance before, it has recently become a major annoyance that can easily turn what was supposed to be a 2-hour job deploying our multiple Pantheon-based sites into a multi-day waste of time.

greg-1-anderson commented 4 years ago

I have never seen that failure. If anyone has any suggestions about a fix, I'd be interested.

From the symptom, it looks like it might have something to do with the headless chrome browser; I don't think that anything else is using X11. However, that is in conflict with your assertion that it happens during terminus drush calls, which are ssh connections, so I'm not sure what is going on.

We inherit our chrome configuration from our base image, circleci/php:7.3-node-browsers.

jcnventura commented 4 years ago

Is there a way to figure out if this error is originating in the terminus environment (i.e., the quay.io/pantheon-public/build-tools-ci:6.x machine) or in the drush environment that terminus is connecting to?
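
One rough way to narrow this down from the CI side, assuming the job allows an extra shell step: the error names /tmp/.X99-lock, so checking whether that file exists inside the CI container, and whether the process it records is still alive, hints at whether an X server is running in the build image at all. The function name check_x_lock is made up for this sketch, and it assumes the usual X server convention that the lock file holds the server's PID.

```shell
# Report whether an X lock file exists and whether its recorded PID is alive.
# check_x_lock is a hypothetical helper, not part of Terminus or the CI image.
check_x_lock() {
  lock_file=${1:-/tmp/.X99-lock}
  if [ ! -f "$lock_file" ]; then
    echo "no lock"
    return 0
  fi
  # The PID is stored space-padded, so strip whitespace before using it.
  lock_pid=$(tr -d ' \n' < "$lock_file")
  # kill -0 only tests for existence; /proc covers processes owned by other users.
  if [ -n "$lock_pid" ] && { kill -0 "$lock_pid" 2>/dev/null || [ -d "/proc/$lock_pid" ]; }; then
    echo "active: pid $lock_pid"
  else
    echo "stale lock"
  fi
}

check_x_lock /tmp/.X99-lock
```

If this prints "no lock" in the CI job right before the terminus call, the X server message is less likely to come from the build container itself.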

jcnventura commented 4 years ago

Also, this CI is not running under CircleCI, but under GitLab CI.

greg-1-anderson commented 4 years ago

Try modifying the call to Terminus to:

terminus -n -vvv drush pantheon_project.env -- -y cr --debug

The placement of the error message vis-a-vis the debug output should make it more clear where the error is being generated.

jcnventura commented 4 years ago

Status Code: 200
 [warning] This environment is in read-only Git mode. If you want to make changes to the codebase of this site (e.g. updating modules or plugins), you will need to toggle into read/write SFTP mode first.
Warning: Permanently added '[appserver.dev.3c84c450-5e4f-4820-9347-9d4f40d12991.drush.in]:2222,[34.90.80.87]:2222' (RSA) to the list of known hosts.
 [preflight] Redispatch to site-local Drush: /code/vendor/drush/drush/drush.
 [preflight] Config paths: /.drush/drush.yml,/code/vendor/drush/drush/drush.yml
 [preflight] Alias paths: /code/web/drush/sites,/code/drush/sites
 [preflight] Commandfile search paths: /code/vendor/drush/drush/src,/opt/pantheon/drupal-extensions
 [bootstrap] Starting bootstrap to site [0.29 sec, 11.58 MB]
 [bootstrap] Drush bootstrap phase 2 [0.29 sec, 11.58 MB]
 [bootstrap] Try to validate bootstrap phase 2 [0.29 sec, 11.58 MB]
 [bootstrap] Try to validate bootstrap phase 2 [0.3 sec, 11.59 MB]
 [bootstrap] Try to bootstrap at phase 2 [0.3 sec, 11.59 MB]
 [bootstrap] Drush bootstrap phase: bootstrapDrupalRoot() [0.3 sec, 11.59 MB]
 [bootstrap] Change working directory to /code/web [0.3 sec, 11.59 MB]
 [bootstrap] Initialized Drupal 8.9.3 root directory at /code/web [0.3 sec, 11.81 MB]
 [bootstrap] Try to validate bootstrap phase 2 [0.3 sec, 11.81 MB]
 [bootstrap] Try to bootstrap at phase 2 [0.31 sec, 12.1 MB]
 [bootstrap] Drush bootstrap phase: bootstrapDrupalSite() [0.31 sec, 12.1 MB]
 [bootstrap] Initialized Drupal site dev-pantheon_project.pantheonsite.io at sites/default [0.31 sec, 12.33 MB]
 [success] Cache rebuild complete. [14.2 sec, 112.88 MB]
 [notice] Command: pantheon_project.dev -- drush [Exit: 137]
 [error]   
(EE) 
00:02
Fatal server error:
(EE) Server is already active for display 99
    If this server is no longer running, remove /tmp/.X99-lock
    and start again.
(EE)

Seems like the drush command is fine, but terminus decides to throw an error for some reason, and then a fatal error that aborts our CI.

greg-1-anderson commented 4 years ago

Is everything from the [error] through the second (EE) in red? Can't imagine why Terminus would fire up X11. Maybe it's the Terminus update checker? Terminus calls curl to check for the latest available version of Terminus; maybe headless chrome causes this call to behave differently, e.g. in some way that requires X11? Seems unlikely, but it's the only thing I have to go on right now.

Unfortunately, setting the hide_update_message configuration setting to true only hides the update message; Terminus still checks its latest version when this is set.

Another way to subvert the update check is to reroute stdout or stdin from Terminus.

Try:

terminus -n -vvv drush pantheon_project.env -- -y cr --debug < /dev/null

If that doesn't work, try redirecting stdout instead, although then you won't be able to see your output, so you'll have to add a check for $? being nonzero. Maybe try piping to tee so that you redirect output and can still see it. That could work.

If the workaround gets you past the terminus drush cr, we could fix Terminus so that it skips the version check if hide_update_message is set.
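
The redirection idea above can be sketched as a small wrapper. This is only an illustration: run_logged is a hypothetical helper name, and it captures output to a file and prints it afterwards instead of piping to tee, because a plain pipe would report tee's exit status rather than the command's unless the shell supports pipefail.

```shell
# Run a command with stdin detached and stdout/stderr redirected to a log file,
# then print the log and return the command's own exit status.
# run_logged is a hypothetical helper name used for this sketch.
run_logged() {
  run_logged_file=$1
  shift
  "$@" < /dev/null > "$run_logged_file" 2>&1
  run_logged_status=$?
  cat "$run_logged_file"
  return "$run_logged_status"
}

# Possible use in a CI step, with the command from this thread:
#   run_logged terminus.log terminus -n drush pantheon_project.env -- -y cr || exit 1
```

The trade-off is that output only appears once the command has finished, which may or may not be acceptable for long-running drush commands.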

jcnventura commented 4 years ago

I'm not sure where that (EE) is coming from. It might be part of GitLab CI.

I'm worried about why drush or terminus exits with a non-zero exit code (137 in this case). Drush has no reason to exit with an error code, as it had just finished with [success] Cache rebuild complete. We run several drush commands in sequence, so I'm guessing that in the case above it managed to run terminus drush cr but errored on terminus drush updb. 137 doesn't seem to be a plausible exit code from drush, so I'm guessing this is from terminus?

danielkorte commented 1 year ago

I’m also getting exit 137 in my CircleCI builds with image: cimg/php:8.1-browsers, but during a terminus drush deploy command:

terminus remote:drush --yes --no-interaction --progress -- $PANTHEON_SITE.$PANTHEON_ENV deploy

[notice] Database updates start.
[success] No pending updates.
[success] Cache rebuild start.
[success] Cache rebuild complete.
[success] Config import start.
[notice] There are no changes to import.
[success] Cache rebuild start.
[notice] Command: *******.dev -- drush deploy [Exit: 137]
[error]

aaronbauman commented 1 year ago

I was getting a 137 here as well, which corresponded to an apparently unrelated MySQL error. Slightly different setup, it sounds like, but after rerunning the install command the 137 went away.

namespacebrian commented 1 year ago

I'm hearing internal chatter that the platform restarts all of the site's services during code deployments, and that this can cause drush commands to error out. Have the folks experiencing problems here tried using build:workflow:wait between their code deployments and drush commands?

greg-1-anderson commented 1 year ago

n.b. terminus build:env:push implicitly calls terminus build:workflow:wait, so those of you who did not modify the default script commands should already be using it.

danielkorte commented 1 year ago

Would you still use terminus build:workflow:wait if your pantheon.yml file includes build_step: false and if so, where would it go?

greg-1-anderson commented 1 year ago

terminus build:env:push and terminus build:workflow:wait are only for use in projects where build_step is false.

If your build scripts already use terminus build:env:push, then you don't need build:workflow:wait, as it is already in use. If you use git push, then you should run terminus build:workflow:wait after the git push to ensure that the platform has processed the code push before you try to use said code.
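
For the git push case, the ordering described above might look like the following sketch. The function name deploy_and_update and its remote, branch, and site.env arguments are placeholders, and the exact arguments accepted by build:workflow:wait should be checked against the installed Build Tools plugin version.

```shell
# Push code, wait for the platform to finish processing the push,
# and only then run database updates through Terminus.
# deploy_and_update and its three arguments are hypothetical names.
deploy_and_update() {
  deploy_remote=$1
  deploy_branch=$2
  deploy_site_env=$3
  git push "$deploy_remote" "$deploy_branch" || return 1
  # Stop here if the wait fails, so drush never runs against unprocessed code.
  terminus -n build:workflow:wait "$deploy_site_env" || return 1
  terminus -n drush "$deploy_site_env" -- -y updb
}

# Possible use: deploy_and_update pantheon master my-site.dev
```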

greg-1-anderson commented 1 year ago

Also, I don't think that it was explicitly stated anywhere in the thread above, but an exit code 137 typically means "out of memory".

There is an alternate theory that the problem is caused by services restarting at the wrong time; while possible, I think this is unlikely. Following Occam's razor, the simplest explanation is that the command is failing because drush deploy is using a lot of memory, and it needed more than was available to php-fpm.
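
The number 137 itself can be decoded. Shells report 128 plus the signal number when a process is terminated by a signal, so 137 is 128 + 9, and signal 9 is SIGKILL. The kernel's out-of-memory killer sends SIGKILL, which is why 137 is commonly read as "out of memory", although anything else that sends SIGKILL produces the same code. A minimal sketch of that arithmetic, with made-up helper names:

```shell
# Print the signal number encoded in an exit status above 128, or nothing.
# signal_from_exit and describe_exit are hypothetical helper names.
signal_from_exit() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  fi
}

# Describe an exit status, naming the signal when one is encoded in it.
describe_exit() {
  describe_sig=$(signal_from_exit "$1")
  if [ -n "$describe_sig" ]; then
    echo "exit $1: terminated by signal $describe_sig (SIG$(kill -l "$describe_sig"))"
  else
    echo "exit $1: ordinary exit status"
  fi
}

describe_exit 137
```

This does not settle which theory is right; it only shows that the process was killed from outside rather than exiting on its own.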

miiimooo commented 1 year ago

Thanks for that @greg-1-anderson

Slightly unrelated but I am seeing this error code on a Pantheon multidev in my deploy automation:

Notice: ] Database updates start.
 ---------------- ------------- --------------- ------------------------------- 
  Module           Update ID     Type            Description                    
 ---------------- ------------- --------------- ------------------------------- 
  system           10100         hook_update_n   10100 - Remove the year 2038   
                                                 date limitation.               
  block_content    10100         hook_update_n   10100 - Update entity          
                                                 definition to handle revision  
                                                 routes.                        
  block_content    10200         hook_update_n   10200 - Remove the unique      
                                                 values constraint from block   
                                                 content info fields.           
  dblog            10100         hook_update_n   10100 - Remove the year 2038   
                                                 date limitation.               
  dblog            10101         hook_update_n   10101 - Converts the 'wid' of  
                                                 the 'watchdog' table to a big  
                                                 integer.                       
  locale           10100         hook_update_n   10100 - Remove the year 2038   
                                                 date limitation.               
  statistics       10100         hook_update_n   10100 - Remove the year 2038   
                                                 date limitation.               
  user             10000         hook_update_n   10000 - Remove non-existent    
                                                 permissions created by         
                                                 migrations.                    
  block_content    block_libra   post-update     Update block_content 'block    
                   ry_view_per                   library' view permission.      
                   mission                                                      
  block_content    move_custom   post-update     Moves the custom block         
                   _block_libr                   library to Content.            
                   ary                                                          
  block_content    sort_permis   post-update     Update permissions for users   
                   sions                         with "administer blocks"       
                                                 permission.                    
  editor           image_lazy_   post-update     Enable filter_image_lazy_load  
                   load                          if editor_file_reference is    
                                                 enabled.                       
  file             add_permiss   post-update     Grant all non-anonymous roles  
                   ions_to_rol                   the 'delete own files'         
                   es                            permission.                    
  layout_builder   timestamp_f   post-update     Update timestamp formatter     
                   ormatter                      settings for Layout Builder    
                                                 fields.                        
  media            oembed_load   post-update     Add the oEmbed loading         
                   ing_attribu                   attribute setting to field     
                   te                            formatter instances.           
  system           enable_pass   post-update     Enable the password            
                   word_compat                   compatibility module.          
                   ibility                                                      
  system           linkset_set   post-update     Add new menu linkset endpoint  
                   tings                         setting.                       
  system           timestamp_f   post-update     Update timestamp formatter     
                   ormatter                      settings for entity view       
                                                 displays.                      
  text             allowed_for   post-update     Add allowed_formats setting    
                   mats                          to existing text fields.       
  views            boolean_cus   post-update     Update Views config schema to  
                   tom_titles                    make boolean custom titles     
                                                 translatable.                  
  views            fix_revisio   post-update     Fix '-revision_id'             
                   n_id_part                     replacement token syntax.      
  views            oembed_eage   post-update     Add eager load option to all   
                   r_load                        oembed type field              
                                                 configurations.                
  views            responsive_   post-update     Add lazy load options to all   
                   image_lazy_                   responsive image type field    
                   load                          configurations.                
  views            timestamp_f   post-update     Update timestamp formatter     
                   ormatter                      settings for views.            
 ---------------- ------------- --------------- ------------------------------- 
Notice: > >  [notice] Update started: block_content_update_10100
Notice: > >  [notice] Update completed: views_post_update_timestamp_formatter
...
Notice: ] Command: d8hrw.sidebyside -- drush deploy [Exit: 137]

It may be of interest that this happens when upgrading a site from Drupal 9 to Drupal 10.

Calling terminus -- drush deploy repeatedly eventually gets through but it's not pretty.
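
Re-running by hand can be wrapped in a small retry helper. This is a sketch only: retry is a hypothetical function, the attempt count and delay are arbitrary, and it assumes the command is safe to run again after a partial failure. It also hides the underlying cause rather than fixing it.

```shell
# Re-run a command up to a maximum number of attempts, sleeping between tries.
# Returns 0 on the first success, otherwise the last failing exit status.
# retry is a hypothetical helper name used for this sketch.
retry() {
  retry_max=$1
  retry_delay=$2
  shift 2
  retry_n=1
  while :; do
    if "$@"; then
      return 0
    else
      retry_status=$?
    fi
    if [ "$retry_n" -ge "$retry_max" ]; then
      echo "giving up after $retry_n attempts (last exit: $retry_status)" >&2
      return "$retry_status"
    fi
    echo "attempt $retry_n failed with exit $retry_status; retrying in ${retry_delay}s" >&2
    sleep "$retry_delay"
    retry_n=$((retry_n + 1))
  done
}

# Possible use, with a placeholder site.env:
#   retry 5 30 terminus -n drush my-site.dev -- deploy
```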

dpagini commented 1 year ago

Hey @miiimooo, I just saw this comment pop up in my email, and I (or my company) have quite an extensive history with this “exit code 137” on Pantheon, so I thought I would weigh in with my experience here. Working with Pantheon, we seem to have determined (not sure how well it’s documented) that this error code happens because, after a code push, the containers are not fully “ready” to run a command like $ drush updb. It doesn’t completely make sense to me, because the command does partially run, but we have almost completely gotten rid of this issue by “waiting” after the deploy. Running the terminus “workflow:wait” command after a deploy made this error code almost completely disappear, but we’ve been discussing lately that this command might not be 100% accurate, so we are currently trying an extra “sleep” after that command. It’s definitely not ideal, but that has solved this issue for my project… hope that helps?

pantheon-systems / terminus-build-tools-plugin

X server problems when running "terminus drush" prevent CI from running properly #356