tiredofit / docker-self-service-password

Dockerized Self Service Password Changer with many customizable options
MIT License

Getting error on startup, but not always #20

Closed stefandesu closed 4 years ago

stefandesu commented 4 years ago

I'm getting the following error most of the time when I try to start this up:

ssp    | [cont-init.d] 99-container: executing...
ssp    | **********************************************************************************************************************                                                         
ssp    | **********************************************************************************************************************                                                         
ssp    | ****                                                                                                              ****                                                         
ssp    | ****       ERROR - All scripts have not initialized properly - All services are now halted                        ****                                                         
ssp    | ****             - Please enter the container find out why the missing *-init state file hasn't been written      ****                                                         
ssp    | ****                                                                                                              ****                                                         
ssp    | **********************************************************************************************************************                                                         
ssp    | **********************************************************************************************************************                                                          

But when this error does not come up, SSP works fine.

Here's my docker-compose file:

version: "3"
services:
  ssp:
    image: tiredofit/self-service-password:latest
    container_name: ssp
    labels:
      - traefik.enable=true
      - traefik.frontend.rule=Host:ssp.${DOMAIN}
    volumes:
      - ./data/ssp-data:/www/ssp
      - ./data/ssp-logs:/www/logs
    environment:
      ## LDAP
      - LDAP_SERVER=ldap://openldap
      - LDAP_STARTTLS=false
      - LDAP_BINDDN=${LDAP_ADMIN_USER}
      - LDAP_BINDPASS=${LDAP_ADMIN_PASS}
      - LDAP_BASE_SEARCH=${LDAP_SEARCH_BASEDN}
      - LDAP_LOGIN_ATTRIBUTE=${LDAP_USERNAME_FIELD}
      - LDAP_FULLNAME_ATTRIBUTE=${LDAP_NAME_FIELD}
      ## SSP Password Settings
      - PASSWORD_HASH=CRYPT
      - PASSWORD_MIN_LENGTH=12
      - PASSWORD_NO_REUSE=true
      - PASSWORD_SHOW_POLICY=always
      - PASSWORD_SHOW_POLICY_POSITION=below
      ## SSP Other Settings
      - WHO_CAN_CHANGE_PASSWORD=user
      - QUESTIONS_ENABLED=false
      - IS_BEHIND_PROXY=true
      - SHOW_HELP=true
      - LANG=en
      - DEBUG_MODE=false
      - SECRETEKEY=${SSP_SECRETEKEY}
      - USE_RECAPTCHA=false
      - DEFAULT_ACTION=change
      - USE_TOKENS=false
    networks:
      - traefik
    restart: always
networks:
  traefik:
    external:
      name: traefik_webgateway

Can anyone help? I'd like to follow the hint "Please enter the container find out why the missing *-init state file hasn't been written" but I don't know where in the container to look. I've tried to look at log files but couldn't find anything.

Thanks!

tiredofit commented 4 years ago

Can I get the rest of that error banner, or even better, the full logs? If you're worried about privacy, send them to the email address that is listed in the Changelog.
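If it helps, something along these lines should capture the full output (assuming the ssp service/container name from a compose setup):

docker-compose logs --no-color ssp > ssp.log
# or directly against the container:
docker logs ssp > ssp.log 2>&1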

stefandesu commented 4 years ago

Thanks for your reply, here you go. This is the output when I run docker-compose up, both a working and a non-working log (both with the exact same docker-compose file and the same conditions; nothing changed between the runs as far as I know).

I also found out that this does not happen with version 4.0, but it happens with 4.1.0.

I hope that helps. If you need more verbose logs, I'd be grateful for a hint on how to get them, as I'm pretty new to this whole Docker thing.

Edit: I found this part at the end of /etc/cont-init.d/99-container:

[...]
  ### Final Sanity Check to make sure all scripts have executed and initialized properly, otherwise stop 
  files_init=`ls -l /etc/cont-init.d/* | grep -v ^d | wc -l`
  files_init=`expr $files_init - 1`
  init_complete=`ls -l /tmp/state/*-init | grep -v ^d | wc -l`

  if [ $files_init != $init_complete ]; then
    echo "**********************************************************************************************************************"
    echo "**********************************************************************************************************************"
    echo "****                                                                                                              ****"
    echo "****       ERROR - All scripts have not initialized properly - All services are now halted                        ****"
    echo "****             - Please enter the container find out why the missing *-init state file hasn't been written      ****" 
    echo "****                                                                                                              ****"
    echo "**********************************************************************************************************************"
    echo "**********************************************************************************************************************"
    echo ""
    echo "/etc/cont-init.d:"
    echo "`ls -x /etc/cont-init.d | sed 's#99-container##g'`"
    echo ""
    echo "/tmp/state:"
    echo "`ls -x /tmp/state/*-init`"
    echo ""

    for services in /var/run/s6/services/[0-9]* ; do
      s6-svc -d "$services"
    done
    exit 1
  fi
fi

In the log above, you'll see that the number for init_complete is actually higher than files_init. So maybe a greater-than-or-equal comparison would be more appropriate than !=?
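For illustration, the final check could then look something like this (an untested sketch using the variables from the excerpt above):

  # fail only when fewer state files were written than there are init scripts,
  # instead of failing on any mismatch
  if [ "$init_complete" -lt "$files_init" ]; then
    echo "ERROR - All scripts have not initialized properly - All services are now halted"
    exit 1
  fi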

micw commented 4 years ago

The error is reproducible if you kill the container (docker stop -t 0 containerid) and start it again.
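For example, with the container name from the compose file above:

docker stop -t 0 ssp    # stop immediately, no graceful shutdown
docker start ssp        # start the same container again
docker logs -f ssp      # the sanity check banner should appear here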

micw commented 4 years ago

The root cause is:

After a restart, /tmp/state is not clean, so the init script sees its previous state files and fails.

Solution: use tmpfs for /tmp (in docker-compose, add tmpfs: ["/tmp"]).
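Applied to the compose file from the original post, that would look roughly like this (only the relevant keys shown):

services:
  ssp:
    image: tiredofit/self-service-password:latest
    # a tmpfs mount gives the container a fresh, empty /tmp on every start,
    # so stale *-init state files cannot survive a restart
    tmpfs:
      - /tmp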

tiredofit commented 4 years ago

Sorry for the lack of reply; I didn't get a notification that you had responded and missed this entirely.

What you described definitely makes sense; that is the sanity test functionality that was put in with an update of my base images. There are a couple of things you can do immediately:

SKIP_SANITY_CHECK=TRUE should skip that routine and bypass it when restarting a container

Alternatively, docker-compose down/docker-compose up -d instead of docker-compose restart

Now, if you're worried about losing custom configuration and such, there's also the capability of mapping /assets/custom as a volume; on every container start its contents overwrite the files in /www/ssp. Simply put your files into that folder, following the normal file structure, to have them copied/overwritten. Also, if you want to skip writing the configuration file for SSP, you can use SETUP_TYPE=MANUAL.
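A sketch of how that could look in the compose file above (the ./data/ssp-custom host path is only an example):

services:
  ssp:
    image: tiredofit/self-service-password:latest
    volumes:
      # files placed here, mirroring the normal SSP file structure,
      # are copied over /www/ssp on every container start
      - ./data/ssp-custom:/assets/custom
    environment:
      # skip writing the SSP configuration file entirely
      - SETUP_TYPE=MANUAL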

I think I'm going to have to find a way to determine whether a container is cold or hot started to solve this once and for all. The sanity test is quite useful when we have multiple services running - this nginx-php-fpm based one isn't too complex, but I maintain a few others that have half a dozen services started, and it's there for debugging purposes/support requests.

tiredofit commented 4 years ago

I should also add that version >5.0.0 supports /assets/custom and the SETUP_TYPE variable.

micw commented 4 years ago

@tiredofit both suggested workarounds have drawbacks:

SKIP_SANITY_CHECK=TRUE should skip that routine and bypass it when restarting a container

That would disable the check completely, so real startup failures won't be detected.

Alternatively, docker-compose down/docker-compose up -d instead of docker-compose restart

This is not an option if the host had an unclean restart (crash or hard reset) - in that case, the existing container is restarted by docker, leaving it in a non-working state.

You should fix your base image to clean /tmp/state before running the init scripts.

tiredofit commented 4 years ago

I just had a good think about it while walking the dog and have a potential solution, which involves doing some checks right at container start. Let me think this through further and try to implement it within the next 48 hours.

micw commented 4 years ago

Thank dog ;-)

tiredofit commented 4 years ago

Ha. I've made the change and updated the base images. Docker Hub is busy churning away and rebuilding everything that relies on them. I'd expect it's going to be 24-36 hours before you'll be able to see it.

What I do is check for the existence of /tmp/state; if it's there, I assume the container has been hot restarted and delete the contents of the folder before proceeding. A notification is also produced in the event that it encounters this.
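Conceptually, the check looks something like this (a simplified sketch, not the literal code from the base image):

# state files left over from a previous run mean this is a hot restart,
# so clear them before the init scripts execute
if [ -d /tmp/state ]; then
  echo "NOTICE - Detected a restarted container, removing stale state files from /tmp/state"
  rm -rf /tmp/state/*
fi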

tiredofit commented 4 years ago

Docker Hub rebuilt the current latest. Can you give that a try and see if it solves the issue?

stefandesu commented 4 years ago

@tiredofit Seems to be working, thanks!