sonic-net / SONiC

Landing page for Software for Open Networking in the Cloud (SONiC) - https://sonic-net.github.io/SONiC/

Processes inside Docker containers are not running #797

Open abhiranjeet opened 3 years ago

abhiranjeet commented 3 years ago

I built Docker images from the azure/sonic-buildimage repository with PLATFORM=vs on Ubuntu Server 18.04 LTS. The build succeeded and created images for all components in the /target directory. After loading those .gz Docker images, I used "docker run" commands to start the SONiC containers one by one. Some of the containers start and some exit, but the ones that are running have no processes running inside. Snapshots are below.

  1. /target directory (screenshot)

  2. docker images (screenshot)

  3. docker ps -a (screenshot)

  4. An example: ssh into the sonic-telemetry-vs container and check the running processes and supervisord logs (screenshot)

    root@5beef3a2100b:/# cat /var/log/supervisor/supervisord.log
    2021-06-13 05:59:48,566 INFO Included extra file "/etc/supervisor/conf.d/supervisord.conf" during parsing
    2021-06-13 05:59:48,566 INFO Set uid to user 0 succeeded
    2021-06-13 05:59:48,572 INFO RPC interface 'supervisor' initialized
    2021-06-13 05:59:48,572 CRIT Server 'unix_http_server' running without any HTTP authentication checking
    2021-06-13 05:59:48,573 INFO supervisord started with pid 1
    2021-06-13 05:59:49,576 INFO spawned: 'dependent-startup' with pid 9
    2021-06-13 05:59:49,579 INFO spawned: 'supervisor-proc-exit-listener' with pid 10
    2021-06-13 05:59:50,830 INFO success: dependent-startup entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
    2021-06-13 05:59:50,831 INFO success: supervisor-proc-exit-listener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
    2021-06-13 05:59:50,841 INFO spawned: 'rsyslogd' with pid 13
    2021-06-13 05:59:51,885 INFO success: rsyslogd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
    2021-06-13 05:59:52,904 INFO spawned: 'start' with pid 17
    2021-06-13 05:59:52,904 INFO success: start entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
    2021-06-13 05:59:52,919 INFO exited: start (exit status 0; expected)
    2021-06-13 05:59:52,934 INFO spawned: 'telemetry' with pid 19
    2021-06-13 05:59:53,227 INFO exited: telemetry (exit status 255; not expected)
    2021-06-13 05:59:54,231 INFO spawned: 'telemetry' with pid 44
    2021-06-13 05:59:54,495 INFO exited: telemetry (exit status 255; not expected)
    2021-06-13 05:59:56,516 INFO spawned: 'telemetry' with pid 69
    2021-06-13 05:59:56,777 INFO exited: telemetry (exit status 255; not expected)
    2021-06-13 05:59:59,798 INFO spawned: 'telemetry' with pid 94
    2021-06-13 06:00:00,059 INFO exited: telemetry (exit status 255; not expected)
    2021-06-13 06:00:01,061 INFO gave up: telemetry entered FATAL state, too many start retries too quickly

mchomnic commented 3 years ago

Hi. Can you check /var/log/syslog for errors and failures? It should tell you why the containers are failing.

abhiranjeet commented 3 years ago

Hi, I checked these logs:

/usr/local/lib/python3.7/dist-packages/supervisor/options.py:474: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
  'Supervisord is running as root and it is searching '
2021-06-13 05:59:48,566 INFO Included extra file "/etc/supervisor/conf.d/supervisord.conf" during parsing
2021-06-13 05:59:48,566 INFO Set uid to user 0 succeeded
2021-06-13 05:59:48,572 INFO RPC interface 'supervisor' initialized
2021-06-13 05:59:48,572 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2021-06-13 05:59:48,573 INFO supervisord started with pid 1
2021-06-13 05:59:49,576 INFO spawned: 'dependent-startup' with pid 9
2021-06-13 05:59:49,579 INFO spawned: 'supervisor-proc-exit-listener' with pid 10
2021-06-13 05:59:50,830 INFO success: dependent-startup entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-06-13 05:59:50,831 INFO success: supervisor-proc-exit-listener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-06-13 05:59:50,841 INFO spawned: 'rsyslogd' with pid 13
2021-06-13 05:59:51,885 INFO success: rsyslogd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-06-13 05:59:52,904 INFO spawned: 'start' with pid 17
2021-06-13 05:59:52,904 INFO success: start entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2021-06-13 05:59:52,919 INFO exited: start (exit status 0; expected)
2021-06-13 05:59:52,934 INFO spawned: 'telemetry' with pid 19
2021-06-13 05:59:53,227 INFO exited: telemetry (exit status 255; not expected)
2021-06-13 05:59:54,231 INFO spawned: 'telemetry' with pid 44
2021-06-13 05:59:54,495 INFO exited: telemetry (exit status 255; not expected)
2021-06-13 05:59:56,516 INFO spawned: 'telemetry' with pid 69
2021-06-13 05:59:56,777 INFO exited: telemetry (exit status 255; not expected)
2021-06-13 05:59:59,798 INFO spawned: 'telemetry' with pid 94
2021-06-13 06:00:00,059 INFO exited: telemetry (exit status 255; not expected)
2021-06-13 06:00:01,061 INFO gave up: telemetry entered FATAL state, too many start retries too quickly

Does this help?
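
For anyone reading a longer supervisord.log, the restart loop above can be summarized with a short script. This is only a sketch: the regular expressions are written against the line format shown in this log, and the function name `summarize` is made up for illustration. It says which program keeps dying and with what status, not why.

```python
import re
from collections import Counter

# Matches lines such as:
#   2021-06-13 05:59:53,227 INFO exited: telemetry (exit status 255; not expected)
EXITED = re.compile(r"INFO exited: (\S+) \(exit status (\d+); not expected\)")
# Matches lines such as:
#   2021-06-13 06:00:01,061 INFO gave up: telemetry entered FATAL state, ...
GAVE_UP = re.compile(r"INFO gave up: (\S+) entered FATAL state")


def summarize(lines):
    """Count unexpected exits per (program, status) and list programs that went FATAL."""
    exits = Counter()
    fatal = []
    for line in lines:
        m = EXITED.search(line)
        if m:
            exits[(m.group(1), int(m.group(2)))] += 1
            continue
        m = GAVE_UP.search(line)
        if m:
            fatal.append(m.group(1))
    return exits, fatal


# Excerpt copied from the log posted in this thread.
SAMPLE = """\
2021-06-13 05:59:52,919 INFO exited: start (exit status 0; expected)
2021-06-13 05:59:52,934 INFO spawned: 'telemetry' with pid 19
2021-06-13 05:59:53,227 INFO exited: telemetry (exit status 255; not expected)
2021-06-13 05:59:54,231 INFO spawned: 'telemetry' with pid 44
2021-06-13 05:59:54,495 INFO exited: telemetry (exit status 255; not expected)
2021-06-13 05:59:56,516 INFO spawned: 'telemetry' with pid 69
2021-06-13 05:59:56,777 INFO exited: telemetry (exit status 255; not expected)
2021-06-13 05:59:59,798 INFO spawned: 'telemetry' with pid 94
2021-06-13 06:00:00,059 INFO exited: telemetry (exit status 255; not expected)
2021-06-13 06:00:01,061 INFO gave up: telemetry entered FATAL state, too many start retries too quickly
"""

exits, fatal = summarize(SAMPLE.splitlines())
for (program, status), count in exits.items():
    print(f"{program}: {count} unexpected exits with status {status}")
print("FATAL:", ", ".join(fatal))
```

On this excerpt it reports four unexpected exits of telemetry with status 255, followed by the FATAL state. The expected exit of `start` with status 0 is ignored.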

kylekyle commented 3 years ago

Same issue here, except I'm running on a physical switch. Logs look pretty much the same. All of the docker images are available, but none are running except docker-database. Any ideas on how to debug this?

abhiranjeet commented 3 years ago

Yeah. Later on I built an image for an Edgecore switch with one change: PLATFORM=broadcom. You might see your database container running, but you still have to cd into /usr/bin on that switch and look for a script named database.sh. Run that script with this command: ./database.sh start

kylekyle commented 3 years ago

Yarg. Didn't work for the Arista 7170 swi:

sudo ./database.sh start
Starting existing database container
database
Traceback (most recent call last):
  File "/usr/local/bin/sonic-cfggen", line 431, in <module>
    main()
  File "/usr/local/bin/sonic-cfggen", line 326, in main
    _process_json(args, data)
  File "/usr/local/bin/sonic-cfggen", line 237, in _process_json
    deep_update(data, FormatConverter.to_deserialized(json.load(stream)))
  File "/usr/lib/python3.7/json/__init__.py", line 296, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
True

I suspect that it is trying to parse /etc/sonic/config_db.json which is empty for some reason on a fresh install. Is that file meant to be populated manually?
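
A quick way to test that suspicion is to check whether the file is missing, empty, or invalid JSON. The sketch below is an assumption-laden illustration: the helper name `check_json_file` is made up, and nothing in the traceback confirms which file sonic-cfggen was reading. It only shows that an empty input produces exactly the error in the traceback.

```python
import json
import os
import tempfile


def check_json_file(path):
    """Return 'missing', 'empty', 'ok', or 'invalid: <reason>' for the file at path."""
    if not os.path.exists(path):
        return "missing"
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    if not text.strip():
        return "empty"
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"invalid: {exc}"
    return "ok"


# Demonstrate on throwaway files rather than on a real switch.
with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "empty.json")
    open(empty_path, "w").close()  # zero-byte file
    good_path = os.path.join(tmp, "good.json")
    with open(good_path, "w", encoding="utf-8") as f:
        f.write("{}")  # smallest valid JSON object, not a real SONiC config

    print(check_json_file(empty_path))
    print(check_json_file(good_path))

# Parsing an empty string raises the same error as in the traceback.
try:
    json.loads("")
except json.JSONDecodeError as exc:
    print(exc)
```

On the switch, the call would be `check_json_file("/etc/sonic/config_db.json")`, assuming that is the file in question.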

abhiranjeet commented 3 years ago

Is your switch one of these? (screenshot)

kylekyle commented 3 years ago

Yep - the 7170-32.

mchomnic commented 3 years ago

Can you try downloading and deploying one of the SONiC images for the Tofino ASIC? https://sonic-build.azurewebsites.net/ui/sonic/pipelines/146/builds?branchName=master

kylekyle commented 3 years ago

It looks like my issue was unrelated to the OP's.

It turns out I am on a 7170-32C, not a 32CD. Some issues with SKUs and port mapping names were preventing things from loading properly. @Staphylo figured out what was going on, and I'm up and running now.