open-horizon / devops

Devops processes to build and deploy horizon components
Apache License 2.0

Agbot container stuck in "restarting" status #93

Closed Al-tekreeti closed 2 years ago

Al-tekreeti commented 2 years ago

I am trying to set up the all-in-one management hub on an Ubuntu 18.04 machine and ran into this issue. I tried stopping and restarting with the -S and -s flags, and I also tried stopping and purging with the -SP flags and then setting it up again, but without success. The logs show that the agbot cannot be initialized:

2021-10-07T15:51:22.705263071Z I1007 15:51:22.705147 10 main.go:65] Using config: Edge: {ServiceStorage , APIListen 127.0.0.1:8510, DBPath , DockerEndpoint , DockerCredFilePath , DefaultCPUSet , DefaultServiceRegistrationRAM: 0, StaticWebContent: , PublicKeyPath: , TrustSystemCACerts: false, CACertsPath: , ExchangeURL: , DefaultHTTPClientTimeoutS: 30, PolicyPath: , ExchangeHeartbeat: 0, AgreementTimeoutS: 0, DVPrefix: , RegistrationDelayS: 0, ExchangeMessageTTL: 0, ExchangeMessageDynamicPoll: true, ExchangeMessagePollInterval: 20, ExchangeMessagePollMaxInterval: 120, ExchangeMessagePollIncrement: 20, UserPublicKeyPath: , ReportDeviceStatus: false, TrustCertUpdatesFromOrg: false, TrustDockerAuthFromOrg: false, ServiceUpgradeCheckIntervalS: 300, MultipleAnaxInstances: false, DefaultServiceRetryCount: 2, DefaultServiceRetryDuration: 600, NodeCheckIntervalS: 15, FileSyncService: {APIListen: , APIPort: 0, APIProtocol: , PersistencePath: , AuthenticationPath: , CSSURL: , CSSSSLCert: , PollingRate: 0, ObjectQueueBufferSize: 0}, InitialPollingBuffer: {120}, BlockchainAccountId: , BlockchainDirectoryAddress }, AgreementBot: {TxLostDelayTolerationSeconds: 120, AgreementWorkers: 5, DBPath: , Postgresql: {Host: postgres, Port: 5432, User: admin, Password: **, DBName: exchange, SSLMode: disable MaxOpenConnections: 20}, PartitionStale: 0, ProtocolTimeoutS: 120, AgreementTimeoutS: 360, NoDataIntervalS: 300, ActiveAgreementsURL: , ActiveAgreementsUser: , ActiveAgreementsPW: **, PolicyPath: /home/agbotuser/policy.d/, NewContractIntervalS: 5, ProcessGovernanceIntervalS: 5, IgnoreContractWithAttribs: ethereum_account, ExchangeURL: http://exchange-api:8080/v1/, ExchangeHeartbeat: 5, ExchangeId: IBM/agbot, ExchangeToken: **, DVPrefix: , ActiveDeviceTimeoutS: 180, ExchangeMessageTTL: 1800, MessageKeyPath: msgKey, DefaultWorkloadPW: **, APIListen: 0.0.0.0:8080, SecureAPIListenHost: 0.0.0.0, SecureAPIListenPort: 8083, SecureAPIServerCert: , SecureAPIServerkey: , 
PurgeArchivedAgreementHours: 1, CheckUpdatedPolicyS: 7, CSSURL: http://css-api:8080, CSSSSLCert: , AgreementBatchSize: 300, AgreementQueueSize: 300, MessageQueueScale: 33, QueueHistorySize: 30, FullRescanS: 600, MaxExchangeChanges: 1000, RetryLookBackWindow: 3600, PolicySearchOrder: true, Vault: {{http://vault:8200 }}}, Collaborators: {HTTPClientFactory: &{0x7556a0 0 10}, KeyFileNamesFetcher: &{0x755e90}}, ArchSynonyms: {map[aarch64:arm64 armhf:arm x86_64:amd64]}
2021-10-07T15:51:22.705524354Z I1007 15:51:22.705481 10 main.go:66] GOMAXPROCS: 1
2021-10-07T15:51:22.705587108Z I1007 15:51:22.705563 10 init.go:22] Connecting to Postgresql database: host=postgres port=5432 user=admin dbname=exchange sslmode=disable
2021-10-07T15:51:22.708502419Z I1007 15:51:22.708469 10 init.go:40] Agreementbot 7fd4aea5-6a4d-4924-a59f-33e11d3b5fd5 initializing partitions
2021-10-07T15:51:22.708554453Z I1007 15:51:22.708535 10 init.go:43] Postgresql database tables initializing.
2021-10-07T15:51:22.736060867Z panic: Unable to initialize Agreement Bot: unable to claim a partition, error: unable to claim an unowned partition, error: unable to claim stale, error: pq: query is not a SELECT
2021-10-07T15:51:22.736113111Z
2021-10-07T15:51:22.736330618Z goroutine 1 [running]:
2021-10-07T15:51:22.736503923Z main.main()
2021-10-07T15:51:22.736722773Z /tmp/anax-gopath/src/github.com/open-horizon/anax/main.go:92 +0x210b

If this is the issue, what might be the problem? Your help is highly appreciated!
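For long container logs like the one above, it can help to pull out just the panic line. The following is a minimal sketch, not part of the devops repo: `first_panic` is a hypothetical helper name, and the sample text is copied from the log in this report.

```shell
# Hypothetical helper: print the first line containing "panic:" from log
# text on stdin, with the leading timestamp column removed.
first_panic() {
    grep -m 1 'panic:' | sed 's/^[^ ]* //'
}

# Two lines taken from the log in this issue, used as sample input.
sample='2021-10-07T15:51:22.708554453Z I1007 15:51:22.708535 10 init.go:43] Postgresql database tables initializing.
2021-10-07T15:51:22.736060867Z panic: Unable to initialize Agreement Bot: unable to claim a partition, error: unable to claim an unowned partition, error: unable to claim stale, error: pq: query is not a SELECT'

panic_line=$(printf '%s\n' "$sample" | first_panic)
echo "$panic_line"
```

Against a live container this could be fed from something like `docker logs --timestamps agbot 2>&1 | first_panic`, assuming the container is named `agbot`; the real name on a given host may differ.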

linggao commented 2 years ago

@Al-tekreeti Is the arch amd64? Did the postgres container start in a healthy condition? I have this on my ubuntu 18 node, (NAME="Ubuntu" VERSION="18.04.5 LTS (Bionic Beaver)").

$ docker ps | grep postgres
f1596ba35327   postgres:9                                    "docker-entrypoint.s…"   50 minutes ago   Up 50 minutes (healthy)   5432/tcp

Al-tekreeti commented 2 years ago

@linggao Yes, all containers are healthy except the Agbot. Sometimes the vault container goes unhealthy, but if I restart the system using the -S and -s flags, it becomes healthy. The problem is only with the Agbot container, which stays in a restarting state (it does not even reach the unhealthy state). It is Ubuntu 20.04, not 18.04 as I mentioned earlier; sorry about that. Please see the following command outputs:

$ uname -m
x86_64

$ docker ps | grep postgres
3f462df198da postgres:latest "docker-entrypoint.s…" 4 days ago Up 9 minutes (healthy) 5432/tcp postgres

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal

linggao commented 2 years ago

@Al-tekreeti Can you show me how you start the all-in-one management hub? Your env variables for it?

Al-tekreeti commented 2 years ago

I forked the repo then I started it using the following command:

$ sudo ./deploy-mgmt-hub.sh

I don't pass any environment variables to it, but when the command finishes executing it prints the following tokens:

EXCHANGE_ROOT_PW=TEVUNLQzuQ3KNjYDfwveyEmjpnDZjU
EXCHANGE_HUB_ADMIN_PW=1W5VhfVxqja8e3hGYJyB8vepbWIsNO
EXCHANGE_SYSTEM_ADMIN_PW=qmbM5b6XAPI1cpHbLUBFMTPSVFv7lS
AGBOT_TOKEN=SMDCzgY8b8Irl7xdl5t6aAgYdcHp6F
EXCHANGE_USER_ADMIN_PW=Pe2UlU2RSinxeiDOp4fmEiXiAWNIB7
HZN_DEVICE_TOKEN=Z8Z3BbnDjlrUTYwZcEKGqSrRVJngkQ
VAULT_UNSEAL_KEY=BcbKa05euWxuWdkocZGvS2yijhvTumt236EFK1bloEE=
VAULT_ROOT_TOKEN=s.7zoHVzZNCIN0iCRzN9M1f0rg

linggao commented 2 years ago

@Al-tekreeti I think it is the version of the postgres image (postgres:latest) that caused the problem. Please try to start the management hub with the following command for now.

export POSTGRES_IMAGE_TAG=9
sudo -sE ./deploy-mgmt-hub.sh
(where -E is used to preserve the env variable of the parent)

Al-tekreeti commented 2 years ago

@linggao It did not fix the issue, and it is still pulling postgres with the "latest" tag.

linggao commented 2 years ago

@Al-tekreeti That means the POSTGRES_IMAGE_TAG env variable is not being picked up. Please make sure you clean up all the leftover containers (check with docker ps -a) before restarting, and make sure to use the -sE flags with the sudo command. Alternatively, you can use the following command:

sudo POSTGRES_IMAGE_TAG=9 ./deploy-mgmt-hub.sh

Anyway, find a way in your env to make the script take the POSTGRES_IMAGE_TAG env variable. It seems that the latest release of postgres image broke the code.
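To illustrate why an exported variable may not reach the script, here is a rough sketch. It uses `env -i` (which starts a child with an empty environment) as a stand-in for sudo's environment reset, which is the default policy on Ubuntu but depends on the sudoers configuration. The `child` command is a hypothetical placeholder for the deploy script, not its actual contents.

```shell
# Hypothetical stand-in for the deploy script: report the tag the child sees.
child='echo "tag seen by child: ${POSTGRES_IMAGE_TAG:-<unset>}"'

export POSTGRES_IMAGE_TAG=9

# env -i runs the child with an empty environment, similar in effect to
# sudo resetting the environment: the exported variable is lost.
stripped=$(env -i /bin/sh -c "$child")

# Putting the assignment on the command line hands it to the child directly,
# analogous to: sudo POSTGRES_IMAGE_TAG=9 ./deploy-mgmt-hub.sh
passed=$(env -i POSTGRES_IMAGE_TAG=9 /bin/sh -c "$child")

echo "$stripped"
echo "$passed"
```

After a redeploy, `docker ps | grep postgres` shows which image tag the postgres container is actually running.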

Al-tekreeti commented 2 years ago

@linggao It did fix the issue. All containers now are healthy. Thanks a lot!

linggao commented 2 years ago

@Al-tekreeti Good to hear. Btw, I tested POSTGRES_IMAGE_TAG=13, and it works too. It is version 14, released on 09/30/2021, that broke the code. We will stay with 13 for now. I will make a change in deploy-mgmt-hub.sh to use 13 as the default. Thanks for reporting this issue.
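As an illustration only, a pinned default with an override could look like the sketch below. This is not the actual change to deploy-mgmt-hub.sh; `postgres_image` is a hypothetical helper, and it assumes the script derives the image name from POSTGRES_IMAGE_TAG.

```shell
# Hypothetical helper, not taken from deploy-mgmt-hub.sh.
postgres_image() {
    # ${VAR:-13} expands to 13 only when VAR is unset or empty,
    # so a caller-supplied tag still takes precedence.
    echo "postgres:${POSTGRES_IMAGE_TAG:-13}"
}

unset POSTGRES_IMAGE_TAG
default_image=$(postgres_image)                      # no tag supplied
pinned_image=$(POSTGRES_IMAGE_TAG=9 postgres_image)  # caller overrides the tag

echo "$default_image"
echo "$pinned_image"
```

Pinning a major version this way keeps a new upstream release under the "latest" tag from changing the deployed database unexpectedly, while still letting users opt into another tag.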

Al-tekreeti commented 2 years ago

You're welcome! Closing the issue.