project8 / weevil

Rucio Demonstration

Possible issue with FTS renewal #1

Closed · vbalbarin closed this issue 2 years ago

vbalbarin commented 2 years ago

Following the documentation, I was eventually able to upload the test files from the client pod to the XRD pods.
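
(For context, the tutorial's upload step looks roughly like the following; the RSE name, scope, and file name here are assumptions for illustration, not taken from this issue.)

```bash
# Hypothetical recreation of the tutorial's upload step; the RSE name
# (XRD1), scope (test), and file name are assumptions for illustration.
rucio upload --rse XRD1 --scope test ./file1
```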

I ran into issues, however, when deploying the daemon pods using the values in the supplied `daemons.yaml`.

Executing `helm install daemons rucio/rucio-daemons -f daemons.yaml` would hang for five minutes and then fail. Inspecting the pods with `kubectl get pods` showed the daemons in crash/restart loops.

I cleaned up the resources in this failed deployment:

```bash
# kubectl delete jobs --all
job.batch "daemons-renew-fts-proxy-27709572" deleted
job.batch "daemons-renew-fts-proxy-on-helm-install" deleted

# helm uninstall daemons
W0907 14:31:29.149833   62386 warnings.go:70] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob
release "daemons" uninstalled
```

I re-installed the daemons, this time saving the debug output to a file:

```bash
helm install daemons rucio/rucio-daemons -f daemons.yaml --debug > "debug.$(date '+%H%M%S').txt" 2>&1
```

The installation appears to fail while waiting on the `daemons-renew-fts-proxy-on-helm-install` post-install job.

The output captured in the `debug.HHMMSS.txt` file confirms this:

```
install.go:178: [debug] Original chart version: ""
install.go:195: [debug] CHART PATH: /Users/vbalbarin/.config/cache/helm/repository/rucio-daemons-1.29.5.tgz

W0907 14:08:19.830611   61029 warnings.go:70] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob
client.go:128: [debug] creating 23 resource(s)
W0907 14:08:19.911224   61029 warnings.go:70] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob
client.go:128: [debug] creating 1 resource(s)
client.go:540: [debug] Watching for changes to Job daemons-renew-fts-proxy-on-helm-install with timeout of 5m0s
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: ADDED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
Error: INSTALLATION FAILED: failed post-install: timed out waiting for the condition
helm.go:84: [debug] failed post-install: timed out waiting for the condition
INSTALLATION FAILED
main.newInstallCmd.func2
    helm.sh/helm/v3/cmd/helm/install.go:127
github.com/spf13/cobra.(*Command).execute
    github.com/spf13/cobra@v1.4.0/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
    github.com/spf13/cobra@v1.4.0/command.go:974
github.com/spf13/cobra.(*Command).Execute
    github.com/spf13/cobra@v1.4.0/command.go:902
main.main
    helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
    runtime/proc.go:250
runtime.goexit
    runtime/asm_amd64.s:1594
```
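
The 5m0s watch in the log above corresponds to Helm's default `--timeout`. As an aside, the wait can be lengthened to rule out a merely slow (rather than stuck) hook job; this is a diagnostic sketch, not a fix:

```bash
# Give the post-install hook more time before Helm gives up. This only
# distinguishes a slow job from a stuck one; it does not fix the failure.
helm install daemons rucio/rucio-daemons -f daemons.yaml --timeout 15m
```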

Most of the other daemon pods are stuck in a perpetual crash-restart loop:

```
# kubectl get pods
NAME                                          READY   STATUS             RESTARTS        AGE
client                                        1/1     Running            0               21m
daemons-abacus-account-5744f549b7-gld6h       1/1     Running            0               16m
daemons-abacus-rse-77ffc79d65-k8lhb           1/1     Running            0               16m
daemons-conveyor-finisher-5998d75499-qtsrx    0/1     CrashLoopBackOff   7 (3m51s ago)   16m
daemons-conveyor-poller-5f89c8874c-wbmjw      0/1     CrashLoopBackOff   7 (4m6s ago)    16m
daemons-conveyor-submitter-6fd596846b-nrq4l   0/1     CrashLoopBackOff   7 (4m11s ago)   16m
daemons-judge-cleaner-75854cdccc-tv2gn        0/1     CrashLoopBackOff   7 (3m59s ago)   16m
daemons-judge-evaluator-85db44dbfc-9g6m2      0/1     CrashLoopBackOff   7 (3m54s ago)   16m
daemons-judge-injector-56ff469c5c-69767       0/1     CrashLoopBackOff   7 (3m44s ago)   16m
daemons-judge-repairer-7777bd5565-5rj5p       0/1     CrashLoopBackOff   7 (3m42s ago)   16m
daemons-undertaker-658bbfc56d-wkmxm           0/1     CrashLoopBackOff   7 (32s ago)     16m
fts-mysql-db7988d96-msr67                     1/1     Running            0               21m
fts-server-7cb5d7c789-ffxlq                   1/1     Running            0               20m
init                                          0/1     Completed          0               26m
postgres-postgresql-0                         1/1     Running            0               27m
server-rucio-server-7fffc4665d-42pxz          2/2     Running            0               23m
server-rucio-server-auth-6d5dd49947-wkbcz     2/2     Running            0               23m
xrd1                                          1/1     Running            0               21m
xrd2                                          1/1     Running            0               21m
xrd3                                          1/1     Running            0               21m
```

If we bring up the logs for any one of these crashing/restarting pods:

```
# kubectl logs daemons-conveyor-finisher-5998d75499-qtsrx
rucio.cfg not found. will generate one.
INFO:root:Merged 74 configuration values from /tmp/rucio.config.default.cfg
INFO:root:Merged 19 configuration values from /opt/rucio/etc/rucio.config.common.json
INFO:root:Merged 0 configuration values from /opt/rucio/etc/rucio.config.component.json
INFO:root:Merged 0 configuration values from ENV
starting daemon with: conveyor-finisher  --total-threads 1

Traceback (most recent call last):
  File "/usr/local/bin/rucio-conveyor-finisher", line 24, in <module>
    from rucio.daemons.conveyor.finisher import run, stop
  File "/usr/local/lib/python3.6/site-packages/rucio/daemons/conveyor/finisher.py", line 36, in <module>
    from rucio.core import request as request_core, replica as replica_core
  File "/usr/local/lib/python3.6/site-packages/rucio/core/request.py", line 34, in <module>
    from rucio.core.monitor import record_counter, record_timer
  File "/usr/local/lib/python3.6/site-packages/rucio/core/monitor.py", line 79, in <module>
    CLIENT = StatsClient(host=SERVER, port=PORT, prefix=SCOPE)
  File "/usr/local/lib/python3.6/site-packages/statsd/client/udp.py", line 35, in __init__
    host, port, fam, socket.SOCK_DGRAM)[0]
  File "/usr/lib64/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
```

Some research into the source code suggested that these daemons cannot resolve the configured monitoring host inside the local minikube cluster: the traceback above shows `rucio.core.monitor` constructing a `StatsClient` at import time, so an unresolvable `carbon_server` hostname crashes every daemon before it even starts.
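
One way to check this hypothesis from inside the cluster (a sketch; it assumes the `client` pod image ships `getent`) is to test whether the configured `carbon_server` hostname resolves:

```bash
# Test name resolution for the carbon_server value ("rucio") from inside
# the cluster; if this fails, the StatsClient constructor will too.
kubectl exec client -- getent hosts rucio

# List the Services actually present, to see which names would resolve.
kubectl get svc
```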

I cleaned up the broken resources and started again:

```bash
# kubectl delete jobs --all
job.batch "daemons-renew-fts-proxy-27709572" deleted
job.batch "daemons-renew-fts-proxy-on-helm-install" deleted

# helm uninstall daemons
W0907 14:31:29.149833   62386 warnings.go:70] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob
release "daemons" uninstalled
```

I modified `daemons.yaml` by changing the section

```yaml
  monitor:
    carbon_server: "rucio"
    carbon_port: "8125"
    user_scope: "tutorial"
```

to

```yaml
  monitor:
    carbon_server: "localhost"
    carbon_port: "8125"
    user_scope: "tutorial"
```
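
(The same change could likely be applied without editing the file, assuming the `monitor` section sits under the chart's top-level `config` key, which the two-space indentation above suggests; that values path is an assumption.)

```bash
# Hypothetical equivalent of the edit above via --set; the values path
# config.monitor.carbon_server is an assumption based on the indentation.
helm install daemons rucio/rucio-daemons -f daemons.yaml \
  --set config.monitor.carbon_server=localhost
```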

I then re-installed the daemons:

```bash
helm install daemons rucio/rucio-daemons -f daemons.yaml --debug > "debug.$(date '+%H%M%S').txt" 2>&1
```

The installation still appears to get stuck at the `daemons-renew-fts-proxy-on-helm-install` job:

```
# cat debug.HHMMSS.txt

client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:568: [debug] Add/Modify event for daemons-renew-fts-proxy-on-helm-install: MODIFIED
client.go:607: [debug] daemons-renew-fts-proxy-on-helm-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
Error: INSTALLATION FAILED: failed post-install: timed out waiting for the condition
helm.go:84: [debug] failed post-install: timed out waiting for the condition
INSTALLATION FAILED
main.newInstallCmd.func2
    helm.sh/helm/v3/cmd/helm/install.go:127
github.com/spf13/cobra.(*Command).execute
    github.com/spf13/cobra@v1.4.0/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
    github.com/spf13/cobra@v1.4.0/command.go:974
github.com/spf13/cobra.(*Command).Execute
    github.com/spf13/cobra@v1.4.0/command.go:902
main.main
    helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
    runtime/proc.go:250
runtime.goexit
    runtime/asm_amd64.s:1594
```
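
Since the Helm debug output only shows the watch timing out, the job's own logs are the next place to look (a sketch; the job name is taken from the output above, and the job may already have been cleaned up by a hook-delete policy):

```bash
# Pull logs and events for the post-install job that keeps timing out.
kubectl logs job/daemons-renew-fts-proxy-on-helm-install
kubectl describe job daemons-renew-fts-proxy-on-helm-install
```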

However, the rest of the daemons seem to be working:

```
# kubectl get pods
NAME                                          READY   STATUS      RESTARTS   AGE
client                                        1/1     Running     0          42m
daemons-abacus-account-78974dd795-f4b7q       1/1     Running     0          9m29s
daemons-abacus-rse-7bfc7468df-vg2nc           1/1     Running     0          9m29s
daemons-conveyor-finisher-5bbfc845cf-krppn    1/1     Running     0          9m29s
daemons-conveyor-poller-6864fc575f-6xbj7      1/1     Running     0          9m29s
daemons-conveyor-submitter-75dfd87986-h2kjt   1/1     Running     0          9m29s
daemons-judge-cleaner-5868b5dc5c-k5nnj        1/1     Running     0          9m29s
daemons-judge-evaluator-854898bbfc-n4mkl      1/1     Running     0          9m29s
daemons-judge-injector-7cc699749-9zs7t        1/1     Running     0          9m29s
daemons-judge-repairer-6d85564ffc-6xfzj       1/1     Running     0          9m29s
daemons-undertaker-748d7f8776-x8f42           1/1     Running     0          9m29s
fts-mysql-db7988d96-msr67                     1/1     Running     0          42m
fts-server-7cb5d7c789-ffxlq                   1/1     Running     0          41m
init                                          0/1     Completed   0          47m
postgres-postgresql-0                         1/1     Running     0          48m
server-rucio-server-7fffc4665d-42pxz          2/2     Running     0          44m
server-rucio-server-auth-6d5dd49947-wkbcz     2/2     Running     0          44m
xrd1                                          1/1     Running     0          42m
xrd2                                          1/1     Running     0          42m
xrd3                                          1/1     Running     0          42m
```

This would suggest that, in this minikube setup, these daemons require the value `carbon_server: "localhost"`, which always resolves and therefore lets the `StatsClient` be constructed.
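
A minimal illustration of why `localhost` is safe even with no StatsD daemon listening: the traceback shows the client opening a `SOCK_DGRAM` (UDP) socket, and a UDP send to an unused local port is simply dropped rather than raising an error:

```bash
# UDP is connectionless: a send to localhost:8125 succeeds from the
# sender's side even if nothing is listening, so the daemons start fine.
python3 - <<'EOF'
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.sendto(b'rucio.test:1|c', ('localhost', 8125))
print('sent without error')
EOF
```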

The error in FTS renewal seems to be a separate issue.
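
To chase the FTS renewal failure in isolation, the renewal CronJob could be triggered by hand (a sketch; the CronJob name `daemons-renew-fts-proxy` is inferred from the timestamped job names above):

```bash
# Run the proxy-renewal CronJob once, outside the Helm hook, and watch it.
kubectl create job --from=cronjob/daemons-renew-fts-proxy manual-fts-renew
kubectl logs -f job/manual-fts-renew
```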

chicodelarosa commented 2 years ago

@vbalbarin Could you please open a pull request on these changes? Thanks.

vbalbarin commented 2 years ago

Hello Dan,

I hadn't opened a pull request yet because I wanted to check with you first on the validity of my observations. I'll do so now.

I've created a dev branch from main, and I will branch features and bug fixes off of it.

/V
