rfeng2023 / mmcloud


Job FailedToExecute and FailedToComplete due to failing to pull the status of the containers #75

Closed: yiweizh-memverge closed this issue 4 months ago

yiweizh-memverge commented 4 months ago

See some sample jobs below:

Failed to complete

https://54.81.85.209/#/opcenter/jobs/8hzjn4rh4euino8yyw45z

fagent.log

time="2024-04-17T10:27:14.725" level=info msg="inspect image digest" cid=561caf8a025cd535f90f6ca934679f082ccebfd839cf40a777e22a2dadef7211 digest="sha256:84f998b3653b6bc0c5ba787c5cca5d26cca2a84ad448618766b1688b758d07e1" id=eda1a5165cc8987798f0484b4cf621779c0da8017dfe2967e543a90a9be51f39 image=eda1a5165cc8987798f0484b4cf621779c0da8017dfe2967e543a90a9be51f39
time="2024-04-17T10:27:14.725" level=info msg="Request to sync os buffer cache"
time="2024-04-17T10:27:15.428" level=info msg="Collected logs of container" id=561caf8a025c
time="2024-04-17T10:27:20.209" level=info msg="Got container detailed info" checkpointAt="0001-01-01 00:00:00 +0000 UTC" checkpointed=false cid=561caf8a025cd535f90f6ca934679f082ccebfd839cf40a777e22a2dadef7211 exitCode=0 paused=false running=false status=exited
time="2024-04-17T10:27:22.806" level=info msg="inspect image digest" cid=561caf8a025cd535f90f6ca934679f082ccebfd839cf40a777e22a2dadef7211 digest="sha256:84f998b3653b6bc0c5ba787c5cca5d26cca2a84ad448618766b1688b758d07e1" id=eda1a5165cc8987798f0484b4cf621779c0da8017dfe2967e543a90a9be51f39 image=eda1a5165cc8987798f0484b4cf621779c0da8017dfe2967e543a90a9be51f39
time="2024-04-17T10:27:22.806" level=info msg="Request to sync os buffer cache"
time="2024-04-17T10:27:25.249" level=warning msg="Agent is going to shutdown"
time="2024-04-17T10:27:25.249" level=info msg="Ready to backup system logs"
time="2024-04-17T10:27:25.578" level=info msg="Collected logs of container" id=561caf8a025c
time="2024-04-17T10:27:31.081" level=info msg="Collected logs of container" id=561caf8a025c
time="2024-04-17T10:27:44.109" level=info msg="Collected logs of container" id=561caf8a025c
time="2024-04-17T10:28:01.31" level=info msg="Backup system log successfully" out=
time="2024-04-17T10:28:04.337" level=info msg="Collected logs of container" id=561caf8a025c
time="2024-04-17T10:28:15.282" level=info msg="Backup system log successfully" out=
time="2024-04-17T10:28:21.242" level=warning msg="Failed to wait log collector exit" error="Reach max retry count (code: 1033)"
time="2024-04-17T10:28:22.632" level=info msg="stopping the http(s) servers"
time="2024-04-17T10:28:23.723" level=info msg="http: Server closed" addr="0.0.0.0:443"
time="2024-04-17T10:28:23.011" level=info msg="Stopped log collector" id=561caf8a025c

From the log we can see that the container exited gracefully with code 0, but the opcenter never received the completion status, resulting in Failed to Complete.
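As a side note (not part of the original report), the exit status fagent logged can be cross-checked directly on the worker host, assuming a Docker-compatible CLI (docker or podman) is available there; the container ID below is the one from the log above:

```bash
# Query the runtime directly for the container's final state. These are
# standard `docker inspect` template fields; podman accepts the same syntax.
CID=561caf8a025cd535f90f6ca934679f082ccebfd839cf40a777e22a2dadef7211
docker inspect --format \
  'status={{.State.Status}} exitCode={{.State.ExitCode}} finishedAt={{.State.FinishedAt}}' "$CID"
# For this job the expected output is status=exited exitCode=0, matching the
# "Got container detailed info" log line, which is why this looks like a
# status-reporting problem rather than a workload failure.
```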

Failed to execute

https://54.81.85.209/#/opcenter/jobs/v9o7wky28d58w3obrhmiw

fagent.log

cid=11064d466636e719927a8a7855d40d4da00a484f93e0b1dd1c01bb09e384f28d exitCode=0 paused=false running=true status=running
time="2024-04-17T14:50:27.664" level=info msg="Got container detailed info" checkpointAt="0001-01-01 00:00:00 +0000 UTC" checkpointed=false cid=11064d466636e719927a8a7855d40d4da00a484f93e0b1dd1c01bb09e384f28d exitCode=0 paused=false running=true status=running
time="2024-04-17T14:50:27.664" level=info msg="inspect image digest" cid=11064d466636e719927a8a7855d40d4da00a484f93e0b1dd1c01bb09e384f28d digest="sha256:84f998b3653b6bc0c5ba787c5cca5d26cca2a84ad448618766b1688b758d07e1" id=eda1a5165cc8987798f0484b4cf621779c0da8017dfe2967e543a90a9be51f39 image=eda1a5165cc8987798f0484b4cf621779c0da8017dfe2967e543a90a9be51f39
time="2024-04-17T14:50:44.024" level=info msg="inspect image digest" cid=11064d466636e719927a8a7855d40d4da00a484f93e0b1dd1c01bb09e384f28d digest="sha256:84f998b3653b6bc0c5ba787c5cca5d26cca2a84ad448618766b1688b758d07e1" id=eda1a5165cc8987798f0484b4cf621779c0da8017dfe2967e543a90a9be51f39 image=eda1a5165cc8987798f0484b4cf621779c0da8017dfe2967e543a90a9be51f39
time="2024-04-17T14:50:44.078" level=info msg="inspect image digest" cid=11064d466636e719927a8a7855d40d4da00a484f93e0b1dd1c01bb09e384f28d digest="sha256:84f998b3653b6bc0c5ba787c5cca5d26cca2a84ad448618766b1688b758d07e1" id=eda1a5165cc8987798f0484b4cf621779c0da8017dfe2967e543a90a9be51f39 image=eda1a5165cc8987798f0484b4cf621779c0da8017dfe2967e543a90a9be51f39
time="2024-04-17T14:51:51.59" level=info msg="Backup system log successfully" out=
time="2024-04-17T14:53:25.894" level=info msg="Backup system log successfully" out=
time="2024-04-17T14:53:36.991" level=warning msg="Failed to wait log collector exit" error="Reach max retry count (code: 1033)"
time="2024-04-17T14:53:37.981" level=info msg="stopping the http(s) servers"
time="2024-04-17T14:53:46.809" level=info msg="http: Server closed" addr="0.0.0.0:443"
time="2024-04-17T14:53:46.809" level=info msg="Server stopped"

The container had started running, but the opcenter never received its status, resulting in Failed to Execute.

Ashley-Tung commented 4 months ago

For Memverge: This is Ticket 3431

Ashley-Tung commented 4 months ago

Adding a section on the same jobs on the 23* opcenter:

I also want to add that Ru submitted the same jobs on the 23* opcenter on 4/16 and got a different error. I have attached log bundles to some of the jobs, but there are about 200 jobs here as well. She cancelled them because they seemed to hang.

| history.enabled                   | true                                                              | Y        |             |
| image.cachePath                   | s3://opcenter-bucket-a7c76ee0-b4a6-11ee-8b4f-0abb2ed0b8bf/images  | Y        |             |
| image.imageOpInterval             | 200ms                                                             | Y        | Y           |

I believe in this opcenter, the image.cachePath is pointing to a bucket.

The error I keep seeing for all the files is in the job stderr, not from the opcenter. I have checked their buckets and the files exist with no issue; no folders with the same names exist either. I believe she has the 200+ jobs writing to the same output file, which could be a concern:

Failed to read ../output/Fungen_xQTL.cis_results_db.tsv: [Errno 2] No such file or directory: '../output/Fungen_xQTL.cis_results_db.tsv'

Ashley-Tung commented 4 months ago

[Update from Slack] The issue we were seeing on the 54* opcenter (this issue: https://github.com/rfeng2023/mmcloud/issues/75) is likely due to the image cachePath, which still points to the opcenter's local filesystem:

| image.cachePath                   | file:///mnt/memverge/images |

Because there were many jobs loading the image at once, this caused network congestion, which kept the jobs from executing. TODO: what I suggest for the 54* opcenter is creating a bucket for the image cache path; I can then update the system variables to point there. After I do that, I can help upgrade the 54* opcenter to 2.5.4 (just to confirm, this is the old east coast opcenter). @rfeng2023 please create a bucket with an `images` and a `releases` folder and let me know the bucket name so I can set it up.

For the same jobs that ran on the 23* opcenter, Ru confirmed that these are lower priority and that we will look at them at the beginning of May, when the pipeline can be changed back to accommodate the susie_export runs (so that we can replicate them).

Engineering and I did take a look at the script contents, and we have a theory as to why the jobs consistently run into No such file or directory: '../output/Fungen_xQTL.cis_results_db.tsv'. Looking at the `cis_resultsexport` workflows in fine_mapping_post_processing.ipynb, we can see that step 3 removes files:

 bash: expand = "${ }", container = container, stderr = f'{_input:n}.stderr', stdout = f'{_input:n}.stdout', entrypoint=entrypoint   
    rm -rf "${_input:ad}/${name}_cache/"
    rm -rf ${_input:an}.temp
And step 2 looks for those files:
if (file.exists(paste0(${_output:anr},".temp"))) {
        library(tidyverse)

Of course, these files don't necessarily correspond to Fungen_xQTL.cis_results_db.tsv itself, and we do not have the original pipeline script from when this error happened. However, we are concerned about this script behavior: if you have 200+ jobs running the same workflows that remove, add, and write to the same file, you are likely to get file issues. Please keep this in mind when we get back to debugging it; this many jobs writing to the same file can cause unforeseen errors that MMC cannot account for. Please let me know if you have any more questions or concerns.
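To make the concern concrete, here is a minimal, hypothetical sketch (not the real SoS pipeline; the directory, file name, and job count are made up) of concurrent jobs that each delete and rewrite one shared temp file the way steps 2 and 3 interact. Reads intermittently fail with the same Errno 2 error:

```bash
#!/usr/bin/env bash
# Hypothetical illustration only: N concurrent "jobs" share a single output
# path and each one removes, then rewrites, the same temp file, mirroring the
# rm -rf / file.exists() pattern quoted above.
OUTDIR=$(mktemp -d)

run_job() {
  local id=$1
  rm -rf "$OUTDIR/shared.temp"                         # step-3 analogue: delete the shared file
  sleep "0.$(( RANDOM % 5 ))"                          # jobs interleave unpredictably
  echo "result from job $id" > "$OUTDIR/shared.temp"   # rewrite it
  # step-2 analogue: by the time we read it back, another job may have removed it
  cat "$OUTDIR/shared.temp" 2>&1 | sed "s/^/job $id: /"
}

for i in $(seq 1 20); do run_job "$i" & done
wait
# Some runs print "No such file or directory" for a few jobs: the same failure
# mode as 200+ jobs sharing ../output/Fungen_xQTL.cis_results_db.tsv.
```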

Ashley-Tung commented 4 months ago

[Update] (screenshot attached in the original issue; not reproduced here)

Ashley-Tung commented 4 months ago

Closing this issue for now, unless we see it reappear. Both east coast opcenters should now be on 2.5.5.

Ashley-Tung commented 4 months ago

For Memverge: This is ticket 3508