rfeng2023 / mmcloud

Fail to restore error #61

Closed: gaow closed this issue 5 months ago

gaow commented 6 months ago

@Ashley-Tung here is another error:

https://23.22.157.8/#/log?jobId=dkffnydjt265buoyonqci&logName=job.events

Ashley-Tung commented 6 months ago

This is documentation from job.events:

We float from an r5.xlarge to an r5.2xlarge.

2024-03-11T06:35:00.095: Job status changed: [Executing -> Floating]
2024-03-11T06:35:00.095: Wave-riding from CPU 4 -> 8, mem 32 -> 64.
2024-03-11T06:35:10.004: Created instance i-032c4156a6a4261d7(r5.2xlarge-Spot) at us-east-1b, waiting for it to initialize
2024-03-11T06:36:55.124: Registered one new host

The job is requeued, but fails to restore:

2024-03-11T06:39:26.035: Requeue job dkffnydjt265buoyonqci
2024-03-11T06:39:26.068: Found previous container f01926d63f15 with checkpoint, ready to resume
2024-03-11T06:39:29.187: Ready to resize float-data volume
2024-03-11T06:39:29.187: Ready to reclaim instance i-0423a3443070a7e8f after migration
2024-03-11T06:39:33.518: Requeue job dkffnydjt265buoyonqci
2024-03-11T06:39:33.550: Found previous container f01926d63f15 with checkpoint, ready to resume
2024-03-11T06:39:34.762: Requeue job dkffnydjt265buoyonqci
2024-03-11T06:39:34.794: Found previous container f01926d63f15 with checkpoint, ready to resume
2024-03-11T06:39:43.144: Failed to wait container started on i-032c4156a6a4261d7
2024-03-11T06:39:43.144: Job status changed: [Floating -> FailToComplete]. Mark job done.
2024-03-11T06:39:43.15: Wave-riding failed, error: Failed to restore: exit status 125 (code: 9011).

Looking at the restore logs (there are three, which lines up with the three "ready to resume" retries), these are the error messages from them, respectively.

restore.log.0

(01.238635) pie: 66: 66: Restored
(01.356291)  14550: Error (criu/files-reg.c:2199): File home/bst2126/output/fsusie/ROSMAP_mQTL.chr11_113210616_117772987.fsusie_mixture_normal_top_pc_weights.rds has bad size 228579328 (expect 101941248)
(01.356356)  14550: Error (criu/files.c:1213): Unable to open fd=4 id=0x18f
(01.358531)  14549: Error (criu/cr-restore.c:1510): 14550 exited, status=1
(01.434368) mnt: Switching to new ns to clean ghosts
(01.434454) Unlink remap /tmp/.criu.mntns.dkAkn9/13-0000000000//tmp/Rscript4309.8qmkBg.cr.1.ghost
(01.434498) Unlink remap /tmp/.criu.mntns.dkAkn9/13-0000000000//tmp/Rscript430a.7LkRIE.cr.2.ghost
(01.434517) Unlink remap /tmp/.criu.mntns.dkAkn9/13-0000000000//tmp/Rscript430b.3TuKu5.cr.3.ghost
(01.434565) Unlink remap /tmp/.criu.mntns.dkAkn9/13-0000000000//tmp/Rscript430c.gdveYS.cr.4.ghost
(01.434688) Error (criu/cr-restore.c:2547): Restoring FAILED.
(01.435138) Error (criu/cgroup.c:1969): cg: cgroupd: recv req error: File exists

restore.log.1

(00.273414)  14550: Error (criu/files-reg.c:2199): File home/bst2126/output/fsusie/ROSMAP_mQTL.chr11_113210616_117772987.fsusie_mixture_normal_top_pc_weights.rds has bad size 228579328 (expect 101941248)
(00.273483)  14550: Error (criu/files.c:1213): Unable to open fd=4 id=0x18f
(00.275635)  14549: Error (criu/cr-restore.c:1510): 14550 exited, status=1
(00.730907) mnt: Switching to new ns to clean ghosts
(00.730952) Unlink remap /tmp/.criu.mntns.tOCHTi/13-0000000000//tmp/Rscript4309.8qmkBg.cr.1.ghost
(00.730995) Unlink remap /tmp/.criu.mntns.tOCHTi/13-0000000000//tmp/Rscript430a.7LkRIE.cr.2.ghost
(00.731014) Unlink remap /tmp/.criu.mntns.tOCHTi/13-0000000000//tmp/Rscript430b.3TuKu5.cr.3.ghost
(00.731048) Unlink remap /tmp/.criu.mntns.tOCHTi/13-0000000000//tmp/Rscript430c.gdveYS.cr.4.ghost
(00.731155) Error (criu/cr-restore.c:2547): Restoring FAILED.
(00.731579) Error (criu/cgroup.c:1969): cg: cgroupd: recv req error: No such file or directory

restore.log.2

(00.365275)  14550: Error (criu/files-reg.c:2199): File home/bst2126/output/fsusie/ROSMAP_mQTL.chr11_113210616_117772987.fsusie_mixture_normal_top_pc_weights.rds has bad size 228579328 (expect 101941248)
(00.365336)  14550: Error (criu/files.c:1213): Unable to open fd=4 id=0x18f
(00.367497)  14549: Error (criu/cr-restore.c:1510): 14550 exited, status=1
(00.383268) mnt: Switching to new ns to clean ghosts
(00.383360) Unlink remap /tmp/.criu.mntns.oQt7tq/13-0000000000//tmp/Rscript4309.8qmkBg.cr.1.ghost
(00.383401) Unlink remap /tmp/.criu.mntns.oQt7tq/13-0000000000//tmp/Rscript430a.7LkRIE.cr.2.ghost
(00.383419) Unlink remap /tmp/.criu.mntns.oQt7tq/13-0000000000//tmp/Rscript430b.3TuKu5.cr.3.ghost
(00.383450) Unlink remap /tmp/.criu.mntns.oQt7tq/13-0000000000//tmp/Rscript430c.gdveYS.cr.4.ghost
(00.383557) Error (criu/cr-restore.c:2547): Restoring FAILED.
(00.383966) Error (criu/cgroup.c:1969): cg: cgroupd: recv req error: No such file or directory
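
The files-reg.c error above means that, at restore time, CRIU found the file on disk with a different size than the one recorded in the checkpoint image, so it cannot reopen the file descriptor and the restore is aborted. As a reference only, here is a minimal Python sketch of that same size check; the path and expected size are copied from restore.log.0, and the leading slash is an assumption, since the log prints the path without one.

# Sketch only: reproduce the size comparison that CRIU reports from files-reg.c.
# The absolute path prefix is an assumption; the expected size is the value
# recorded in the checkpoint image according to restore.log.0.
import os

path = "/home/bst2126/output/fsusie/ROSMAP_mQTL.chr11_113210616_117772987.fsusie_mixture_normal_top_pc_weights.rds"
expected_size = 101941248  # bytes, as recorded in the checkpoint image

actual_size = os.stat(path).st_size
if actual_size != expected_size:
    # This is the condition CRIU reports as "has bad size ... (expect ...)"
    print(f"bad size {actual_size} (expect {expected_size})")
else:
    print("size matches the checkpoint image")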
gaow commented 6 months ago

@Ashley-Tung thanks -- so this is still being investigated?

Ashley-Tung commented 6 months ago

Yes, this is just documentation on what happened in the logs. A ticket has already been filed with Engineering.

yiweizh-memverge commented 6 months ago

Hi @Ashley-Tung, just to add another job that @rfeng2023 asked about this morning. It is also encountering the Failed to Restore error, so I am adding the job details to this issue. Job link: https://23.22.157.8/#/opcenter/jobs/hjazvb63r6rmpax3cdiux

job.events:

2024-03-13T15:28:15.394: Found previous container 84b14444f854 with checkpoint, ready to resume
2024-03-13T15:28:16.420: Requeue job hjazvb63r6rmpax3cdiux
2024-03-13T15:28:16.454: Found previous container 84b14444f854 with checkpoint, ready to resume
2024-03-13T15:28:27.306: Failed to poll job status from i-0f0b34a450e2cdd78, error Failed to restore: exit status 125 (code: 9011)
2024-03-13T15:28:27.402: Job status changed: [Executing -> FailToComplete]. Mark job done.
2024-03-13T15:28:27.402: Job failed. Failed to restore: exit status 125 (code: 9011)

restore.log.0

(01.501882) 293789: Error (criu/files-reg.c:2199): File home/bz2520/output/Fungen_xQTL.ENSG00000197360.cis_results_db.export.rds has bad size 6729962 (expect 2637824)
(01.502010) 293789: Error (criu/files.c:1213): Unable to open fd=4 id=0x176
(01.505754) 293788: Error (criu/cr-restore.c:1510): 293789 exited, status=1
(01.611406) mnt: Switching to new ns to clean ghosts
(01.611608) Error (criu/cr-restore.c:2547): Restoring FAILED.
(01.612141) Error (criu/cgroup.c:1969): cg: cgroupd: recv req error: File exists

restore.log.1

(00.327991) pie: 87: 87: Restored
(00.349266) 293789: Error (criu/files-reg.c:2199): File home/bz2520/output/Fungen_xQTL.ENSG00000197360.cis_results_db.export.rds has bad size 6729962 (expect 2637824)
(00.349359) 293789: Error (criu/files.c:1213): Unable to open fd=4 id=0x176
(00.353424) 293788: Error (criu/cr-restore.c:1510): 293789 exited, status=1
(00.503871) mnt: Switching to new ns to clean ghosts
(00.504138) Error (criu/cr-restore.c:2547): Restoring FAILED.
(00.504632) Error (criu/cgroup.c:1969): cg: cgroupd: recv req error: No such file or directory

restore.log.2

(00.326949) pie: 87: 87: Restored
(00.347002) 293789: Error (criu/files-reg.c:2199): File home/bz2520/output/Fungen_xQTL.ENSG00000197360.cis_results_db.export.rds has bad size 6729962 (expect 2637824)
(00.348244) 293789: Error (criu/files.c:1213): Unable to open fd=4 id=0x176
(00.349106)    111:         Create fd for 59
(00.349117)    111:         Send fd 59 to /crtools-fd-293808-0
(00.349140)    111:         Send fd 59 to /crtools-fd-293809-0
(00.349146) 293808:     Receive fd for 1
(00.349162) 293808: Further fle=0x7f6608bf50c0, pid=293808
(00.349172) 293808:     Receive fd for 2
(00.349175) 293808:     Receive fd for 2
(00.349181)    101:         Create fd for 56
(00.349187)    101:         Send fd 56 to /crtools-fd-293828-0
(00.349202)    101:         Send fd 56 to /crtools-fd-293829-0
(00.349213) 293828:     Receive fd for 1
(00.349229) 293828: Further fle=0x7f6608bf6140, pid=293828
(00.349233) 293828:     Receive fd for 2
(00.349236) 293828:     Receive fd for 2
(00.349392)     98:         Create fd for 50
(00.349406)     98:         Send fd 50 to /crtools-fd-293848-0
(00.349428)     98:         Send fd 50 to /crtools-fd-293849-0
(00.349433) 293848:     Receive fd for 1
(00.349445) 293848: Further fle=0x7f6608bf3040, pid=293848
(00.349448) 293848:     Receive fd for 2
(00.349451) 293848:     Receive fd for 2
(00.351266) 293788: Error (criu/cr-restore.c:1510): 293789 exited, status=1
(00.597639) mnt: Switching to new ns to clean ghosts
(00.597841) Error (criu/cr-restore.c:2547): Restoring FAILED.
(00.598356) Error (criu/cgroup.c:1969): cg: cgroupd: recv req error: No such file or directory
Ashley-Tung commented 6 months ago

Another job, by Bo: https://23.22.157.8/#/opcenter/jobs/oqqdxvst691r8pib6ivjh No restore logs were found, likely because no snapshot was found on our last float?

2024-03-14T02:26:23.754: Failed to wait container started on i-08663d0a6d07c164c
2024-03-14T02:26:23.754: Job status changed: [Executing -> Floating]
2024-03-14T02:26:23.754: Ready to recover workload of i-08663d0a6d07c164c
2024-03-14T02:26:23.828: Instance i-08663d0a6d07c164c is down
2024-03-14T02:26:23.897: Attempt to find snapshot for job to recover
2024-03-14T02:26:23.898: No snapshots found for job to recover
yiweizh-memverge commented 6 months ago

Another job, https://23.22.157.8/#/opcenter/jobs/kxnqpyow6x0kr0k9jp06e

job.events

2024-03-14T18:14:21.29: Job is submitted to instance i-044f2ad36594ecce4 successfully
2024-03-14T18:14:21.294: Job status changed: [Floating -> Executing]
2024-03-14T18:14:21.295: Job floated to instance i-044f2ad36594ecce4 (8 CPU/64 GB) (Spot)
2024-03-14T18:14:21.329: Requeue job kxnqpyow6x0kr0k9jp06e
2024-03-14T18:14:21.364: Found previous container eb62f75a5c2c with checkpoint, ready to resume
2024-03-14T18:14:30.850: Requeue job kxnqpyow6x0kr0k9jp06e
2024-03-14T18:14:30.882: Found previous container eb62f75a5c2c with checkpoint, ready to resume
2024-03-14T18:14:31.932: Requeue job kxnqpyow6x0kr0k9jp06e
2024-03-14T18:14:31.964: Found previous container eb62f75a5c2c with checkpoint, ready to resume
2024-03-14T18:14:41.842: Failed to poll job status from i-044f2ad36594ecce4, error Failed to restore: exit status 125 (code: 9011)
2024-03-14T18:14:41.936: Job status changed: [Executing -> FailToComplete]. Mark job done.
2024-03-14T18:14:41.936: Job failed. Failed to restore: exit status 125 (code: 9011)

restore.log.0

(01.838251) pie: 96: Task 23 exited, status= 1
(01.838310) pie: 95: Task 96 stopped with 19
(02.956999) mnt: Switching to new ns to clean ghosts
(02.957224) Error (criu/cr-restore.c:2547): Restoring FAILED.
(02.957810) Error (criu/cgroup.c:1969): cg: cgroupd: recv req error: File exists

restore.log.1

(00.454682)   1402: Parsed 7effdd5f5000-7effdd6c2000 vma
(00.454684)   1402: Parsed 7effdd5f5000-7effdd6c4000 vma
(00.454686)   1402: Parsed 7effdd5f5000-7effdd6cb000 vma
(00.454687)   1402: Parsed 7effdd5f5000-7effdd6de000 vma
(00.454689)   1402: Parsed 7effdd5f5000-7effdd6e3000 vma
(00.454691)   1402: Parsed 7effdd5f5000-7effdd6e4000 vma
(00.508229) mnt: Switching to new ns to clean ghosts
(00.508395) Error (criu/cr-restore.c:2547): Restoring FAILED.
(00.508886) Error (criu/cgroup.c:1969): cg: cgroupd: recv req error: No such file or directory

restore.log.2

(00.485779) pie: 1402:  mmap(0x7ffc9ed79000 -> 0x7ffc9ed7d000, 0x3 0x32 -1)
(00.485782) pie: 1402:  mmap(0x7ffc9ed7d000 -> 0x7ffc9ed7f000, 0x7 0x32 -1)
(00.485785) pie: 1402: Preadv 0x55bb4e46a000:4096... (87 iovs)
(00.499350) pie: 96: Task 23 exited, status= 1
(00.499433) pie: 95: Task 96 stopped with 19
(00.536417) mnt: Switching to new ns to clean ghosts
(00.536667) Error (criu/cr-restore.c:2547): Restoring FAILED.
(00.537180) Error (criu/cgroup.c:1969): cg: cgroupd: recv req error: No such file or directory
Ashley-Tung commented 6 months ago

https://23.22.157.8/#/opcenter/jobs/y2tjtyt376fvpkxzos19z

Ashley-Tung commented 6 months ago

I resubmitted the kxnqpyow6x0kr0k9jp06e job here and it was successful: https://23.22.157.8/#/opcenter/jobs/s1pdgzlr8d1paqulligdf

Ashley-Tung commented 6 months ago

[UPDATE] This is still being investigated. Error code 9011 appears for several jobs, but rerunning some of those jobs does not reproduce the error. The issue is suspected to be a syncing problem with s3fs.
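
If the s3fs suspicion is right, the size the restore sees through the mount can lag behind the object actually in S3. Below is a minimal sketch of a sanity check one could run before retrying a restore; the bucket, key, and mount point are hypothetical placeholders, not values taken from these jobs.

# Sketch only: compare the size seen through the s3fs mount with the size of
# the object in S3. Bucket, key, and mount point are hypothetical placeholders.
import os
import boto3

bucket = "example-bucket"          # placeholder
key = "output/fsusie/result.rds"   # placeholder
mount_point = "/mnt/s3"            # placeholder: s3fs mount of the bucket root

s3_size = boto3.client("s3").head_object(Bucket=bucket, Key=key)["ContentLength"]
local_size = os.stat(os.path.join(mount_point, key)).st_size

if local_size != s3_size:
    print(f"s3fs mount out of sync: local={local_size} bytes, s3={s3_size} bytes")
else:
    print("local file matches the S3 object size")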

Ashley-Tung commented 5 months ago

[UPDATE] No fix yet, as this is still under investigation, but a fix is expected in 2.5.3.

Ashley-Tung commented 5 months ago

[UPDATE] All of the jobs above that failed with the Bad File Size restore error failed because of an s3fs bug. These jobs were run on the 2.4.1 second private build, which is affected by this sync error. We were able to match this error to a known error from a previous case. Version 2.5+ has an updated s3fs, and we have not seen this issue since the upgrade.

@gaow I suggest we close this issue for now, but we will keep monitoring jobs to see if this pops up again. There are other code 9011 errors we have seen, and we will be filing a ticket/GitHub issue for them soon, but those do not reflect the same restore error shown in the jobs above. Please let me know if you have any more questions or concerns.

gaow commented 5 months ago

Okay, thanks @Ashley-Tung, let's do what you suggest.