gaow closed this issue 5 months ago.
This is documentation from job.events:
We float from an r5.xlarge to an r5.2xlarge.
2024-03-11T06:35:00.095: Job status changed: [Executing -> Floating]
2024-03-11T06:35:00.095: Wave-riding from CPU 4 -> 8, mem 32 -> 64.
2024-03-11T06:35:10.004: Created instance i-032c4156a6a4261d7(r5.2xlarge-Spot) at us-east-1b, waiting for it to initialize
2024-03-11T06:36:55.124: Registered one new host
The job is requeued but fails to restore:
2024-03-11T06:39:26.035: Requeue job dkffnydjt265buoyonqci
2024-03-11T06:39:26.068: Found previous container f01926d63f15 with checkpoint, ready to resume
2024-03-11T06:39:29.187: Ready to resize float-data volume
2024-03-11T06:39:29.187: Ready to reclaim instance i-0423a3443070a7e8f after migration
2024-03-11T06:39:33.518: Requeue job dkffnydjt265buoyonqci
2024-03-11T06:39:33.550: Found previous container f01926d63f15 with checkpoint, ready to resume
2024-03-11T06:39:34.762: Requeue job dkffnydjt265buoyonqci
2024-03-11T06:39:34.794: Found previous container f01926d63f15 with checkpoint, ready to resume
2024-03-11T06:39:43.144: Failed to wait container started on i-032c4156a6a4261d7
2024-03-11T06:39:43.144: Job status changed: [Floating -> FailToComplete]. Mark job done.
2024-03-11T06:39:43.15: Wave-riding failed, error: Failed to restore: exit status 125 (code: 9011).
Looking at the restore logs (there are three, which lines up with the three "ready to resume" retries), these are the error messages from each of them, respectively.
restore.log.0
(01.238635) pie: 66: 66: Restored
(01.356291) 14550: Error (criu/files-reg.c:2199): File home/bst2126/output/fsusie/ROSMAP_mQTL.chr11_113210616_117772987.fsusie_mixture_normal_top_pc_weights.rds has bad size 228579328 (expect 101941248)
(01.356356) 14550: Error (criu/files.c:1213): Unable to open fd=4 id=0x18f
(01.358531) 14549: Error (criu/cr-restore.c:1510): 14550 exited, status=1
(01.434368) mnt: Switching to new ns to clean ghosts
(01.434454) Unlink remap /tmp/.criu.mntns.dkAkn9/13-0000000000//tmp/Rscript4309.8qmkBg.cr.1.ghost
(01.434498) Unlink remap /tmp/.criu.mntns.dkAkn9/13-0000000000//tmp/Rscript430a.7LkRIE.cr.2.ghost
(01.434517) Unlink remap /tmp/.criu.mntns.dkAkn9/13-0000000000//tmp/Rscript430b.3TuKu5.cr.3.ghost
(01.434565) Unlink remap /tmp/.criu.mntns.dkAkn9/13-0000000000//tmp/Rscript430c.gdveYS.cr.4.ghost
(01.434688) Error (criu/cr-restore.c:2547): Restoring FAILED.
(01.435138) Error (criu/cgroup.c:1969): cg: cgroupd: recv req error: File exists
restore.log.1
(00.273414) 14550: Error (criu/files-reg.c:2199): File home/bst2126/output/fsusie/ROSMAP_mQTL.chr11_113210616_117772987.fsusie_mixture_normal_top_pc_weights.rds has bad size 228579328 (expect 101941248)
(00.273483) 14550: Error (criu/files.c:1213): Unable to open fd=4 id=0x18f
(00.275635) 14549: Error (criu/cr-restore.c:1510): 14550 exited, status=1
(00.730907) mnt: Switching to new ns to clean ghosts
(00.730952) Unlink remap /tmp/.criu.mntns.tOCHTi/13-0000000000//tmp/Rscript4309.8qmkBg.cr.1.ghost
(00.730995) Unlink remap /tmp/.criu.mntns.tOCHTi/13-0000000000//tmp/Rscript430a.7LkRIE.cr.2.ghost
(00.731014) Unlink remap /tmp/.criu.mntns.tOCHTi/13-0000000000//tmp/Rscript430b.3TuKu5.cr.3.ghost
(00.731048) Unlink remap /tmp/.criu.mntns.tOCHTi/13-0000000000//tmp/Rscript430c.gdveYS.cr.4.ghost
(00.731155) Error (criu/cr-restore.c:2547): Restoring FAILED.
(00.731579) Error (criu/cgroup.c:1969): cg: cgroupd: recv req error: No such file or directory
restore.log.2
(00.365275) 14550: Error (criu/files-reg.c:2199): File home/bst2126/output/fsusie/ROSMAP_mQTL.chr11_113210616_117772987.fsusie_mixture_normal_top_pc_weights.rds has bad size 228579328 (expect 101941248)
(00.365336) 14550: Error (criu/files.c:1213): Unable to open fd=4 id=0x18f
(00.367497) 14549: Error (criu/cr-restore.c:1510): 14550 exited, status=1
(00.383268) mnt: Switching to new ns to clean ghosts
(00.383360) Unlink remap /tmp/.criu.mntns.oQt7tq/13-0000000000//tmp/Rscript4309.8qmkBg.cr.1.ghost
(00.383401) Unlink remap /tmp/.criu.mntns.oQt7tq/13-0000000000//tmp/Rscript430a.7LkRIE.cr.2.ghost
(00.383419) Unlink remap /tmp/.criu.mntns.oQt7tq/13-0000000000//tmp/Rscript430b.3TuKu5.cr.3.ghost
(00.383450) Unlink remap /tmp/.criu.mntns.oQt7tq/13-0000000000//tmp/Rscript430c.gdveYS.cr.4.ghost
(00.383557) Error (criu/cr-restore.c:2547): Restoring FAILED.
(00.383966) Error (criu/cgroup.c:1969): cg: cgroupd: recv req error: No such file or directory
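All three attempts fail on the same files-reg.c "bad size" check. A quick way to pull the offending files and sizes out of a set of restore logs is a small script like the sketch below (the `restore.log.*` location is an assumption; adjust the glob to wherever the opcenter writes them):

```python
import re
from pathlib import Path

# Matches CRIU files-reg.c "bad size" errors, e.g.:
# (01.356291) 14550: Error (criu/files-reg.c:2199): File <path> has bad size 228579328 (expect 101941248)
BAD_SIZE = re.compile(
    r"Error \(criu/files-reg\.c:\d+\): File (?P<path>\S+) "
    r"has bad size (?P<actual>\d+) \(expect (?P<expected>\d+)\)"
)

def scan_restore_log(text: str):
    """Yield (path, actual_size, expected_size) for every bad-size error."""
    for m in BAD_SIZE.finditer(text):
        yield m["path"], int(m["actual"]), int(m["expected"])

if __name__ == "__main__":
    # Hypothetical location of the per-attempt restore logs.
    for log in sorted(Path(".").glob("restore.log.*")):
        for path, actual, expected in scan_restore_log(log.read_text()):
            print(f"{log.name}: {path} is {actual} bytes, checkpoint expected {expected}")
```

Run against the logs above, each attempt reports the same file with the same actual/expected pair, which points at the file's on-disk size having changed after the checkpoint was taken rather than at a transient restore race.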
@Ashley-Tung thanks -- so this is still being investigated?
Yes, this is just documentation of what happened in the logs. A ticket has already been filed with Engineering.
Hi @Ashley-Tung, just to add another job that @rfeng2023 asked about this morning. This job is also encountering the Failed to Restore error, so I am adding its details to this issue. Job link: https://23.22.157.8/#/opcenter/jobs/hjazvb63r6rmpax3cdiux
job.events:
2024-03-13T15:28:15.394: Found previous container 84b14444f854 with checkpoint, ready to resume
2024-03-13T15:28:16.420: Requeue job hjazvb63r6rmpax3cdiux
2024-03-13T15:28:16.454: Found previous container 84b14444f854 with checkpoint, ready to resume
2024-03-13T15:28:27.306: Failed to poll job status from i-0f0b34a450e2cdd78, error Failed to restore: exit status 125 (code: 9011)
2024-03-13T15:28:27.402: Job status changed: [Executing -> FailToComplete]. Mark job done.
2024-03-13T15:28:27.402: Job failed. Failed to restore: exit status 125 (code: 9011)
restore.log.0
(01.501882) 293789: Error (criu/files-reg.c:2199): File home/bz2520/output/Fungen_xQTL.ENSG00000197360.cis_results_db.export.rds has bad size 6729962 (expect 2637824)
(01.502010) 293789: Error (criu/files.c:1213): Unable to open fd=4 id=0x176
(01.505754) 293788: Error (criu/cr-restore.c:1510): 293789 exited, status=1
(01.611406) mnt: Switching to new ns to clean ghosts
(01.611608) Error (criu/cr-restore.c:2547): Restoring FAILED.
(01.612141) Error (criu/cgroup.c:1969): cg: cgroupd: recv req error: File exists
restore.log.1
(00.327991) pie: 87: 87: Restored
(00.349266) 293789: Error (criu/files-reg.c:2199): File home/bz2520/output/Fungen_xQTL.ENSG00000197360.cis_results_db.export.rds has bad size 6729962 (expect 2637824)
(00.349359) 293789: Error (criu/files.c:1213): Unable to open fd=4 id=0x176
(00.353424) 293788: Error (criu/cr-restore.c:1510): 293789 exited, status=1
(00.503871) mnt: Switching to new ns to clean ghosts
(00.504138) Error (criu/cr-restore.c:2547): Restoring FAILED.
(00.504632) Error (criu/cgroup.c:1969): cg: cgroupd: recv req error: No such file or directory
restore.log.2
(00.326949) pie: 87: 87: Restored
(00.347002) 293789: Error (criu/files-reg.c:2199): File home/bz2520/output/Fungen_xQTL.ENSG00000197360.cis_results_db.export.rds has bad size 6729962 (expect 2637824)
(00.348244) 293789: Error (criu/files.c:1213): Unable to open fd=4 id=0x176
(00.349106) 111: Create fd for 59
(00.349117) 111: Send fd 59 to /crtools-fd-293808-0
(00.349140) 111: Send fd 59 to /crtools-fd-293809-0
(00.349146) 293808: Receive fd for 1
(00.349162) 293808: Further fle=0x7f6608bf50c0, pid=293808
(00.349172) 293808: Receive fd for 2
(00.349175) 293808: Receive fd for 2
(00.349181) 101: Create fd for 56
(00.349187) 101: Send fd 56 to /crtools-fd-293828-0
(00.349202) 101: Send fd 56 to /crtools-fd-293829-0
(00.349213) 293828: Receive fd for 1
(00.349229) 293828: Further fle=0x7f6608bf6140, pid=293828
(00.349233) 293828: Receive fd for 2
(00.349236) 293828: Receive fd for 2
(00.349392) 98: Create fd for 50
(00.349406) 98: Send fd 50 to /crtools-fd-293848-0
(00.349428) 98: Send fd 50 to /crtools-fd-293849-0
(00.349433) 293848: Receive fd for 1
(00.349445) 293848: Further fle=0x7f6608bf3040, pid=293848
(00.349448) 293848: Receive fd for 2
(00.349451) 293848: Receive fd for 2
(00.351266) 293788: Error (criu/cr-restore.c:1510): 293789 exited, status=1
(00.597639) mnt: Switching to new ns to clean ghosts
(00.597841) Error (criu/cr-restore.c:2547): Restoring FAILED.
(00.598356) Error (criu/cgroup.c:1969): cg: cgroupd: recv req error: No such file or directory
Another job, by Bo: https://23.22.157.8/#/opcenter/jobs/oqqdxvst691r8pib6ivjh No restore logs were found, likely because no snapshot was found on our last float?
2024-03-14T02:26:23.754: Failed to wait container started on i-08663d0a6d07c164c
2024-03-14T02:26:23.754: Job status changed: [Executing -> Floating]
2024-03-14T02:26:23.754: Ready to recover workload of i-08663d0a6d07c164c
2024-03-14T02:26:23.828: Instance i-08663d0a6d07c164c is down
2024-03-14T02:26:23.897: Attempt to find snapshot for job to recover
2024-03-14T02:26:23.898: No snapshots found for job to recover
Another job, https://23.22.157.8/#/opcenter/jobs/kxnqpyow6x0kr0k9jp06e
job.events:
2024-03-14T18:14:21.29: Job is submitted to instance i-044f2ad36594ecce4 successfully
2024-03-14T18:14:21.294: Job status changed: [Floating -> Executing]
2024-03-14T18:14:21.295: Job floated to instance i-044f2ad36594ecce4 (8 CPU/64 GB) (Spot)
2024-03-14T18:14:21.329: Requeue job kxnqpyow6x0kr0k9jp06e
2024-03-14T18:14:21.364: Found previous container eb62f75a5c2c with checkpoint, ready to resume
2024-03-14T18:14:30.850: Requeue job kxnqpyow6x0kr0k9jp06e
2024-03-14T18:14:30.882: Found previous container eb62f75a5c2c with checkpoint, ready to resume
2024-03-14T18:14:31.932: Requeue job kxnqpyow6x0kr0k9jp06e
2024-03-14T18:14:31.964: Found previous container eb62f75a5c2c with checkpoint, ready to resume
2024-03-14T18:14:41.842: Failed to poll job status from i-044f2ad36594ecce4, error Failed to restore: exit status 125 (code: 9011)
2024-03-14T18:14:41.936: Job status changed: [Executing -> FailToComplete]. Mark job done.
2024-03-14T18:14:41.936: Job failed. Failed to restore: exit status 125 (code: 9011)
restore.log.0
(01.838251) pie: 96: Task 23 exited, status= 1
(01.838310) pie: 95: Task 96 stopped with 19
(02.956999) mnt: Switching to new ns to clean ghosts
(02.957224) Error (criu/cr-restore.c:2547): Restoring FAILED.
(02.957810) Error (criu/cgroup.c:1969): cg: cgroupd: recv req error: File exists
restore.log.1
(00.454682) 1402: Parsed 7effdd5f5000-7effdd6c2000 vma
(00.454684) 1402: Parsed 7effdd5f5000-7effdd6c4000 vma
(00.454686) 1402: Parsed 7effdd5f5000-7effdd6cb000 vma
(00.454687) 1402: Parsed 7effdd5f5000-7effdd6de000 vma
(00.454689) 1402: Parsed 7effdd5f5000-7effdd6e3000 vma
(00.454691) 1402: Parsed 7effdd5f5000-7effdd6e4000 vma
(00.508229) mnt: Switching to new ns to clean ghosts
(00.508395) Error (criu/cr-restore.c:2547): Restoring FAILED.
(00.508886) Error (criu/cgroup.c:1969): cg: cgroupd: recv req error: No such file or directory
restore.log.2
(00.485779) pie: 1402: mmap(0x7ffc9ed79000 -> 0x7ffc9ed7d000, 0x3 0x32 -1)
(00.485782) pie: 1402: mmap(0x7ffc9ed7d000 -> 0x7ffc9ed7f000, 0x7 0x32 -1)
(00.485785) pie: 1402: Preadv 0x55bb4e46a000:4096... (87 iovs)
(00.499350) pie: 96: Task 23 exited, status= 1
(00.499433) pie: 95: Task 96 stopped with 19
(00.536417) mnt: Switching to new ns to clean ghosts
(00.536667) Error (criu/cr-restore.c:2547): Restoring FAILED.
(00.537180) Error (criu/cgroup.c:1969): cg: cgroupd: recv req error: No such file or directory
I resubmitted the kxnqpyow6x0kr0k9jp06e job here and it was successful: https://23.22.157.8/#/opcenter/jobs/s1pdgzlr8d1paqulligdf
[UPDATE] Currently still being investigated. This error code 9011 appears for several jobs, but rerunning some of the jobs does not reproduce the error. The issue is suspected to be a syncing issue with s3fs.
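If the s3fs-sync suspicion is right, the failure condition could in principle be checked before a restore is retried: compare each checkpointed file's recorded size against what the mount currently reports. A minimal sketch (not part of float; the expected-size map is an assumption about what checkpoint metadata would provide):

```python
import os
from typing import Dict, List

def stale_files(expected: Dict[str, int]) -> List[str]:
    """Return the paths whose current on-disk size differs from the size
    recorded at checkpoint time (or which are missing entirely).

    A non-empty result mirrors the condition CRIU reports as
    "File ... has bad size N (expect M)".
    """
    stale = []
    for path, size in expected.items():
        try:
            actual = os.stat(path).st_size
        except FileNotFoundError:
            stale.append(path)
            continue
        if actual != size:
            stale.append(path)
    return stale
```

Such a guard would distinguish "the filesystem has not synced yet" (retrying later could succeed) from a genuinely corrupted checkpoint, which always fails.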
[UPDATE] No fix yet, as this is still under investigation, but a fix is expected in 2.5.3.
[UPDATE]
All of these jobs failed with a Failed to Restore error of Bad File Size, which is due to an s3fs bug. These jobs were run on the 2.4.1 second private build, which carried the sync error. We were able to match this error to a known error from a previous case. Version 2.5+ has an updated s3fs, and we have not seen this issue since the upgrade.
@gaow I suggest we close this issue for now, but we will keep monitoring jobs to see if this pops up again. There are other code 9011 errors we have seen, and we will be filing a ticket/GitHub issue soon, but those do not reflect the same restore error shown in the jobs above. Please let me know if you have any more questions or concerns.
Okay, thanks @Ashley-Tung, let's do what you suggest.
@Ashley-Tung here is another error:
https://23.22.157.8/#/log?jobId=dkffnydjt265buoyonqci&logName=job.events