iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

`create_image.sh` fails when base image requires `google-guest-agent` upgrade #13921

Open pzread opened 1 year ago

pzread commented 1 year ago

`create_image.sh` fails to create a new image from the base image `ubuntu-2204-jammy-v20230114`. From the log, it looks like `apt-get upgrade` upgraded the `google-guest-agent` package, and that upgrade killed `google-startup-scripts.service` partway through, which interrupted the setup script.

google-startup-scripts.service logs:

Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 google_metadata_script_runner[1880]: startup-script: Preparing to unpack .../37-curl_7.81.0-1ubuntu1.10_amd>
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 google_metadata_script_runner[1880]: startup-script: Unpacking curl (7.81.0-1ubuntu1.10) over (7.81.0-1ubun>
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 google_metadata_script_runner[1880]: startup-script: Preparing to unpack .../38-libcurl4_7.81.0-1ubuntu1.10>
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 google_metadata_script_runner[1880]: startup-script: Unpacking libcurl4:amd64 (7.81.0-1ubuntu1.10) over (7.>
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 google_metadata_script_runner[1880]: startup-script: Preparing to unpack .../39-libcurl3-gnutls_7.81.0-1ubu>
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 google_metadata_script_runner[1880]: startup-script: Unpacking libcurl3-gnutls:amd64 (7.81.0-1ubuntu1.10) o>
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 google_metadata_script_runner[1880]: 2023/06/02 17:55:31 logging client: rpc error: code = Unauthenticated >
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 google_metadata_script_runner[1880]: startup-script: Preparing to unpack .../40-git-man_1%3a2.34.1-1ubuntu1>
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 google_metadata_script_runner[1880]: startup-script: Unpacking git-man (1:2.34.1-1ubuntu1.9) over (1:2.34.1>
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 google_metadata_script_runner[1880]: startup-script: Preparing to unpack .../41-git_1%3a2.34.1-1ubuntu1.9_a>
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 google_metadata_script_runner[1880]: startup-script: Unpacking git (1:2.34.1-1ubuntu1.9) over (1:2.34.1-1ub>
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 google_metadata_script_runner[1880]: startup-script: Preparing to unpack .../42-google-guest-agent_20220622>
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 systemd[1]: google-startup-scripts.service: Main process exited, code=killed, status=15/TERM
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 systemd[1]: google-startup-scripts.service: Failed with result 'signal'.
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 systemd[1]: google-startup-scripts.service: Unit process 1886 (startup-script) remains running after unit s>
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 systemd[1]: google-startup-scripts.service: Unit process 1891 (startup-script) remains running after unit s>
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 systemd[1]: google-startup-scripts.service: Unit process 1892 (tee) remains running after unit stopped.
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 systemd[1]: google-startup-scripts.service: Unit process 3269 (apt-get) remains running after unit stopped.
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 systemd[1]: google-startup-scripts.service: Unit process 5299 (dpkg) remains running after unit stopped.
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 systemd[1]: google-startup-scripts.service: Unit process 5300 (sh) remains running after unit stopped.
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 systemd[1]: google-startup-scripts.service: Unit process 5301 (sh) remains running after unit stopped.
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 systemd[1]: google-startup-scripts.service: Unit process 5302 (dpkg-status) remains running after unit stop>
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 systemd[1]: google-startup-scripts.service: Unit process 5948 (preinst) remains running after unit stopped.
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 systemd[1]: google-startup-scripts.service: Unit process 5953 (systemctl) remains running after unit stoppe>
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 systemd[1]: Stopped Google Compute Engine Startup Scripts.
Jun 02 17:55:31 github-runner-template-cpu-2023-06-02-1685728458 systemd[1]: google-startup-scripts.service: Consumed 22.378s CPU time.

pzread commented 1 year ago

Confirmed that it's related to `google-guest-agent`. I tried running `apt-get install google-guest-agent` on its own first, and it failed in the same way:

Jun 02 18:02:40 github-runner-template-cpu-2023-06-02-1685728901 google_metadata_script_runner[1953]: startup-script: + apt-get install google-guest-agent
Jun 02 18:02:41 github-runner-template-cpu-2023-06-02-1685728901 google_metadata_script_runner[1953]: 2023/06/02 18:02:41 logging client: rpc error: code = Unauthenticated >
Jun 02 18:02:47 github-runner-template-cpu-2023-06-02-1685728901 google_metadata_script_runner[1953]: startup-script: (Reading database ... 64301 files and directories curr>
Jun 02 18:02:47 github-runner-template-cpu-2023-06-02-1685728901 google_metadata_script_runner[1953]: startup-script: Preparing to unpack .../google-guest-agent_20220622.00>
Jun 02 18:02:47 github-runner-template-cpu-2023-06-02-1685728901 systemd[1]: google-startup-scripts.service: Main process exited, code=killed, status=15/TERM
Jun 02 18:02:47 github-runner-template-cpu-2023-06-02-1685728901 systemd[1]: google-startup-scripts.service: Failed with result 'signal'.

GMNGeoffrey commented 1 year ago

Seems like you updated the base image so it has the up-to-date guest agent: https://github.com/openxla/iree/pull/13918. I'm not sure how we avoid this in the future other than bumping the base image again. Some searching turns up others who have hit similar issues, but no resolutions. We should probably be updating the base image whenever we make new VM images anyway. For reproducibility of the existing image, the `image_setup.sh` script still works; you just can't run it as a startup script. One option would be for the script to try to invoke itself (with `disown`?) after doing an upgrade so that it can continue even if upgrading kills it. That's a bit tricky, because I don't think the script actually lives in any file when it's used as a startup script.
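A rough sketch of the self-invocation idea, assuming the remaining setup steps can first be written out to a real file and then started in the background. `CONTINUE_SCRIPT`, `CONTINUE_LOG`, `write_continuation`, and `launch_detached` are hypothetical names for illustration and are not part of `image_setup.sh`. Whether a detached child actually survives the guest agent upgrade depends on how the service stops its processes, which is not verified here.

```shell
#!/bin/bash
# Sketch only: CONTINUE_SCRIPT, CONTINUE_LOG and both function names are
# hypothetical and do not come from image_setup.sh.
CONTINUE_SCRIPT="${CONTINUE_SCRIPT:-/tmp/continue_setup.sh}"
CONTINUE_LOG="${CONTINUE_LOG:-/tmp/continue_setup.log}"

# Save the steps read from stdin to a real file, since the startup script
# itself may not exist on disk.
write_continuation() {
  cat > "${CONTINUE_SCRIPT}"
}

# Start the saved steps in the background, detached from this shell.
launch_detached() {
  nohup bash "${CONTINUE_SCRIPT}" >> "${CONTINUE_LOG}" 2>&1 < /dev/null &
  # disown is a bash builtin; ignore failure under other shells.
  disown "$!" 2> /dev/null || true
}

# Possible use inside a setup script (commented out so the sketch has no
# side effects when sourced):
#   write_continuation <<'EOF'
#   apt-get upgrade -y
#   # ...remaining setup steps...
#   EOF
#   launch_detached
```

Writing the continuation to a file sidesteps the question of where the startup script lives, at the cost of splitting the setup into two pieces.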

pzread commented 1 year ago

Maybe documenting this somewhere (a reminder to bump the base image) will be good enough. I spent a while trying to figure out what happened.

GMNGeoffrey commented 1 year ago

Yeah, a comment seems worth it at least. Another option would be to add an exit trap that's installed right before the `apt-get upgrade` command and removed right after. Then if the script gets killed in between, it can at least print a helpful message.
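The exit-trap idea could look roughly like the sketch below. The helper names `warn_if_interrupted` and `run_guarded` and the message text are hypothetical, not taken from the existing scripts. The trap is installed only while the guarded command runs, so a normal exit later in the script stays quiet.

```shell
#!/bin/bash
# Sketch only: both function names and the message text are hypothetical.

warn_if_interrupted() {
  echo "Setup script exited during the guarded step." >&2
  echo "If google-guest-agent was being upgraded, try bumping the base image." >&2
}

# Run a command with an EXIT trap installed only for its duration.
run_guarded() {
  trap warn_if_interrupted EXIT
  "$@"
  local status=$?
  # The command finished, so remove the trap again.
  trap - EXIT
  return "${status}"
}

# Possible use inside a setup script:
#   run_guarded apt-get upgrade -y
```

Whether the message actually appears depends on how the script is stopped. An EXIT trap cannot run if the shell receives SIGKILL, and the logs above show the service's main process being terminated while the `startup-script` processes kept running, so this would need to be tried on a real VM.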