tlc-pack / tlcpack

https://tlcpack.ai/
Apache License 2.0

Nightly Linux package in https://tlcpack.ai/wheels points to invalid URL #54

Closed leandron closed 2 years ago

leandron commented 3 years ago

There was a problem with last night's Linux package sync. I am trying to understand what happened, so that we can make the scripts more robust.

  1. It seems the "Wheel-Manylinux-Nightly" workflow failed due to a GitHub upload timeout: https://github.com/tlc-pack/tlcpack/runs/2506719665?check_suite_focus=true#step:6:94. Before it failed, some packages related to "Build (tlcpack-nightly, none, tlcpack/package-cpu:v0.3)" had already been uploaded using a newer tag.

  2. So when "Prune-Nightly" kicked in, it deleted the previously generated packages, replaced them with the new version, and committed the result to the website: https://github.com/tlc-pack/tlc-pack.github.io/commit/808752e25f1e9a531d64510475c592dbaf4f6ce3#diff-2c60e44ee4bb531bb2a6175f4719186539796232913a75def24c302f310fa569

At this point we have some nightly packages that are older than others.

The main issue is that, in parallel to all this, something went wrong while the Python 3.6 CPU package tlcpack_nightly-0.8.dev959%2Bg26a5e299b-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl was being uploaded, and it got corrupted somewhere along the way. When you click on the "tlcpack_nightly-0.8.dev959%2Bg26a5e299b-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl" link on https://tlcpack.ai/wheels, it shows an error message instead of the usual GitHub 404 page:

<Error>
<Code>NoSuchKey</Code>
<Message>The specified key does not exist.</Message>
<Key>278489414/b0303080-ad4c-11eb-836c-79cb7dd07be3</Key>
<RequestId>6W80Z01V55Y1WYH6</RequestId>
<HostId>
F030SQ2/SBIeuSZYPxEN79QQpek8zTumFq0qPbRtc7VYLh7uu0KOBHPjdyQD07zdcqpwBexIRP8=
</HostId>
</Error>

One thing we could do to avoid this specific issue in the future is to run a quick health check on the URLs we are about to publish to https://tlcpack.ai/wheels. This would live in wheel/wheel_prune_and_sync.py: https://github.com/tlc-pack/tlcpack/blob/f5d0a703d56c13216894a3d4e2d9adb071e60e09/wheel/wheel_prune_and_sync.py#L76

Using last night's case, it would be something as simple as a requests.get() call:

>>> import requests
>>> 
>>> r = requests.get("https://github.com/tlc-pack/tlcpack/releases/download/v0.7.dev1/tlcpack_nightly-0.8.dev959%2Bg26a5e299b-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl")
>>> r.status_code
200
>>> s = requests.get("https://github.com/tlc-pack/tlcpack/releases/download/v0.7.dev1/tlcpack_nightly-0.8.dev959%2Bg26a5e299b-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl")
>>> s.status_code
404
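
The check above could be wrapped into a small helper that filters a list of wheel URLs before they are written to the index page. This is only a sketch, not the code in wheel_prune_and_sync.py: the names `get_status` and `split_by_health` are hypothetical, and it uses the standard library's `urllib` instead of `requests` so that it has no third-party dependency. It issues a GET and follows redirects, like the `requests.get()` calls above, but does not read the response body.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Optional, Tuple


def get_status(url: str, timeout: float = 30.0) -> Optional[int]:
    """Return the final HTTP status of a GET on `url`, or None on a network error."""
    try:
        # urlopen follows redirects; only the status line and headers are read here.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is what we care about.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failures, refused connections, timeouts, and similar.
        return None


def split_by_health(
    urls: Iterable[str],
    status_of: Callable[[str], Optional[int]] = get_status,
) -> Tuple[List[str], List[str]]:
    """Split `urls` into (healthy, broken), where healthy means HTTP 200."""
    healthy: List[str] = []
    broken: List[str] = []
    for url in urls:
        if status_of(url) == 200:
            healthy.append(url)
        else:
            broken.append(url)
    return healthy, broken
```

With something like this, the sync step could publish only the `healthy` list and log or fail on anything in `broken`; which of the two is preferable is a design decision for the script. The `status_of` parameter exists so the logic can be exercised without network access.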
leandron commented 3 years ago

cc @hogepodge @tqchen

tqchen commented 3 years ago

@leandron good findings, seems should be an improvement of prune_and_sync, can you send a PR?

tqchen commented 3 years ago

We should also add a timeout retry (with some backoff) to the upload script, so that it tries to re-upload the wheel if the first attempt fails.
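
A retry with exponential backoff might look roughly like the sketch below. It is an illustration under assumptions, not the upload script's actual code: `retry_with_backoff` is a hypothetical helper, `action` stands in for whatever call performs the upload, and the attempt count and delays are placeholder values.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds or `max_attempts` is reached.

    The wait doubles after each failure: base_delay, 2 * base_delay, and so on.
    The last failure is re-raised so the caller (and the CI job) still sees it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    # Only reachable if max_attempts < 1, which is a caller error.
    raise ValueError("max_attempts must be at least 1")
```

In the upload script this would wrap the single upload call, for example `retry_with_backoff(lambda: upload(path))`, where `upload` is a placeholder for the existing function. Narrowing `retry_on` to the timeout exception the upload actually raises would avoid retrying on errors that cannot succeed.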

leandron commented 3 years ago

> @leandron good findings, seems should be an improvement of prune_and_sync, can you send a PR?

I will send a PR with this change.

tqchen commented 3 years ago

@leandron gentle ping on sending the PR to add retry