mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0
154 stars 33 forks source link

[taskcluster:error] Error uploading artifact: S3 returned status code 400 which could be an intermittent issue #806

Closed eu9ene closed 1 month ago

eu9ene commented 2 months ago

The task has finished but Taskcluster couldn't upload the artifacts.

[taskcluster:error] Error uploading artifact: S3 returned status code 400 which could be an intermittent issue - see https://bugzilla.mozilla.org/show_bug.cgi?id=1394557

It might be fixed with migration to the new CPU pool.

https://firefox-ci-tc.services.mozilla.com/tasks/MHMMjOrnQ0e__tn8zEypAQ/runs/1/logs/public/logs/live.log

eu9ene commented 1 month ago

One more https://firefox-ci-tc.services.mozilla.com/tasks/Vl2IihjYTp6GRHZ31eQ7qw/runs/0/logs/public/logs/live.log

eu9ene commented 1 month ago

Two more: https://firefox-ci-tc.services.mozilla.com/tasks/ZBUKo9_KTEqcQarIGHlALg/runs/1/logs/public/logs/live.log https://firefox-ci-tc.services.mozilla.com/tasks/TZamYV-bQQaDC7tq8wkbkQ/runs/1/logs/public/logs/live.log

@bhearsum please take a look. It seems it's blocking proceeding with several remaining models. I think it doesn't go away after restart.

eu9ene commented 1 month ago

I see we upload tmp directory which shouldn't be uploaded. Can it be the reason? I see you attempted to fix it here https://github.com/mozilla/firefox-translations-training/pull/755. I can give it another try in release branch if this is the reason of failure.

bhearsum commented 1 month ago

It might be fixed with migration to the new CPU pool.

Notably, this is happening on the new CPU workers.

I see you attempted to fix it here https://github.com/mozilla/firefox-translations-training/pull/755. I can give it another try in release branch if this is the reason of failure.

I don't think I was encountering this error when I opened that - I was just trying to avoid uploading what I thought were unused files, so this pull request isn't likely to be relevant.

The relevant bits of the log are:

[taskcluster 2024-09-06T15:59:01.289Z] === Task Finished ===
[taskcluster 2024-09-06T15:59:01.289Z] Task Duration: 61h9m58.772398789s
[taskcluster 2024-09-06T16:01:07.700Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Vl2IihjYTp6GRHZ31eQ7qw/0/on/us-central1-b/6602075429580469503/until/1725639667.664
[taskcluster 2024-09-06T16:18:07.742Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Vl2IihjYTp6GRHZ31eQ7qw/0/on/us-central1-b/6602075429580469503/until/1725640687.705
[taskcluster 2024-09-06T16:35:07.777Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Vl2IihjYTp6GRHZ31eQ7qw/0/on/us-central1-b/6602075429580469503/until/1725641707.743
[taskcluster 2024-09-06T16:52:07.824Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Vl2IihjYTp6GRHZ31eQ7qw/0/on/us-central1-b/6602075429580469503/until/1725642727.782
[taskcluster 2024-09-06T16:53:35.651Z] Uploading artifact public/build/tmp/aln.rev.ad from file /task_172541804246164/artifact0/tmp/aln.rev.ad with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.656Z] Uploading artifact public/build/tmp/aln.fwd from file /task_172541804246164/artifact0/tmp/aln.fwd with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.658Z] Uploading artifact public/build/tmp/aln.rev from file /task_172541804246164/artifact0/tmp/aln.rev with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.659Z] Uploading artifact public/build/tmp/aln.rev.af from file /task_172541804246164/artifact0/tmp/aln.rev.af with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.659Z] Uploading artifact public/build/tmp/aln.rev.ab from file /task_172541804246164/artifact0/tmp/aln.rev.ab with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.660Z] Uploading artifact public/build/tmp/aln from file /task_172541804246164/artifact0/tmp/aln with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.660Z] Uploading artifact public/build/tmp/aln.rev.ac from file /task_172541804246164/artifact0/tmp/aln.rev.ac with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.662Z] Uploading artifact public/build/tmp/aln.fwd.ab from file /task_172541804246164/artifact0/tmp/aln.fwd.ab with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.662Z] Uploading artifact public/build/tmp/aln.fwd.aa from file /task_172541804246164/artifact0/tmp/aln.fwd.aa with content encoding "gzip", mime type "audio/x-pn-audibleaudio" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.664Z] Uploading artifact public/build/tmp/aln.fwd.ac from file /task_172541804246164/artifact0/tmp/aln.fwd.ac with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.665Z] Uploading artifact public/build/tmp/aln.rev.aa from file /task_172541804246164/artifact0/tmp/aln.rev.aa with content encoding "gzip", mime type "audio/x-pn-audibleaudio" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.667Z] Uploading artifact public/build/corpus.aln.zst from file /task_172541804246164/artifact0/corpus.aln.zst with content encoding "identity", mime type "application/zstd" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.678Z] Uploading artifact public/build/tmp/aln.fwd.ae from file /task_172541804246164/artifact0/tmp/aln.fwd.ae with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.679Z] Uploading artifact public/build/tmp/aln.rev.ae from file /task_172541804246164/artifact0/tmp/aln.rev.ae with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.681Z] Uploading artifact public/build/tmp/aln.fwd.ad from file /task_172541804246164/artifact0/tmp/aln.fwd.ad with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T16:53:35.693Z] Uploading artifact public/build/tmp/aln.fwd.af from file /task_172541804246164/artifact0/tmp/aln.fwd.af with content encoding "gzip", mime type "application/octet-stream" and expiry 2024-12-03T00:21:16.511Z
[taskcluster 2024-09-06T17:09:07.863Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Vl2IihjYTp6GRHZ31eQ7qw/0/on/us-central1-b/6602075429580469503/until/1725643747.827
[taskcluster 2024-09-06T17:26:07.907Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Vl2IihjYTp6GRHZ31eQ7qw/0/on/us-central1-b/6602075429580469503/until/1725644767.869
[taskcluster 2024-09-06T17:43:07.979Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Vl2IihjYTp6GRHZ31eQ7qw/0/on/us-central1-b/6602075429580469503/until/1725645787.924
[taskcluster:error] Error uploading artifact: S3 returned status code 400 which could be an intermittent issue - see https://bugzilla.mozilla.org/show_bug.cgi?id=1394557
[taskcluster 2024-09-06T18:00:08.022Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Vl2IihjYTp6GRHZ31eQ7qw/0/on/us-central1-b/6602075429580469503/until/1725646807.991
[taskcluster:error] Error uploading artifact: S3 returned status code 400 which could be an intermittent issue - see https://bugzilla.mozilla.org/show_bug.cgi?id=1394557
[taskcluster:error] Error uploading artifact: S3 returned status code 400 which could be an intermittent issue - see https://bugzilla.mozilla.org/show_bug.cgi?id=1394557
[taskcluster 2024-09-06T18:01:53.449Z] [mounts] Preserving cache: Moving "/task_172541804246164/cache0" to "caches/Pb5xwl8FSnumOKwVAsX1bA"
[taskcluster 2024-09-06T18:01:53.529Z] [mounts] Denying task_172541804246164 access to 'caches/Pb5xwl8FSnumOKwVAsX1bA'
[taskcluster 2024-09-06T18:01:53.625Z] Uploading link artifact public/logs/live.log to artifact public/logs/live_backing.log with expiry 2024-12-03T00:21:16.511Z
[taskcluster:error] S3 returned status code 400 which could be an intermittent issue - see https://bugzilla.mozilla.org/show_bug.cgi?id=1394557

A couple of things are noteworthy here:

My stab in the dark guess here is that something is going wrong with parallel uploads - perhaps we're getting rate limited by GCS or something of the sort. I think we'll need to log or otherwise find the full response we're getting to make this tractable. @matt-boris - I found worker logs in GCP for this, but they don't appear to log the full response. I didn't see anything in Sentry (but I also may not have access to the right project). Do you know of another place we could possibly get this? I would also appreciate you taking a look over the logs I extract, which I can send privately.

bhearsum commented 1 month ago

@matt-boris and I were discussing this a bit. We don't yet have a handle on the root cause, but one thing we did notice is that the alignments files are totally uncompressed. Compressing them (and then decompressing in downstream tasks, of course) might be a way to sidestep this. @eu9ene - are there reasons we avoided compressing these files? Obviously we'd be starting off some time + cpu cycles for artifact size. Based on what they look like though, I suspect that even the worst zstd compression would make a significant difference here.

eu9ene commented 1 month ago

I removed temporary files in #841. There's only one output file now. Probably one of the reasons why it's not compressed is because the align.py script is being used as an intermediate step in generate-shortlist.sh. I restarted 3 failed tasks, let's see if it works now.

matt-boris commented 1 month ago

In the worker logs that @bhearsum provided me, I see an artifact upload attempt for the aln.fwd file, for example, that has content length 8663973832, which is a little over 8.5GB. This is even after generic worker gzips the artifact for upload. Is the uncompressed artifact expected to be even bigger than that?

eu9ene commented 1 month ago

Actually I see now that the main artifact is compressed: public/build/corpus.aln.zst. Everything else is in tmp directory which will be removed in future runs.

is a little over 8.5GB

It's hard to say how big the final artifact will be as it depend on the trained language and amount of used data. I think compressed it will be in line with our regular sizes for dataset related artifacts.

petemoore commented 1 month ago

Looking at worker logs I see:

start-worker[920]: 2024/09/06 18:01:12 PUT /firefoxcitc-public-artifacts/Vl2IihjYTp6GRHZ31eQ7qw/0/public/build/tmp/aln?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=***********************************%2F20240906%2Fus%2Fs3%2Faws4_request&X-Amz-Date=20240906T165335Z&X-Amz-Expires=2710&X-Amz-Signature=294a8d92cf2be3a03ef7eb822c20a725ebad7f362768c8417874cb6f22810635&X-Amz-SignedHeaders=host&x-id=PutObject HTTP/1.1"
start-worker[920]: Host: storage.googleapis.com"
start-worker[920]: User-Agent: Go-http-client/1.1"
start-worker[920]: Content-Length: 8876827031"
start-worker[920]: Content-Encoding: gzip"
start-worker[920]: Content-Type: application/octet-stream"
start-worker[920]: Accept-Encoding: gzip"
start-worker[920]: #015"
start-worker[920]: 2024/09/06 18:01:35 14 put requests issued to https://storage.googleapis.com/firefoxcitc-public-artifacts/Vl2IihjYTp6GRHZ31eQ7qw/0/public/build/tmp/aln.rev?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=***************************%2F20240906%2Fus%2Fs3%2Faws4_request&X-Amz-Date=20240906T165335Z&X-Amz-Expires=2710&X-Amz-Signature=99d1a366d581000979d296a86aa9c91e92d108462eedbf196efac9542ab46a19&X-Amz-SignedHeaders=host&x-id=PutObject"
start-worker[920]: 2024/09/06 18:01:35 Response"
start-worker[920]: 2024/09/06 18:01:35 HTTP/2.0 400 Bad Request"
start-worker[920]: Content-Length: 220"
start-worker[920]: Content-Type: application/xml; charset=UTF-8"
start-worker[920]: Date: Fri, 06 Sep 2024 18:01:35 GMT"
start-worker[920]: Server: UploadServer"
start-worker[920]: Vary: Origin"
start-worker[920]: X-Guploader-Uploadid: AD-8ljvfChDPR86xb9C4DwXlYd7IfTPEq6wmBreOXpfo_pcztArNgLkdjcH9cVf5VAnBv57SrGI"
start-worker[920]: #015"
start-worker[920]: <?xml version='1.0' encoding='UTF-8'?><Error><Code>ExpiredToken</Code><Message>Invalid argument.</Message><Details>The provided token has expired. Request signature expired at: 2024-09-06T17:38:45+00:00</Details></Error>"
start-worker[920]: 2024/09/06 18:01:49 14 put requests issued to https://storage.googleapis.com/firefoxcitc-public-artifacts/Vl2IihjYTp6GRHZ31eQ7qw/0/public/build/tmp/aln?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=********************************%2F20240906%2Fus%2Fs3%2Faws4_request&X-Amz-Date=20240906T165335Z&X-Amz-Expires=2710&X-Amz-Signature=294a8d92cf2be3a03ef7eb822c20a725ebad7f362768c8417874cb6f22810635&X-Amz-SignedHeaders=host&x-id=PutObject"
start-worker[920]: 2024/09/06 18:01:49 Response"
start-worker[920]: 2024/09/06 18:01:49 HTTP/2.0 400 Bad Request"
start-worker[920]: Content-Length: 220"
start-worker[920]: Content-Type: application/xml; charset=UTF-8"
start-worker[920]: Date: Fri, 06 Sep 2024 18:01:49 GMT"
start-worker[920]: Server: UploadServer"
start-worker[920]: Vary: Origin"
start-worker[920]: X-Guploader-Uploadid: AD-8ljsNrOQwDdv1ZZB6czGW5xU2op4C_rm5573nNuCTttJc_RmDbFiyyD6DGeU0wwNqflKQOw4xiYR77g"
start-worker[920]: #015"
start-worker[920]: <?xml version='1.0' encoding='UTF-8'?><Error><Code>ExpiredToken</Code><Message>Invalid argument.</Message><Details>The provided token has expired. Request signature expired at: 2024-09-06T17:38:45+00:00</Details></Error>"
systemd[1]: Starting GCE Workload Certificate refresh..."
gce_workload_cert_refresh[63126]: 2024/09/06 18:01:50: Error getting config status, workload certificates may not be configured: failed to GET \"instance/gce-workload-certificates/config-status\" from MDS with error: error connecting to metadata server, status code: 404"
gce_workload_cert_refresh[63126]: 2024/09/06 18:01:50: Done"
systemd[1]: gce-workload-cert-refresh.service: Deactivated successfully."
systemd[1]: Finished GCE Workload Certificate refresh."

The relevant part seems to be:

<?xml version='1.0' encoding='UTF-8'?><Error><Code>ExpiredToken</Code><Message>Invalid argument.</Message><Details>The provided token has expired. Request signature expired at: 2024-09-06T17:38:45+00:00</Details></Error>"
start-worker[920]: 2024/09/06 18:01:49 14 put requests issued to https://storage.googleapis.com/firefoxcitc-public-artifacts/Vl2IihjYTp6GRHZ31eQ7qw/0/public/build/tmp/aln?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=*************************************************%2F20240906%2Fus%2Fs3%2Faws4_request&X-Amz-Date=20240906T165335Z&X-Amz-Expires=2710&X-Amz-Signature=294a8d92cf2be3a03ef7eb822c20a725ebad7f362768c8417874cb6f22810635&X-Amz-SignedHeaders=host&x-id=PutObject"
start-worker[920]: 2024/09/06 18:01:49 Response"
start-worker[920]: 2024/09/06 18:01:49 HTTP/2.0 400 Bad Request"

It looks to me like the signed url we get is only valid for a certain amount of time, and we are running out of time when trying to upload the file.

However, I do also see systemd[1]: Starting GCE Workload Certificate refresh in the middle of it all, so I wonder if some background cert refresh process is interfering with the upload.

petemoore commented 1 month ago

It looks to me like the credentials should be valid for 45 minutes. That seems like it should be plenty, so I'm curious why the token should be expired.

bhearsum commented 1 month ago

Actually I see now that the main artifact is compressed: public/build/corpus.aln.zst. Everything else is in tmp directory which will be removed in future runs.

is a little over 8.5GB

It's hard to say how big the final artifact will be as it depend on the trained language and amount of used data. I think compressed it will be in line with our regular sizes for dataset related artifacts.

To be clear: we're talking about the artifacts coming out of the alignments tasks, which AFAICT are uncompressed. (I don't see any indications that the inputs to it are uncompressed.)

bhearsum commented 1 month ago

It looks to me like the credentials should be valid for 45 minutes. That seems like it should be plenty, so I'm curious why the token should be expired.

Possibly of note: the first upload I see happens at Sep 6 16:53:49. The first ExpiredToken failure I see is at Sep 6 17:55:21. Given the size of the data, and the fact that generic-worker is (presumably) gzipping it first, maybe it's not shocking if an individual upload exceeds 45min? If the token is generated before the compression happens it seems like a plausible cause of this?

bhearsum commented 1 month ago

In a meeting today @eu9ene said that he's made some changes to remove the problematic files from the uploaded artifacts, which may fix this specific issue. We should have an answer on that in the next day or two.

@petemoore or @matt-boris - regardless of that, is it worthwhile tracking this general issue and/or preemptively adding some more logging?

matt-boris commented 1 month ago

Submitted https://github.com/taskcluster/taskcluster/issues/7260 for tracking. But please keep us posted on whether the changes you guys have made help circumvent this issue.

eu9ene commented 1 month ago

My fix helped. All 3 tasks have completed successfully. Here's how the artifacts look now https://firefox-ci-tc.services.mozilla.com/tasks/e4CbMFI-SWGQ29z7zyFLiQ. @bhearsum We can close this issue if we're tracking the underlying Tasckcluster issue separately.

bhearsum commented 1 month ago

I'm glad things are unblocked now! https://github.com/taskcluster/taskcluster/issues/7260 should be enough for tracking the underlying issue.