mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

scrape and upload 2024 training artifacts #948

Open bhearsum opened 6 days ago

bhearsum commented 6 days ago

We need to scrape and upload the artifacts from the large scale training that we did in 2024 before they expire from Taskcluster. Over Matrix, @eu9ene told me that all task groups listed on the first two sheets of https://docs.google.com/spreadsheets/d/1EzcB-BSfC-_U_lg4NOfaTtBasOGJ9LxkURoqOcHajnM/edit?gid=305193406#gid=305193406 should be processed. That full list is attached: big-scrape-groups.txt.

Of the list above, the following task groups had no tasks that we scrape artifacts from (they were almost entirely dataset/clean/bicleaner/split/merge tasks; some had train tasks scheduled that never ran):

BxN54ej5Q8K4nBaBNdcZsQ
CJcaa0GLT9e2i5lNBwWazQ
D4JxfCWPTc22cU0du4rW_A
DbzZmDJZSLS9XNpmHX8OMA
E1yWYiDtRjeQs1khuAAZnw
FDIq7ZEmQB6BXWO2K9xrdw
FRubfBm4TLetQ4XlJIUISg
FZ-qmI7HSjyDeCOeJAWJBA
FnZjvwEvT9a0FTgJ_ll66Q
FxLCMgVyTjSuB839EZmTpA
GNZKnSbtQMiHczVfarqHwg
IwwoOph6RX-U1tAxA61l4Q
JQwJ5OITQmCNn0UAJ41T_w
KJ_cva4BSk623AV6wYIZlQ
KRXRJ_lTSWWrv0F20lORCQ
KhkyUfCIRD-ByQbxs310pA
KsSrCXPtRzCie4wkejInsA
Lhwmosd-R3aqMCt96ZugsQ
M8RvDoI7TnOuTk0kiFVIjg
O5nwoBdFSACkjNaOhhIwzw
OLD3_NcGRm-4RpQmXe0ngg
Ot4HVSSNSKqMVuVGthKsyg
Qg9PyeT9RRi_uv50g_f6sQ
QlPQlm85TAyEHL4qr_HXiA
RhuNiAW3SRqiBwoutaEGaQ
RkMIb_7XSEGHlNvNsdXmPQ
SsmDnqoBTyGStdOvvzK5Vg
T1RFo6nVQTy0iy1Bwdz7cQ
Tbkg0bxkROyaTd2C4tBpdg
Uv7EgA9SQdGT54nkWFwACQ
V-OmRM1yS_GwESTVDPl0VQ
V3x_-at2T5K2FU1ISqz1XA
VgR9RS46SIqkfem2mLGKcA
Vn1illbAREaz_zQy4VrbYw
WFoWgKmGRxa44hppCtvwmg
eZKkxqHISTCDwrsylAZZvA
ebpYrNxgQh-b5mRHbdi6bA

The following groups did not exist at all:

DXbS0zreSGSVYloAF8gwJg
EW7qV3U5SBSjegnTxGZHkw
K1iHndFUSxSEDRLg_H9l1A
Tkrf0fGBQEO6kH-gSKp5lg
aY25-4fXTcuJNuMcWXUYtQ
fYJkSp6IRYqnLvFOgwXPaA
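Both conditions above (a group with no scrapeable tasks, and a group that doesn't exist) can be detected from the Taskcluster Queue `listTaskGroup` response: a missing group returns a 404, and an existing one returns a `tasks` array we can filter by name. A minimal sketch of the filtering step, assuming the payload has already been fetched (the helper name here is hypothetical, not the actual scraping script):

```python
# Prefixes of task names we scrape artifacts from (taken from the comment below).
SCRAPE_PREFIXES = ("train-", "finetune-", "vocab", "export", "evaluate-", "quantize")

def scrapeable_tasks(list_task_group_payload: dict) -> list[str]:
    """Given a Queue listTaskGroup response, return the names of tasks whose
    artifacts should be scraped. An empty result means the group has nothing
    to upload (e.g. it only contains dataset/clean/bicleaner/split/merge tasks)."""
    names = [t["task"]["metadata"]["name"] for t in list_task_group_payload["tasks"]]
    return [n for n in names if n.startswith(SCRAPE_PREFIXES)]
```

For example, a payload whose tasks are named `train-teacher-en-lt-1` and `dataset-opus-en-lt` would yield only the former.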

I'm still in the process of scraping and organizing these artifacts for upload; once I have all the files locally I'll dump the directory tree into a spreadsheet for review before I upload.

bhearsum commented 5 days ago

Alright, modulo the few missing groups noted above, I believe I have everything scraped and ready to upload. The upload is likely to take many days, so I'd appreciate a sanity check before starting it. The list of files that I will upload can be found on this spreadsheet. They will be uploaded in that exact directory structure into the root of this bucket (which is the same place I uploaded to the last time I scraped).

The total size of what I've scraped is 935G.

As with last time, I've scraped everything from tasks whose names are prefixed with one of the following strings: ("train-", "finetune-", "vocab", "export", "evaluate-", "quantize"), as well as the training configs. This results in the following possible subdirectories in individual experiment+group directories:

backward
evaluation
exported
quantized
student
student-finetuned
teacher0
teacher1
vocab

Within evaluation we have additional possible subdirectories:

backward
speed
student
student-finetuned
teacher-ensemble
teacher0
teacher1

@eu9ene - can you please sanity check the above before I start the upload? (I can also give you access to the instance the data is on if you'd like to poke around there instead.)