mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

scrape and upload 2024 training artifacts #948

Open bhearsum opened 6 days ago

bhearsum commented 6 days ago

We need to scrape and upload the artifacts from the large scale training that we did in 2024 before they expire from Taskcluster. Over Matrix, @eu9ene told me that all task groups listed on the first two sheets of https://docs.google.com/spreadsheets/d/1EzcB-BSfC-_U_lg4NOfaTtBasOGJ9LxkURoqOcHajnM/edit?gid=305193406#gid=305193406 should be processed. That full list is attached: big-scrape-groups.txt.

Of the list above, the following task groups had no tasks that we scrape artifacts from (they were almost entirely dataset/clean/bicleaner/split/merge tasks; some had train tasks scheduled that never ran):

BxN54ej5Q8K4nBaBNdcZsQ
CJcaa0GLT9e2i5lNBwWazQ
D4JxfCWPTc22cU0du4rW_A
DbzZmDJZSLS9XNpmHX8OMA
E1yWYiDtRjeQs1khuAAZnw
FDIq7ZEmQB6BXWO2K9xrdw
FRubfBm4TLetQ4XlJIUISg
FZ-qmI7HSjyDeCOeJAWJBA
FnZjvwEvT9a0FTgJ_ll66Q
FxLCMgVyTjSuB839EZmTpA
GNZKnSbtQMiHczVfarqHwg
IwwoOph6RX-U1tAxA61l4Q
JQwJ5OITQmCNn0UAJ41T_w
KJ_cva4BSk623AV6wYIZlQ
KRXRJ_lTSWWrv0F20lORCQ
KhkyUfCIRD-ByQbxs310pA
KsSrCXPtRzCie4wkejInsA
Lhwmosd-R3aqMCt96ZugsQ
M8RvDoI7TnOuTk0kiFVIjg
O5nwoBdFSACkjNaOhhIwzw
OLD3_NcGRm-4RpQmXe0ngg
Ot4HVSSNSKqMVuVGthKsyg
Qg9PyeT9RRi_uv50g_f6sQ
QlPQlm85TAyEHL4qr_HXiA
RhuNiAW3SRqiBwoutaEGaQ
RkMIb_7XSEGHlNvNsdXmPQ
SsmDnqoBTyGStdOvvzK5Vg
T1RFo6nVQTy0iy1Bwdz7cQ
Tbkg0bxkROyaTd2C4tBpdg
Uv7EgA9SQdGT54nkWFwACQ
V-OmRM1yS_GwESTVDPl0VQ
V3x_-at2T5K2FU1ISqz1XA
VgR9RS46SIqkfem2mLGKcA
Vn1illbAREaz_zQy4VrbYw
WFoWgKmGRxa44hppCtvwmg
eZKkxqHISTCDwrsylAZZvA
ebpYrNxgQh-b5mRHbdi6bA

The following groups did not exist at all:

DXbS0zreSGSVYloAF8gwJg
EW7qV3U5SBSjegnTxGZHkw
K1iHndFUSxSEDRLg_H9l1A
Tkrf0fGBQEO6kH-gSKp5lg
aY25-4fXTcuJNuMcWXUYtQ
fYJkSp6IRYqnLvFOgwXPaA
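Both conditions above (a group with no scrapeable tasks, and a group that doesn't exist) can be detected from the Taskcluster Queue `listTaskGroup` response: a missing group returns a 404, and an existing one returns a `tasks` array we can filter by name. A minimal sketch of the filtering step, assuming the payload has already been fetched (the helper name here is hypothetical, not the actual scraping script):

```python
# Prefixes of task names we scrape artifacts from (taken from the comment below).
SCRAPE_PREFIXES = ("train-", "finetune-", "vocab", "export", "evaluate-", "quantize")

def scrapeable_tasks(list_task_group_payload: dict) -> list[str]:
    """Given a Queue listTaskGroup response, return the names of tasks whose
    artifacts should be scraped. An empty result means the group has nothing
    to upload (e.g. it only contains dataset/clean/bicleaner/split/merge tasks)."""
    names = [t["task"]["metadata"]["name"] for t in list_task_group_payload["tasks"]]
    return [n for n in names if n.startswith(SCRAPE_PREFIXES)]
```

For example, a payload whose tasks are named `train-teacher-en-lt-1` and `dataset-opus-en-lt` would yield only the former.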

I'm still in the process of scraping and organizing these artifacts for upload; once I have all the files locally I'll dump the directory tree into a spreadsheet for review before I upload.

bhearsum commented 5 days ago

Alright, modulo the few missing groups noted above, I believe I have everything scraped and ready to upload. The upload is likely to take many days, so I'd appreciate a sanity check before starting it. The list of files that I will upload can be found on this spreadsheet. They will be uploaded in that exact directory structure into the root of this bucket (which is the same place I uploaded to the last time I scraped).

The total size of what I've scraped is 935G.

As with last time, I've scraped everything from tasks whose names are prefixed with one of the following strings: ("train-", "finetune-", "vocab", "export", "evaluate-", "quantize"), as well as the training configs. This results in the following possible subdirectories in individual experiment+group directories:

backward
evaluation
exported
quantized
student
student-finetuned
teacher0
teacher1
vocab

Within evaluation we have additional possible subdirectories:

backward
speed
student
student-finetuned
teacher-ensemble
teacher0
teacher1

@eu9ene - can you please sanity check the above before I start the upload? (I can also give you access to the instance the data is on if you'd like to poke around there instead.)