Open bhearsum opened 6 days ago
Alright, modulo the few missing groups noted above, I believe I have everything scraped and ready to upload. The upload is likely to take many days, so I'd appreciate a sanity check before doing so. The list of files that I will upload can be found on this spreadsheet. They will be uploaded, preserving that exact directory structure, to the root of this bucket (which is the same place I uploaded to the last time I scraped).
The total size of what I've scraped is 935G.
As with last time, I've scraped everything from tasks whose names are prefixed with one of the following strings: ("train-", "finetune-", "vocab", "export", "evaluate-", "quantize"), as well as the training configs. This results in the following possible subdirectories within individual experiment+group directories:
backward
evaluation
exported
quantized
student
student-finetuned
teacher0
teacher1
vocab
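For reference, the prefix filter described above can be sketched in a few lines of Python. The prefix tuple is taken from this comment; the function name and example task names are mine, not from the actual scraping tooling:

```python
# Task-name prefixes whose artifacts get scraped (from the list above).
SCRAPE_PREFIXES = ("train-", "finetune-", "vocab", "export", "evaluate-", "quantize")

def should_scrape(task_name: str) -> bool:
    """Return True if a task's artifacts should be scraped for upload."""
    # str.startswith accepts a tuple of prefixes and matches any of them.
    return task_name.startswith(SCRAPE_PREFIXES)
```

For example, `should_scrape("train-teacher-ru-en-1")` is true, while a `bicleaner-ai-*` task would be skipped.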
Within evaluation we have additional possible subdirectories:
backward
speed
student
student-finetuned
teacher-ensemble
teacher0
teacher1
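One cheap way to sanity-check the tree before uploading is to assert that every file path, relative to its experiment+group directory, starts with one of the expected subdirectory names. A minimal sketch, using the two lists above (the helper itself is hypothetical, not part of the scraping tooling):

```python
# Allowed subdirectories per experiment+group directory, from the lists above.
TOP_LEVEL = {
    "backward", "evaluation", "exported", "quantized",
    "student", "student-finetuned", "teacher0", "teacher1", "vocab",
}
# Allowed subdirectories within "evaluation".
EVALUATION = {
    "backward", "speed", "student", "student-finetuned",
    "teacher-ensemble", "teacher0", "teacher1",
}

def check_relative_path(path: str) -> bool:
    """Validate a file path relative to an experiment+group directory."""
    parts = path.split("/")
    if parts[0] not in TOP_LEVEL:
        return False
    if parts[0] == "evaluation" and len(parts) > 1:
        return parts[1] in EVALUATION
    return True
```

Running this over the full file list would flag anything that ended up in an unexpected location.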
@eu9ene - can you please sanity check the above before I start the upload? (I can also give you access to the instance the data is on if you'd like to poke around there instead.)
We need to scrape and upload the artifacts from the large scale training that we did in 2024 before they expire from Taskcluster. Over Matrix, @eu9ene told me that all task groups listed on the first two sheets of https://docs.google.com/spreadsheets/d/1EzcB-BSfC-_U_lg4NOfaTtBasOGJ9LxkURoqOcHajnM/edit?gid=305193406#gid=305193406 should be processed. That full list is attached: big-scrape-groups.txt.
Of the list above, the following task groups had no tasks that we scrape artifacts from (they were almost entirely dataset/clean/bicleaner/split/merge tasks; some had train tasks scheduled, but never ran):

The following groups did not exist at all:
I'm still in the process of scraping and organizing these artifacts for upload; once I have all the files locally I'll dump the directory tree into a spreadsheet for review before I upload.
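Assuming the scraped files land in a local directory, the dump for the review spreadsheet could be generated with something like the following. The function name and column headers are mine; this is a sketch, not the actual tooling:

```python
import csv
import os

def dump_tree(root: str, out_csv: str) -> None:
    """Write one CSV row per file (relative path, size in bytes) for review."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "size_bytes"])
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                writer.writerow([os.path.relpath(full, root), os.path.getsize(full)])
```

Note that the output CSV should live outside `root`, or it will show up in its own listing.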