mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

We should audit our storage used in a full pipeline run #857

Closed: gregtatum closed this issue 4 days ago

gregtatum commented 1 week ago

For one of our big spring-2024 runs that went end-to-end, we should write a script that gets all of the tasks from the Taskcluster API, iterates over them, fetches the artifacts for each task, and computes their total size to see where the storage is going.
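A minimal sketch of the aggregation step such a script could use. The Taskcluster calls themselves are deliberately left out: `list_artifacts` is a hypothetical callable that would wrap whatever API requests return each artifact's name and size in bytes, and how the sizes are obtained is an assumption, not something confirmed in this issue.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical shape: (artifact name, size in bytes).
Artifact = Tuple[str, int]


def audit_storage(
    task_ids: Iterable[str],
    list_artifacts: Callable[[str], Iterable[Artifact]],
) -> Tuple[Dict[str, int], int]:
    """Sum artifact sizes per task and across all tasks.

    `list_artifacts` is an assumed helper, injected so that the
    aggregation can be tested without any network access.
    """
    per_task: Dict[str, int] = {}
    for task_id in task_ids:
        # Tasks without artifacts still show up, with a size of 0.
        per_task[task_id] = sum(size for _name, size in list_artifacts(task_id))
    return per_task, sum(per_task.values())


def largest_tasks(per_task: Dict[str, int], count: int = 10) -> list:
    """Return the `count` biggest tasks as (task_id, bytes), largest first."""
    return sorted(per_task.items(), key=lambda item: item[1], reverse=True)[:count]
```

Sorting the per-task totals with `largest_tasks` gives a quick view of which steps dominate the storage before charting anything.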

gregtatum commented 4 days ago

From what I've looked at, medium-resource languages use 300-500 GB. Using the public pricing at https://cloud.google.com/storage/pricing#north-america:

$0.023 per GB per month for 12 months is about $83-$138.
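The arithmetic behind that range, as a small sketch. It assumes the stored size stays constant for the full 12 months and that only the per-GB storage rate quoted above applies (no operation or network charges); the function name is just for illustration.

```python
PRICE_PER_GB_MONTH = 0.023  # USD, the rate quoted in this comment
MONTHS = 12


def storage_cost_usd(size_gb: float,
                     price: float = PRICE_PER_GB_MONTH,
                     months: int = MONTHS) -> float:
    """Cost of keeping `size_gb` stored for `months` months at a flat rate."""
    return size_gb * price * months


if __name__ == "__main__":
    for size_gb in (300, 500):
        print(f"{size_gb} GB: ${storage_cost_usd(size_gb):.2f}")
```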

gregtatum commented 4 days ago

Most of the size comes from the copies of the dataset in the pipeline.

en-cs: DtSyAeaVRoGNZDnUKscGWw

pie chart of en-cs costs

en-fi: bNBrAkLqQpCpuxfMe3I-mw

pie chart of en-fi costs