jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Computation budget plot #79

Open jowagner opened 3 years ago

jowagner commented 3 years ago

To create plots of LAS over computation budget following Dodge et al. (2019), we need 4-tuples

(task_ID, duration, LAS, dependencies)

where duration is in GPU seconds, a LAS of 0 is used for tasks that do not produce a LAS, such as BERT pre-training, and dependencies is a list of task_IDs that must be completed before the current task can be started. This should cover all our development experiments, including runs that did not result in a UD v2.8 LAS, so that the total computation budget reflects how much compute we used.

Dependencies that only arise from the hyper-parameter search should not be listed as dependencies. For the purpose of the budget plots, we pretend that we performed random sampling of hyper-parameter settings.

Example:

('filter1', 170450, 0.0, [])
('filter2', 172082, 0.0, [])
('filter3', 171423, 0.0, [])
('filter4', 170847, 0.0, [])
('filter1-LAS1', 7801, 77.4, ['filter1'])
('filter1-LAS2', 7655, 77.7, ['filter1'])
('filter1-LAS3', 7718, 77.2, ['filter1'])
('filter1-LAS4', 7837, 77.3, ['filter1'])
('filter1-LAS5', 7736, 77.4, ['filter1'])
('filter2-LAS1', 7725, 78.1, ['filter2'])
...
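As a rough illustration (not code in this repo), the tuples could be turned into a budget-vs-best-LAS curve along the lines of Dodge et al. (2019) with something like the sketch below. The task list, the function name, the trial count and the random-ordering strategy are placeholders, not an existing script:

import random

tasks = [
    # (task_ID, duration_in_gpu_seconds, LAS, dependencies)
    ('filter1',      170450,  0.0, []),
    ('filter1-LAS1',   7801, 77.4, ['filter1']),
    ('filter1-LAS2',   7655, 77.7, ['filter1']),
    # ... remaining tuples as above
]

by_id = {t[0]: t for t in tasks}

def budget_curves(tasks, trials=1000, seed=123):
    """Best-LAS-so-far over cumulative GPU seconds, averaged over random
    orderings of the LAS-producing tasks.  Each dependency's duration is
    charged the first time one of its dependents is run."""
    rng = random.Random(seed)
    las_tasks = [t for t in tasks if t[2] > 0.0]
    curves = []
    for _ in range(trials):
        order = las_tasks[:]
        rng.shuffle(order)
        paid = set()            # dependencies already charged in this ordering
        spent, best = 0.0, 0.0
        points = []
        for task_id, duration, las, deps in order:
            for dep in deps:
                if dep not in paid:
                    spent += by_id[dep][1]
                    paid.add(dep)
            spent += duration
            best = max(best, las)
            points.append((spent, best))
        curves.append(points)
    return curves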

An alternative format that works for some types of dependencies is to provide a list of packages, where each package is a (possibly nested) list of (duration, LAS) pairs and each list is marked as to whether its elements must be run in the given order or can be shuffled.
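One possible encoding of such a package, just for illustration (the 'ordered'/'shuffled' tags and all numbers are made up):

package = (
    'ordered', [                  # the pre-training step must come first
        (170450, 0.0),            # e.g. a filtering / pre-training run
        ('shuffled', [            # the fine-tuning runs can go in any order
            (7801, 77.4),
            (7655, 77.7),
            (7718, 77.2),
        ]),
    ],
)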

If available, we should also record what type of GPU was used in each step. This could be useful to (a) correct for differences in GPU performance, e.g. report Quadro RTX 6000 GPU time and estimate RTX 6000 runtime for all steps that we ran on other GPUs, and (b) create a more accurate plot over electricity costs or CO₂ footprint.
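For example, the normalisation in (a) could look like the sketch below; the speed factors are placeholders that would have to be measured, not real benchmark numbers:

# Normalise measured durations to estimated Quadro RTX 6000 seconds.
SPEED_RELATIVE_TO_RTX6000 = {
    'Quadro RTX 6000': 1.0,
    'GTX 1080 Ti':     0.6,   # assumed fraction of RTX 6000 speed (placeholder)
}

def to_rtx6000_seconds(duration, gpu_type):
    """Convert GPU seconds measured on gpu_type into estimated RTX 6000 seconds."""
    return duration * SPEED_RELATIVE_TO_RTX6000[gpu_type]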

It should be noted that treating the set of hyper-parameter settings explored in hill-climbing as if it were randomly sampled from all possible settings is likely to overestimate performance for smaller compute budgets: hill-climbing concentrates on promising settings, so the explored set contains more well-performing settings than a random sample would.