jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Use Electra for development instead of bert-128? #76

Closed jowagner closed 3 years ago

jowagner commented 3 years ago

Should development be carried out with bert-128 or electra (also with settings suitable for fast training)? Our bert-128 runs are for 48 hours (our cluster's time limit). Do we get good performance with electra when training for just 24 hours? How about 12 hours?

Simplest way to get checkpoints for 12 and 24 hours: Submit 2 jobs, one with a time limit of 24 hours and one with a time limit of 12 hours in the job file, and continue writing every checkpoint.
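For reference, a minimal sketch of what the two job files could look like (hypothetical job names, output paths and gres settings; only the --time line differs between the 12h and 24h versions):

```bash
#!/bin/bash
# Hypothetical job file for the 24h run; the 12h run would be identical
# apart from --time=12:00:00 and the job/output names.
#SBATCH --job-name=ga_bert_24h
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --output=ga_bert_24h.%j.out

# ... existing training command from the current job file, unchanged ...
```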

Ideally, update the code to write checkpoints only when close to the deadline, i.e. when there is a risk that there is not enough time to complete the next checkpoint (based on measurements of step duration and checkpoint-writing speed), and also to write a checkpoint every hour or so to be on the safe side.
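The checkpoint scheduling itself would have to live in the Python training loop, but as a shell-level safeguard the job file could ask SLURM for a warning signal shortly before the time limit and copy the newest already-written checkpoint to a safe location. A rough sketch, assuming sbatch's --signal option and hypothetical directory names:

```bash
#!/bin/bash
#SBATCH --time=48:00:00
# Ask SLURM to send SIGUSR1 to the batch shell ~10 minutes before the time limit.
#SBATCH --signal=B:USR1@600

MODELDIR=models/ga_electra/model        # hypothetical checkpoint directory
SAFEDIR=$HOME/checkpoint_backups        # hypothetical backup location

save_latest() {
    # copy the files belonging to the most recently written checkpoint
    latest=$(ls -t "$MODELDIR"/model.ckpt-*.index | head -n 1)
    step=${latest##*model.ckpt-}; step=${step%.index}
    cp "$MODELDIR"/model.ckpt-"$step".* "$MODELDIR"/checkpoint "$SAFEDIR"/
}
trap save_latest USR1

# run training in the background so the trap can fire while it is running;
# DATA_DIR and HPARAMS are placeholders for whatever the real job file sets
python3 run_pretraining.py --data-dir "$DATA_DIR" --model-name ga_electra --hparams "$HPARAMS" &
PID=$!
wait "$PID" || wait "$PID"   # wait again in case the first wait was interrupted by the signal
```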

How much time is spent on checkpoint writing? Does it slow down training considerably in the current setting?

Related: Training a final Electra model (issue #55)

jbrry commented 3 years ago

I am training an ELECTRA model for 500k steps. I decided to do this because it will give us the flexibility to compare to BERT at 500k steps, e.g. to get an idea of the upper bound.

This model will write checkpoints every 20k steps, so for the 12h/24h experiments I will check the model directory and use the checkpoints written closest to those time intervals. That way we can do three experiments in one, and it should leave us with enough checkpoints that we are not, say, forced to use a checkpoint written at only 8h for the 12h experiment. At the same time, writing every 20k steps should hopefully not slow down training too much, though it would still be good to get an idea of how long checkpoint writing takes and what the optimal checkpoint frequency is for shorter jobs.
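Picking "the checkpoint written closest to 12h" could be scripted roughly like this (hypothetical paths; assumes GNU date/stat and that file modification times reflect when each checkpoint was written):

```bash
#!/bin/bash
MODELDIR=models/ga_electra/model             # hypothetical
JOB_START="2021-05-24 15:59"                 # job start time, e.g. from scontrol show job
TARGET=$(( $(date -d "$JOB_START" +%s) + 12*3600 ))   # the 12h mark as a Unix timestamp

best_diff=-1
for f in "$MODELDIR"/model.ckpt-*.index; do
    mtime=$(stat -c %Y "$f")
    diff=$(( mtime > TARGET ? mtime - TARGET : TARGET - mtime ))
    if (( best_diff < 0 || diff < best_diff )); then
        best_diff=$diff
        closest=${f%.index}
    fi
done
echo "checkpoint closest to the 12h mark: $closest"
```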

jowagner commented 3 years ago

Previously you said you can only keep the top k models for a fairly small k due to disk space limitations. Are you now keeping all checkpoints, or do you log in after 12h and 24h to copy the current best models within the respective time limit to a location where they will not be auto-deleted?

jowagner commented 3 years ago

Also, we need to think about what performance degradation compared with bert-128 we can accept and what trade-offs we are willing to make to be able to run more development experiments before the deadline for training the final model(s).

jbrry commented 3 years ago

For electra, we can specify to keep all checkpoints.

For this first experiment with electra, I went with a relatively high number of checkpoints so that we would have more checkpoints to choose from around the 12h/24h time limits. The reason I did not submit two jobs yesterday with sbatch time limits of 12h and 48h was that I did not want to use up 2 GPU nodes when one could be used for the BERT filtering experiment. But thinking about it again, a 12h run with frequent checkpoints would not interfere with the schedule too much and would give us a good idea of what can be expected in that timeframe.

Also, Electra's hyperparameter overrides are passed to run_pretraining.py as a JSON dictionary, and my training script is written in bash, so I would have had to manually edit configure_pretraining.py to change things like the checkpoint frequency. Since then, I've found out about the jq tool, which can be used to build JSON from bash arguments (https://linuxhint.com/bash_jq_command/) and should be useful for future experiments.
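For example, a sketch of how jq could build the override dictionary from bash arguments; the hyperparameter names used here (num_train_steps, save_checkpoints_steps, keep_checkpoint_max) are assumed to match those defined in configure_pretraining.py, and DATA_DIR is a placeholder:

```bash
#!/bin/bash
SAVE_EVERY=${1:-20000}      # checkpoint frequency, passed as a bash argument
TRAIN_STEPS=${2:-500000}

# build the JSON dictionary with jq instead of editing configure_pretraining.py
HPARAMS=$(jq -n \
    --argjson save_every "$SAVE_EVERY" \
    --argjson steps "$TRAIN_STEPS" \
    '{save_checkpoints_steps: $save_every, num_train_steps: $steps, keep_checkpoint_max: 0}')

python3 run_pretraining.py \
    --data-dir "$DATA_DIR" \
    --model-name ga_electra \
    --hparams "$HPARAMS"
```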

jowagner commented 3 years ago

Ok. I see the electra job is now nearly 21h in. If you are not keeping all checkpoints (it is not clear to me from the above), can you log in now and make a copy of the checkpoint closest to 12h, and then again in 3 hours to get a 24h checkpoint, please?

jbrry commented 3 years ago

Sorry, I should have added that I am keeping all of the 20k-step checkpoints for this run.

Looking in the <modeldir>, I have the ~12h checkpoint ready:

  May 24 15:59  model.ckpt-0.data-00000-of-00001        # first (initialisation) checkpoint, step 0
  May 25 04:48  model.ckpt-100000.data-00000-of-00001   # ~12h 48min later, the checkpoint for step 100,000

I will log back in in a few hours to see what the closest checkpoint is to 24h.

jowagner commented 3 years ago

Can you start training dependency parsers with the 12h and 24h electra checkpoints today so that you don't run into a bottleneck when the bert-128 model using the same filter settings is ready?

I see the bert-128 jobs 12603[2-5], one of which ran over 12h, have been replaced with new jobs 12610[0-3]. Does this mean the bert-128 model will only be ready on Thursday? Was there a major error in jobs 12603[2-5]? One of the new jobs isn't running yet. If this is the bert-128 job needed for comparison with electra, please cancel one of the other bert-128 jobs so that this job starts. You can re-submit the cancelled job afterwards.

If so, and if bert-128 has to be used for development, we will not have enough time to finish the plan before the deadline.

My suggestion to fix this is to run the NCI and WordPiece experiments in parallel as a cross-product, i.e. add the following 3 runs:

  1. without NCI
  2. with WordPiece
  3. with WordPiece and without NCI.

The run with NCI and SentencePiece will have been done as part of the filtering experiments. Then, take the best setting of these 4 for the vocab size experiment to be run in the last 50 hours before the deadline. With the results, it is also possible to simulate what we would have gotten if we had made decisions step by step, ignoring run 2 or 3 depending on the outcome of run 1.

Since we have only 4 Quadro RTX 6000 GPUs on the cluster but over 20 RTX 2080 Ti, can you also try electra with half the current batch size on an rtx2080ti, say for 32 hours? 32h would allow us to fit 3 experiments into the 4.5 days from Thursday night to the deadline.
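A possible job file for such a run (the gres name is hypothetical and depends on the cluster configuration; the batch size 64 is only a placeholder for half of whatever train_batch_size the Quadro runs use):

```bash
#!/bin/bash
#SBATCH --job-name=ga_electra_rtx
#SBATCH --gres=gpu:rtx2080ti:1      # exact gres name depends on the cluster configuration
#SBATCH --time=32:00:00

# halve the batch size to fit into the 11 GB of a 2080 Ti (64 is a placeholder value)
HPARAMS=$(jq -n '{train_batch_size: 64}')
python3 run_pretraining.py --data-dir "$DATA_DIR" --model-name ga_electra_rtx --hparams "$HPARAMS"
```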

jbrry commented 3 years ago

> Can you start training dependency parsers with the 12h and 24h electra checkpoints today so that you don't run into a bottleneck when the bert-128 model using the same filter settings is ready?

Will do. Should be ready tonight.

> I see the bert-128 jobs 12603[2-5], one of which ran over 12h, are replaced with new jobs 12610[0-3] ...

Yes, what happened was that the bert_config.json file had a vocabulary size of 30,100 for the last WordPiece experiment, and I had to change it back to 30,101, which is what the wikibert-pipeline creates. I'm not sure what difference this makes, but I didn't want to rely on any results where the vocab size is off by one, so I had to re-run the BERT models, losing around 12h on some. Yes, the 128 models for BERT will be ready Thursday afternoon. I cancelled OF-B to let OF-BCL run first, as you suggested, and re-submitted OF-B.
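A quick sanity check that could catch this kind of off-by-one in future (hypothetical file names; assumes the WordPiece vocab file has one token per line):

```bash
# the vocab_size in bert_config.json should equal the number of lines in vocab.txt
jq .vocab_size bert_config.json
wc -l < vocab.txt
```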

Yes, that would roughly be the timeframe which would be insufficient.

> My suggestion to fix this is to run the NCI and WordPiece experiments in parallel as a cross-product, i.e. add the following 3 runs:

Good suggestions; I will launch those 3 models when the nodes free up on Thursday or if the rtx2080ti experiment goes well.

> can you also try electra with half the current batch size on a rtx2080ti [...]

Yes, I will try this.

jowagner commented 3 years ago

I see in scontrol show job 126027 that the electra job has no time limit other than the cluster's 48h limit. For the 4th bert run to start today, you will have to cancel the electra job after you have got the ~~48h~~ 24h checkpoint. I see a checkpoint was written about 22:58 into the job a few minutes ago. The next checkpoint can be expected at about 25:30 into the job, in about 80 minutes. I'd say take the 22:58 checkpoint (180k steps) and kill the job now. You don't want to risk that another user with higher priority grabs the GPU before you (all but 2 users have higher priority than you at the moment, see sshare -a | sort -n -k 5).

Edit: wrong number

jbrry commented 3 years ago

Good idea, it would have been another 2 hours until the next checkpoint was written, and the GPU could have been grabbed by someone else in the meantime. The 4th (OF-B) setting is now running.

jowagner commented 3 years ago

The model checkpoints at 12 and 24 hours give inconsistent parser performance, see https://github.com/jbrry/Irish-BERT/blob/master/experiments/2021-05-26-gaelectra_short.md . Not suitable for making development decisions on other parameters.

Future work: Find out why.