nshepperd / gpt-2

Code for the paper "Language Models are Unsupervised Multitask Learners"

Finetuning on the Full Model - OOM 1558M #37

Open vince-lynch opened 4 years ago

vince-lynch commented 4 years ago

Hello,

I've been using the only-train-x-layers workaround for over a month on the 774M model (even on a Colab K80 with 12GB of RAM), and it's been great!
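
For reference, a minimal sketch of that layer-freezing idea, assuming the TF1 variable scoping used by the OpenAI/nshepperd code (transformer blocks under model/h0 … model/h35 for 774M); the actual flag or patch in train.py may differ:

```python
import tensorflow as tf

# Hypothetical helper: keep gradients only for the last `train_last_n` blocks
# (plus the final layer norm) and freeze the embeddings and earlier blocks.
# Assumes GPT-2's TF1 scopes: model/wte, model/wpe, model/h{i}/..., model/ln_f.
def pick_trainable(all_vars, total_layers=36, train_last_n=2):
    keep = []
    for v in all_vars:
        if '/ln_f' in v.name:
            keep.append(v)
            continue
        for i in range(total_layers - train_last_n, total_layers):
            if '/h%d/' % i in v.name:
                keep.append(v)
                break
    return keep

all_vars = [v for v in tf.trainable_variables() if 'model' in v.name]
train_vars = pick_trainable(all_vars)
opt = tf.train.AdamOptimizer(learning_rate=1e-5)
train_op = opt.minimize(loss, var_list=train_vars)  # `loss` comes from the GPT-2 graph
```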

The problem is, I'm now looking into getting 1558M to work on a Google GCE instance with 30GB RAM, 8 vCPUs, and 1x NVIDIA V100 GPU (16GB).

Reaching OOM at ~15GB.

Now I'm looking for advice on this, having tried the full CTRL model on 2x P100s and hit the same OOM, as well as on the V100.

I haven't tried 2x V100s yet; I will.

But has anyone else, for example you @nshepperd, gotten the full model to fine-tune?

kinoc commented 4 years ago

+1, similar here: I've been running 774M with SGD on a Titan RTX (24GB) for a few weeks.

I'm able to run 1558M on the Titan. But when it comes to training, everything looks promising (it even gets the validation loss estimate), then it OOMs at 16+GB (the card has 24GB). I'm trying to NVLink the Titan RTX and a 2080 Ti together, but progress on that front is slow. Wondering if Unified Memory is worth looking at.
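
For context on why SGD helps here: Adam keeps two extra float32 slot variables (m and v) per parameter, so for 1558M (~1.5B parameters) the optimizer state alone adds roughly 12GB on top of the weights and activations. A hedged sketch of the swap in a TF1-style training script (nshepperd's train.py may already expose something like this via a flag):

```python
import tensorflow as tf

# Plain SGD keeps no per-parameter optimizer state, while Adam allocates two
# extra float32 tensors per parameter (~12 GB for 1.5B params). Illustrative
# swap only; learning rates are placeholders.
use_sgd = True
if use_sgd:
    opt = tf.train.GradientDescentOptimizer(learning_rate=1e-4)
else:
    opt = tf.train.AdamOptimizer(learning_rate=1e-5)

train_op = opt.minimize(loss)  # `loss` comes from the GPT-2 graph
```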

GNU-Linuxer commented 4 years ago

I also use GCE to fine-tune OpenAI's 1558M model. I knew the GPU would not work, so I decided to use the CPU. I observed that the training process uses about 86GB of RAM (with 64 vCPUs), and running the model takes about 7 vCPUs and 25GB of RAM. Hope this helps! I'm using the TensorFlow wheel from https://github.com/lakshayg/tensorflow-build

[Screenshot: resource usage during training, "Screen Shot 2019-11-10 at 10.56.03 AM"]

on an Ubuntu 18.04 VM on GCE (selecting the Intel Skylake CPU platform)
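
If anyone wants to reproduce the CPU-only setup without relying on automatic device placement, here is a small sketch of forcing TensorFlow 1.x onto the CPU; the CUDA_VISIBLE_DEVICES trick is generic, and the poster above additionally uses the CPU-optimized wheel from lakshayg/tensorflow-build:

```python
import os

# Hide any GPUs before TensorFlow initializes so the whole 1558M graph is
# placed on the CPU and can use host RAM (~86GB observed above).
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import tensorflow as tf

# Belt and braces: also tell the session not to create any GPU devices.
config = tf.ConfigProto(device_count={'GPU': 0})
sess = tf.Session(config=config)
```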

jkraybill commented 4 years ago

@GNU-Linuxer - are you using an n1-highcpu-96 instance? How long did fine-tuning take on CPU?

GNU-Linuxer commented 4 years ago

I'm not using that instance; I only use 64 vCPUs (you probably don't need that many, about 45-50 vCPUs should be sufficient, but you do need 88GB of RAM). It takes about 8 hours to train on 1MB of source text before the validation loss starts to increase.

GNU-Linuxer commented 4 years ago

@jkraybill @vince-lynch I've tried again using a Google Compute Engine VM to train on the 1558M model. I used just 14 vCPUs and 85GB of RAM (this configuration seems the most cost-efficient), since the training process is unable to make use of many vCPUs.
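
One plausible reason the run cannot use many vCPUs is how TF1 splits work between its intra-op and inter-op thread pools; the large matmuls mostly live in the intra-op pool, and throughput tends to stop scaling long before 64 threads. A sketch of pinning the pools to match the 14-vCPU configuration (thread counts are illustrative, not tuned values from this thread):

```python
import tensorflow as tf

# intra_op: threads used inside a single op (e.g. one large matmul)
# inter_op: independent ops that may run concurrently
config = tf.ConfigProto(
    intra_op_parallelism_threads=14,
    inter_op_parallelism_threads=2)
sess = tf.Session(config=config)
```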

jkraybill commented 4 years ago

@GNU-Linuxer thanks for the follow-up! I'm going to try the same setup.

jkraybill commented 4 years ago

Just a follow-up: I was able to successfully do full-depth training on the full-sized model, using Adam, on an Amazon r4.4xlarge EC2 instance (16 vCPUs, 122GB RAM). CPU training is painfully slow; on my data set it was taking approximately 30-40 seconds per step (batch size 1).

tomerwul commented 4 years ago

@jkraybill - What batch size did you use for training the full-sized model? Did you use a GPU, or CPU only?

Thanks!

jkraybill commented 4 years ago

@tomerwul per my last post, I was using a batch size of 1 on an r4.4xlarge instance, which is CPU only, so I can't speak to where that tops out. It's obviously a lot slower, but it does work, and it's nice to be able to do small tests on hardware that I can get in and out of for a few cents per hour.

kinoc commented 4 years ago

For anyone interested in this topic, it may be worth looking at a gist I wrote for training with the IBM Large Model Support package: Fine-tune GPT-2 1558M on a Titan RTX using IBM TensorFlow Large Model Support v2.

It takes a few minutes on startup while it rearranges the TF graph and inserts its swaps. The big issue is finding the right settings for the balance between GPU and CPU memory swaps. Some settings gave OOM but eventually you can probably find a setting that works by adjusting the swapout_threshold to get rid of OOM then adjust the rest for speed. I got my Titan to run with the default settings and a batch size of 1 at about 3.5 sec per update.