minimaxir / gpt-2-simple

Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

OOM error with new 774M model when running in Colab #108

Open ghost opened 5 years ago

ghost commented 5 years ago

When running the sess command, I get an OOM error. Not sure if the new large model is too large for Colab?

    WARNING: Logging before flag parsing goes to stderr.
    W0820 16:58:18.137592 140704259733376 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/gpt_2_simple/src/sample.py:17: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use tf.where in 2.0, which has the same broadcast rule as np.where

    ResourceExhaustedError                    Traceback (most recent call last)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
       1355     try:
    -> 1356       return fn(*args)
       1357     except errors.OpError as e:

    7 frames
    ResourceExhaustedError: OOM when allocating tensor with shape[50257,1280] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
    [[{{node model/wte/Initializer/random_normal/RandomStandardNormal}}]]
    Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

    During handling of the above exception, another exception occurred:

    ResourceExhaustedError                    Traceback (most recent call last)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
       1368         pass
       1369       message = error_interpolation.interpolate(message, self._graph)
    -> 1370       raise type(e)(node_def, op, message)
       1371
       1372   def _extend_graph(self):

    ResourceExhaustedError: OOM when allocating tensor with shape[50257,1280] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
    [[node model/wte/Initializer/random_normal/RandomStandardNormal (defined at /usr/local/lib/python3.6/dist-packages/gpt_2_simple/src/model.py:185) ]]
    Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

    Original stack trace for 'model/wte/Initializer/random_normal/RandomStandardNormal':
      File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main "__main__", mod_spec)
      File "/usr/lib/python3.6/runpy.py", line 85, in _run_code exec(code, run_globals)
      File "/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py", line 16, in app.launch_new_instance()
      File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 658, in launch_instance app.start()
      File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py", line 477, in start ioloop.IOLoop.instance().start()
      File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 888, in start handler_func(fd_obj, events)
      File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper return fn(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events self._handle_recv()
      File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv self._run_callback(callback, msg)
      File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback callback(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper return fn(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher return self.dispatch_shell(stream, msg)
      File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell handler(stream, idents, msg)
      File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request user_expressions, allow_stdin)
      File "/usr/local/lib/python3.6/dist-packages/ipykernel/ipkernel.py", line 196, in do_execute res = shell.run_cell(code, store_history=store_history, silent=silent)
      File "/usr/local/lib/python3.6/dist-packages/ipykernel/zmqshell.py", line 533, in run_cell return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell interactivity=interactivity, compiler=compiler, result=result)
      File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2828, in run_ast_nodes if self.run_code(code, result):
      File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code exec(code_obj, self.user_global_ns, self.user_ns)
      File "", line 12, in save_every=500
      File "/usr/local/lib/python3.6/dist-packages/gpt_2_simple/gpt_2.py", line 170, in finetune output = model.model(hparams=hparams, X=context)
      File "/usr/local/lib/python3.6/dist-packages/gpt_2_simple/src/model.py", line 185, in model initializer=tf.compat.v1.random_normal_initializer(stddev=0.02))
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 1496, in get_variable aggregation=aggregation)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 1239, in get_variable aggregation=aggregation)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 562, in get_variable aggregation=aggregation)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 514, in _true_getter aggregation=aggregation)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 929, in _get_single_variable aggregation=aggregation)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 259, in __call__ return cls._variable_v1_call(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 220, in _variable_v1_call shape=shape)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 198, in previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 2511, in default_variable_creator shape=shape)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 263, in __call__ return super(VariableMetaclass, cls).__call__(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 1568, in __init__ shape=shape)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py", line 1698, in _init_from_args initial_value(), name="initial_value", dtype=dtype)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 901, in partition_info=partition_info)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/init_ops.py", line 323, in __call__ shape, self.mean, self.stddev, dtype, seed=self.seed)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/random_ops.py", line 79, in random_normal shape_tensor, dtype, seed=seed1, seed2=seed2)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_random_ops.py", line 728, in random_standard_normal seed2=seed2, name=name)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper op_def=op_def)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func return func(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op op_def=op_def)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__ self._traceback = tf_stack.extract_stack()

minimaxir commented 5 years ago

It is likely not possible to finetune 774M. Discussion here: https://news.ycombinator.com/item?id=20749037

I need to run tests to determine how well it works; if it's not possible, I'll add a bespoke assert to prevent finetuning on it.
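For illustration, such a guard could be as small as the sketch below. This is hypothetical; the exact message and placement inside finetune() are assumptions, not the check that actually shipped.

    # Hypothetical guard near the top of finetune() -- wording and placement
    # are assumptions for illustration only.
    assert model_name != '774M', \
        'Currently, finetuning the 774M model does not fit in typical GPU memory.'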

sdan commented 5 years ago

Would using fp16 help?

saippuakauppias commented 5 years ago

Maybe it could be run on a Compute Engine GPU instance: https://cloud.google.com/compute/all-pricing#gpus ? What would the minimum configuration be? Or are there places where it would cost less? Is it possible to use this repo: https://github.com/minimaxir/gpt-2-cloud-run ?

minimaxir commented 5 years ago

There is no magic switch for FP16 in TensorFlow [yet], and the 16 GB VRAM offered by cloud GPUs still isn't enough.

If there are any workarounds, I would be interested in them.

saippuakauppias commented 5 years ago

Does TensorFlow need to be recompiled to use FP16? I have some experience with this and can explain how to do it without any special difficulty.

woctezuma commented 5 years ago

For reference, this was @AdamDanielKing's answer on HackerNews:

TalkToTransformer.com uses preemptible P4 GPUs on Google Kubernetes Engine. Changing the number of workers and automatically restarting them when they're preempted is easy with Kubernetes.

To provide outputs incrementally rather than waiting for the entire sequence to be generated, I open a websocket to a worker and have it do a few tokens at a time, sending the output back as it goes. GPT-2 tokens can end partway through a multi-byte character, so to make this work you need to send the raw UTF-8 bytes to the browser and then have it concatenate them before decoding the string.

While my workers can batch requests from multiple users, the modest increase in performance is probably not worth the complexity in most cases.

I won't say that I understand everything though.

AdamDanielKing commented 5 years ago

@woctezuma That comment only explains how to deploy a trained model, which requires much less GPU memory than training because the gradients aren't stored. @minimaxir is probably right that, for training, you won't fit a full training batch for 774M in the K80 GPU that Colab gives you.

@minimaxir You can work around this by training with a smaller batch size but accumulating gradients over several iterations before applying an update to the weights. That achieves a larger effective batch size than can fit in the GPU. This page might be helpful.
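For readers unfamiliar with the trick, a minimal TF1-style sketch of gradient accumulation looks roughly like this. Names are illustrative (loss is assumed to be the model's training loss tensor); this is not the exact gpt-2-simple implementation.

    import tensorflow as tf

    accumulate_steps = 4  # number of small batches per weight update
    opt = tf.compat.v1.train.AdamOptimizer(learning_rate=1e-4)
    train_vars = tf.compat.v1.trainable_variables()

    # One non-trainable accumulator per weight tensor.
    accum = [tf.Variable(tf.zeros_like(v), trainable=False) for v in train_vars]
    grads = tf.gradients(loss, train_vars)       # gradients of one small batch

    zero_ops = [a.assign(tf.zeros_like(a)) for a in accum]
    accum_ops = [a.assign_add(g) for a, g in zip(accum, grads)]
    apply_op = opt.apply_gradients(
        [(a / accumulate_steps, v) for a, v in zip(accum, train_vars)])

    # Per update: run zero_ops once, run accum_ops on `accumulate_steps`
    # different small batches, then run apply_op once.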

minimaxir commented 5 years ago

The workflow for 345M finetuning uses a batch size of 1 with accumulated gradients. That is the workflow 774M should be using now, apparently without success.

AdamDanielKing commented 5 years ago

Ah, I see. That's surprising.

I know OpenAI uses gradient checkpointing for some of their other work, so in that case I'd bet they use it in their training code for GPT-2 as well. See https://github.com/cybertronai/gradient-checkpointing. Instead of storing all the layer activations at once, this stores a subset of them and then recomputes them during the backward pass to significantly reduce memory usage. In my experience it's pretty easy to get that library working, and if you do then it should be effective.
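As a rough sketch of how that library gets wired in (gpt-2-simple already vendors it as src/memory_saving_gradients.py; the variable names below paraphrase the finetune() code rather than copy it):

    from gpt_2_simple.src import memory_saving_gradients

    # 1. While building the model, mark tensors to keep with
    #        tf.compat.v1.add_to_collection('checkpoints', tensor)
    #    Everything between two checkpoints is freed after the forward pass
    #    and recomputed during the backward pass.

    # 2. Swap tf.gradients for the drop-in replacement:
    opt_grads = memory_saving_gradients.gradients(loss, train_vars)
    opt_apply = opt.apply_gradients(list(zip(opt_grads, train_vars)))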

Another workaround is to only train with sequences significantly shorter than the maximum of 1024 tokens.
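As a hypothetical illustration of that second workaround (data_sampler and batch_size are stand-ins for whatever the training loop actually uses):

    # Attention memory grows roughly quadratically with sequence length, so
    # shrinking the training window shrinks every activation tensor.
    SEQ_LEN = 256  # instead of the 1024-token maximum
    batch = [data_sampler.sample(SEQ_LEN) for _ in range(batch_size)]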

saippuakauppias commented 5 years ago

Maybe we should step back and try to implement this using a TPU or multiple GPUs?

In your opinion, which option would be preferable so that we don't return to this issue in the future (when the 1558M-parameter model is released)? (I assume Colab may not be enough for that, of course.)

sdan commented 5 years ago

@AdamDanielKing, This repo took a good chunk of nshepperd's codebase, as @minimaxir has said in the past. This means this repo automatically does gradient checkpointing for any model other than 117M (see gpt_2.py).

@saippuakauppias I already tried it on all GPU/RAM/CPU configurations on GCP. Only after 10-15 failed attempts did I realize it was a real issue, and I was pointed to HN to see @minimaxir and @AdamDanielKing's discussion.

And @saippuakauppias, someone has already tried TPUs: [Colab using TPU](https://colab.research.google.com/github/shawwn/gpt-2/blob/tpu/Training_GPT_2_Using_TPUs.ipynb). So far I haven't gotten the best results, although I still have to do some data preprocessing to see what the exact issue is. I'm also getting a pretty high loss on it.

@saippuakauppias can you help me recompile TF to use FP16?

AdamDanielKing commented 5 years ago

@dantuluri Thanks for pointing this out. It looks like the code only uses 1 gradient checkpoint at layer 10:

https://github.com/minimaxir/gpt-2-simple/blob/4c36ea73164cdf0f15b39f02dbefa8eef96f671e/gpt_2_simple/src/model.py#L195-L196

The code is using memory_saving_gradients in 'collection' mode, so it doesn't automatically add any other checkpoints. 774M has 36 layers, so this means the activations of at least 26 layers will be in memory at the same time. I'd suggest adding many more checkpoints or trying the other modes.

saippuakauppias commented 5 years ago

@dantuluri, I misunderstood the discussion on Hacker News (recompilation is not needed). FP16 is already available in TensorFlow version 1.14.

Can anyone check if this helps for a Colab or for Cloud Run?

PS: if you do end up needing to recompile TF, here is the easiest way: https://github.com/yaroslavvb/tensorflow-community-wheels/pull/121/files

minimaxir commented 5 years ago

If FP16 is indeed in TensorFlow 1.14 via pip, I'll give it a test.

sdan commented 5 years ago

Looking into the code it seems @minimaxir used https://github.com/cybertronai/gradient-checkpointing for gradient checkpointing.

I tried the different checkpoint variations.

Here are the definitions of each mode, for reference (a usage sketch follows the list):

'collection' (default): This checkpoints all tensors returned by tf.get_collection('checkpoints'). You then need to make sure you add tensors to this collection using tf.add_to_collection('checkpoints', tensor) when you define your model.
'memory' : This uses a heuristic to automatically select a set of nodes to checkpoint which achieves our desired O(sqrt(n)) memory usage. The heuristic works by automatically identifying articulation points in the graph, i.e. tensors which split the graph into two disconnected parts when removed, and then checkpointing a suitable number of these tensors. This currently works well for many, but not all, models.
'speed' : This option tries to maximize running speed by checkpointing the outputs of all ops that are typically expensive to compute, namely convolutions and matrix multiplies.
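A usage sketch for switching modes, assuming the same memory_saving_gradients module that gpt-2-simple vendors (the checkpoints argument accepts 'collection', 'memory', or 'speed'):

    from gpt_2_simple.src import memory_saving_gradients

    # Default ('collection'): only tensors explicitly added to the
    # 'checkpoints' collection are kept.
    grads = memory_saving_gradients.gradients(loss, train_vars)

    # Heuristic modes: let the library pick checkpoints automatically.
    grads_mem = memory_saving_gradients.gradients(loss, train_vars,
                                                  checkpoints='memory')
    grads_speed = memory_saving_gradients.gradients(loss, train_vars,
                                                    checkpoints='speed')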

I think FP16 is probably the way to go if it works as @minimaxir said.

sdan commented 5 years ago

@saippuakauppias do you know how to use FP16? I'm not too familiar with how to start using it.

saippuakauppias commented 5 years ago

@dantuluri No, but now I'm trying to figure out how to use it.

An example of how to enable FP16: https://colab.research.google.com/github/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/docs/amp/notebook_v1.14/auto_mixed_precision_demo_cifar10.ipynb

You just need to wrap the optimizer in tensorflow.compat.v1.train.experimental.enable_mixed_precision_graph_rewrite and that's it!

Documentation: https://www.tensorflow.org/api_docs/python/tf/train/experimental/enable_mixed_precision_graph_rewrite https://gist.github.com/tlkh/fa20c5bf3c8b48def4501cccff8b3559
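A minimal sketch of that wrapping, assuming TensorFlow 1.14+ and an optimizer variable like the one gpt-2-simple builds (names are illustrative; loss is the training loss tensor):

    import tensorflow as tf

    opt = tf.compat.v1.train.AdamOptimizer(learning_rate=1e-4)
    # Rewrites the graph so eligible ops (matmuls, convs) run in float16 with
    # automatic loss scaling; real savings need a GPU with Tensor Cores
    # (compute capability >= 7.0).
    opt = tf.compat.v1.train.experimental.enable_mixed_precision_graph_rewrite(opt)
    train_op = opt.minimize(loss)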

AdamDanielKing commented 5 years ago

There should definitely still be more gradient checkpoints. I just ran some tests on a K80, setting accumulate_gradients to 1 and seeing how many samples fit in a batch without running out of memory.

| Model | Checkpointing just layer 10 | Checkpointing all layer outputs |
| --- | --- | --- |
| 345M | 1 sample fits | 8 samples fit |

The only code change is removing the if layer == 10: line. This makes the large internal activations of each layer (attention layer, MLP layer) get recomputed, with only the skip connections between layers being stored. Still, the optimal strategy is likely to be a bit different from this.

Unfortunately I'm still struggling to fit 1 sample of 774M into memory mainly because the attn function inside each layer requires a lot of memory.

Edit: By the way, adding more checkpoints doesn't have a performance hit, because its only effect is that the checkpointed tensors are kept rather than deallocated and recomputed. So you just want to choose the checkpoints in a way that minimizes the peak memory usage.
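For concreteness, the loop in src/model.py looks roughly like the following (paraphrased, not an exact copy), and the change being discussed is dropping the condition so every block's output is checkpointed:

    # Paraphrased from gpt_2_simple/src/model.py -- not an exact copy.
    for layer, past in enumerate(pasts):
        h, present = block(h, 'h%d' % layer, past=past, hparams=hparams)
        # was: if layer == 10: tf.compat.v1.add_to_collection('checkpoints', h)
        tf.compat.v1.add_to_collection('checkpoints', h)  # checkpoint every layer
        presents.append(present)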

sdan commented 5 years ago

How does nshepperd's fork deal with this? It seems like he puts gradient checkpointing at all layers (if args.accumulate_gradients > 1:), but I'm not sure.

AdamDanielKing commented 5 years ago

@dantuluri He also has the if layer == 10: line, so only one checkpoint. When memory_saving_gradients is in 'collection' mode (the default) it only uses checkpoints that you explicitly add.

sdan commented 5 years ago

Interesting. Going through the speed, memory, and collection modes with if layer == 10 removed. Will update once done (running on a 16 GB VRAM V100).

sdan commented 5 years ago

Update: the speed and memory options don't work; only collection (the default) works. All you need to do is delete the if layer == 10 line (I've tried if layer == 5 and == 2 and those still didn't work), as @AdamDanielKing said.

Currently running on V100. Will try on lower VRAM GPUs.

Edit: Can't vouch for the quality of training; I just saw it training and assumed it works. Edit: Running on a P100 works fine.

AdamDanielKing commented 5 years ago

@dantuluri Perfect! This is with 774M, right? How many samples fit if you set accumulate_gradients to 1 and vary batch_size? Can you get more than one?

I think the boundary between not fitting a sample and fitting one is between 12 GB and 16 GB. I wasn't able to get one of the K80s that Google offers (12 GB) to work. So it seems we still can't train for free on Colab.

Edit: One place in particular that seemed helpful to add a checkpoint was at the model's output:

https://github.com/minimaxir/gpt-2-simple/blob/4c36ea73164cdf0f15b39f02dbefa8eef96f671e/gpt_2_simple/gpt_2.py#L170

I suggest experimenting with adding it

    output = model.model(hparams=hparams, X=context)
    tf.compat.v1.add_to_collection('checkpoints', output['logits'])

and seeing if that increases the number of samples you can fit on the GPU. Edit Sept 21: The line above had a bug but should work now. Still not certain that it lowers the overall memory usage but it's worth trying.

Memory peaks around there, and checkpointing seemed to bring the peak usage down while I was playing with the K80.

AdamDanielKing commented 5 years ago

Updated my code suggestion one last time. ^

sdan commented 5 years ago

At the moment it's training, with accumulate_gradients = 1 and batch_size = 1 as the defaults.

I think my input may be wrong, because it's structured like this:

    <|startoftext|>
    hello world
    more text more text more text
    more text more text more text
    <|endoftext|>
    <|startoftext|>
    more text more text more text
    more text more text more text
    <|endoftext|>

and so on. But I'm getting the start and end tags in my results... like this:

    something
    something
    something
    <|endoftext|>
    <|startoftext|>
    something
    something
    sometimes weird characters

Because I'm a bit more familiar with nshepperd's code, do you know where he did the checkpointing (layer == 10)? I got better results training 345M using his code.

Otherwise, any help on getting this code to work with my data would be much appreciated.

In regards to GPU memory usage: I'm using a P100, which has 16 GB of VRAM. When training, regardless of model (345M or 774M), it always maxes out at around 99% (except when generating samples, when it drops to around 50%).

In regards to loss: the loss is really low compared to 345M. I'm getting around 1.5 out of the gate, as opposed to high 2's with 345M.

Quality of results: for the short time I've been training it, it's not a whole lot better than 345M. This will hopefully change. And as said before, the <|endoftext|> <|startoftext|> tags are somewhat annoying when they show up in the middle of the results... not to mention: where can I delete ======== SAMPLE 1 ========? It's always showing up in all my samples. And when the program saves these samples, it doesn't save them in .txt files, just samples-100 with no extension. With nshepperd's code I could easily make these adjustments; I'm not sure where I can make them in this code.

In regards to your suggestion: I haven't tried output = model.model(hparams=hparams, X=context) / tf.compat.v1.add_to_collection('checkpoints', output) yet; will update when I get the data issues out of the way.
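On the start/end tags showing up in generated text: if I remember the gpt-2-simple API correctly, generate() accepts prefix, truncate, and include_prefix arguments that can strip them at generation time. Treat the exact parameter names as an assumption to verify against the README; sess is assumed to be an already-loaded TensorFlow session.

    import gpt_2_simple as gpt2

    # Start generation at the start token, cut each sample at the end token,
    # and drop the start token itself from the returned text.
    gpt2.generate(sess,
                  prefix='<|startoftext|>',
                  truncate='<|endoftext|>',
                  include_prefix=False)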

saippuakauppias commented 5 years ago

Maybe @nshepperd already tried to solve this problem too?

rfriel commented 5 years ago

I've been trying to finetune 774M using nshepperd's fork on a p3.2xlarge EDIT: p3.xlarge (12GB), and after trying the checkpointing suggestions here I was still running out of memory. But I got it to run (at least on batches of size 1, haven't tried larger) by changing the optimizer from Adam to vanilla SGD, which I assume has a smaller memory footprint because it lacks Adam's moving averages.

It hasn't run long enough for me to really assess whether vanilla SGD is good enough for the finetuning I want, though. I'll also probably have to play with the learning rate a bit relative to what I used with Adam.

AdamDanielKing commented 5 years ago

@rfriel You mention p3.2xlarge having a 12 GB GPU–is this a typo? It should have a 16 GB V100. I wasn't able to fit 774M into only 12 GB (a K80).

rfriel commented 5 years ago

@rfriel You mention p3.2xlarge having a 12 GB GPU–is this a typo? It should have a 16 GB V100. I wasn't able to fit 774M into only 12 GB (a K80).

Whoops! My typo was in the instance name -- I am using a p2.xlarge, which has a K80. It does fit when I use vanilla SGD (tf.train.GradientDescentOptimizer), and in fact I can fit a batch_size of 2 (haven't tried higher). I'm also using the checkpointing recommendations from this thread.

Haven't been able to get an adaptive optimizer like Adam to fit, even with a batch size of 1. I tried Adadelta too. EDIT: MomentumOptimizer fits (with use_nesterov=True although I imagine it works either way).
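For reference, the swaps being discussed look roughly like this in TF1 code (learning rates are placeholders, and the exact variable name differs per fork):

    # Plain SGD: no per-parameter state, so the smallest optimizer footprint.
    opt = tf.compat.v1.train.GradientDescentOptimizer(learning_rate=0.001)

    # Momentum (optionally Nesterov): one extra slot per parameter, still far
    # lighter than Adam's two moment estimates per parameter.
    opt = tf.compat.v1.train.MomentumOptimizer(
        learning_rate=0.001, momentum=0.9, use_nesterov=True)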

minimaxir commented 5 years ago

FYI, all work on handling the 774M model is currently happening on the 0.6 branch.

nshepperd's fork does implement SGD as an option; I'm open to porting that into the package if it does indeed help solve this problem.

sdan commented 5 years ago

As said in #111

It's a GPT-2 version of the "Grover Mega" model. The original is not currently available for public download. The parameter file here works with the model and the Grover repo:

https://github.com/rowanz/grover/tree/master/lm/configs - see mega.json

It can load and generate sample articles using only an 8 GB 1070. The .jsonl format used by Grover for input and output is, well, "ugly" to say the least (see the repo example). If someone can modify it to fine-tune on standard text files and produce output like the other repos, it might prove useful.

This custom 1.5B model bodes well (for generation at least) if OpenAI releases their 1.5B. Fine-tuning with Adam will be another issue, as it currently is with 774M.

On a side note: I have SGD / all-layer checkpointing / memory-saving gradients working with the nshepperd repo and 774M. Results are OK but not as good as fine-tuning 345M with Adam. I have not had similar success with gpt-2-simple.

Looking forward to future code upgrades and solutions.

The SGD solution may be ideal, but checkpointing all layers as @AdamDanielKing suggested may not be.

  1. I have tried the latest update on the 0.6 branch with SGD and got an OOM error on a 16 GB VRAM card.
  2. Results from checkpointing all layers (no layer == 10 line) with 774M don't look much better than 345M with Adam, but then again I'm not an English teacher, so I can't really judge how much better or worse this 774M training variant is.

I will try checkpointing fewer layers now.

minimaxir commented 5 years ago

After testing: yes, as noted above, SGD is not sufficient. Mixed precision has a few preconditions that make it not "simple" to run on the user end, even if it can be made to work.

I've added the assert for 0.6 now. If there is new info, I'm open to adding a fix.

minimaxir commented 5 years ago

0.6 is now merged with master and released on PyPI. I'll keep this open in case anyone has a better solution.

sdan commented 5 years ago

Spent some time with it. There are 1-2 issues, but the biggest one is that training 774M is now blocked entirely. I understand why this makes sense, but removing the layer == 10 check in src/model.py is probably the best option for now.

I've tried @AdamDanielKing's method of adding a checkpoint after the output, but I got a dict error. I've also tried checkpointing at layer == 2 and == 5, and they don't work, so it seems deleting the line entirely is the only way to go with 774M.

Will update after training.

Edit: Just saw "add assert to prevent finetuning 774M for now" (#108). At the moment, I think that makes sense.

minimaxir commented 5 years ago

Also, to be perfectly honest, I'm not sure if typical users have enough training data to get better results on finetuning 774M vs. 345M (my guess is you'd need a few GB of text).

AdamDanielKing commented 5 years ago

@minimaxir That's a good point. One could get worse results due to the great potential for overfitting. That might be partly averted by adding dropout layers to the network. I think OpenAI did train GPT-2 with dropout even though it's not in the public code. It's used in all similar transformers, including GPT-1.

jesus255221 commented 5 years ago

Hey guys, is there any update on this issue? I'm using a V100 with 16 GB of memory to train the 774M model and deleted the if layer == 10: line in src/model.py in nshepperd's fork of gpt-2, but I still encounter the OOM problem.

sdan commented 5 years ago

Currently working at https://github.com/dantuluri/gpt-2.git. Some Google AI engineers made an optimizer that was publicized a week ago (https://arxiv.org/pdf/1901.11150.pdf), and I am working on implementing it. At the moment I'm trying things out while keeping in touch with some of the authors of the paper to figure out how to reduce the OOM issues. If you have any PRs, go ahead and submit them to my repo. Once I resolve the OOM issues I'll PR into this or nshepperd's repo.

What Max has done for now is deprecate the use of 774M, which makes sense for the general public.

Edit: Just to clarify, most of the work on implementing the optimizer is done. I just need to tweak some things (at least I hope that's all I need to do :) )

saippuakauppias commented 5 years ago

SM3 optimizer for training extremely large models: https://github.com/google-research/google-research/blob/master/sm3/sm3.py

@dantuluri does it work? Or does it still need more than 16 GB of video memory?

PS: nshepperd's repo is outdated; this repo is a better target for a PR.
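For anyone wanting to try it, usage would look roughly like the sketch below, assuming the sm3.py module from google-research exposes an SM3Optimizer class; the import path and constructor arguments are assumptions to check against that file.

    # Hypothetical usage sketch of Google's SM3 optimizer.
    from sm3 import SM3Optimizer  # path depends on how sm3.py is vendored

    # momentum=0.0 avoids allocating a full-size momentum buffer per weight,
    # which is the memory-saving configuration discussed in this thread.
    opt = SM3Optimizer(learning_rate=0.1, momentum=0.0)
    train_op = opt.minimize(loss)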

jhave commented 5 years ago

@dantuluri thanks for the code. I tried it; unfortunately, even with a small dataset it still results in a ResourceExhaustedError. A Titan X with 12 GB of GPU RAM is not enough. Sigh. Any updates? Or perhaps I'm doing something wrong...

erickrf commented 5 years ago

@jhave the size of the dataset doesn't really matter, since it only goes to the GPU one batch at a time. The model size itself is the problem.

sdan commented 5 years ago

At the moment I am running into OOM issues using SM3 (by Google). Talking to them and going through the paper once more gave me some ideas about what the hyperparameters should be (such as setting momentum to 0 to reduce memory).

Here's my plan: I have a lot more time in the next couple of weeks, so I'll be talking to them and messing around with the optimizer / memory-saving gradients to find a way to run it at least on a V100.

As for @jhave: I'm not surprised by the issues. I need to clean up some code and, yes, at the moment it does not work, but as I said, hopefully I'll get it working soon.

Meanwhile, I heard using TPUs is a great option. I know Shawn did this: https://colab.research.google.com/github/shawwn/gpt-2/blob/tpu/Training_GPT_2_Using_TPUs.ipynb, but I haven't looked into it much yet, although it seems promising for the vast majority of users.

I will keep everyone updated.

saippuakauppias commented 5 years ago

Has anyone tried the PyTorch implementation (from Hugging Face)?

Does it contain an equivalent model or did they train their own? (It seems to be just converted: https://huggingface.co/pytorch-transformers/converting_tensorflow_models.html#openai-gpt-2 )

I tried to train the 774M model, but it also dies with out-of-memory. Here is my Colab code: https://colab.research.google.com/drive/1_fRXWrtV14HgNXxHnzVWIf1gAH2fE8n9 With the 355M model on Colab everything is fine, but I can't give an accurate assessment of the quality after generation; it seems to me that it is worse than finetuning with this TensorFlow-based repo (most likely I'm passing some incorrect parameters to the finetuning launch?).

I tested this implementation with the 355M model on 3 x 1080 Ti GPUs, and it uses all of the GPU memory but still dies with out-of-memory (I'll retest tomorrow with gradient_accumulation_steps > 5). Maybe tomorrow I will check the 774M model on 4 x 1080 Ti (or 6x/8x) GPUs, but it seems to me that it still won't have enough memory (for 1 batch per GPU).

PyTorch has distributed training: https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255 (training on several machines with several GPUs each). It looks simple, but I don't know where it can be tested, and there is no certainty that it will help either.

PS: May be interesting https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9
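If anyone goes the PyTorch route, the gradient_accumulation_steps idea mentioned above boils down to something like this (illustrative names; model, optimizer, and dataloader are assumed to exist, and this is not the pytorch-transformers script itself):

    accumulation_steps = 8
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        loss = model(batch, labels=batch)[0]       # LM loss from the model
        (loss / accumulation_steps).backward()     # scale so the sum averages
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()                       # one update per N batches
            optimizer.zero_grad()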

likeoff commented 5 years ago

So, after all that's been said above, what is the solution for running retraining in a multi-GPU environment?

saippuakauppias commented 5 years ago

I think there is no solution at the moment.

I tried to start fine-tuning the 355M model in PyTorch (GPT-2 via pytorch-transformers) on 10 x 1080 Ti, but it still hits the memory limit.

saippuakauppias commented 5 years ago

This is strange, but without if layer == 10: and with the SGD optimizer, finetuning eats only 8.5 GB of video memory (at step 125 now)...


UPD: the Adam optimizer uses 16.7 GB of video memory


UPD:

Adding

tf.compat.v1.add_to_collection('checkpoints', output)

as suggested by @AdamDanielKing gives me an exception:

  File "/usr/local/lib/python3.5/dist-packages/gpt_2_simple/gpt_2.py", line 217, in finetune
    opt_grads = memory_saving_gradients.gradients(loss, train_vars)
   File "/usr/local/lib/python3.5/dist-packages/gpt_2_simple/src/memory_saving_gradients.py", line 160, in gradients
    checkpoints = list(set(checkpoints).intersection(ts_all))
TypeError: unhashable type: 'dict'

UPD: with SGD I get an error during sampling (a Unicode error)


UPD: increasing batch_size with SGD does not change the amount of memory used, but the finetuning time increases (with Adam, memory increases)

sdan commented 5 years ago

Got that same error too. I didn't dive too deep into it, but tried other things. Did you check out OpenAI's latest post on how to finetune gpt2? I'm going to delve into that and see what I can find for an upcoming blog post of mine.


saippuakauppias commented 5 years ago

Did you check out OpenAI's latest post on how to finetune gpt2?

This post: https://openai.com/blog/fine-tuning-gpt-2/ ? I did not find anything in it that could help us; I may have read it poorly.


UPD: FP16 needs NVIDIA Compute Capability >= 7.0. Only the RTX 6000 (24 GB), RTX 8000 (48 GB), Quadro GV100 (32 GB), and TITAN RTX (24 GB) can use it for this task. The others do not have enough memory at the moment, but maybe with SGD and if layer == 10: removed, everything will be fine on 16 GB cards - need to check.

AdamDanielKing commented 5 years ago

@saippuakauppias That line I gave had a bug. What I meant was to try adding output['logits'].

saippuakauppias commented 5 years ago

@AdamDanielKing thanks for the quick reply!

Unfortunately, memory consumption has not decreased: Adam = 16.7 GB, SGD = 8.5 GB. Should there be a decrease in loss or step time?