salesforce / ctrl

Conditional Transformer Language Model for Controllable Generation
https://arxiv.org/abs/1909.05858
BSD 3-Clause "New" or "Revised" License

Fine-tuning on Colab #63

Open manueltonneau opened 4 years ago

manueltonneau commented 4 years ago

I am trying to fine-tune the model in Google Colab with the runtime type set to Python 3 / GPU. After I launch the training, it suddenly stops and prints "^C", though I haven't pressed Ctrl-C. The last messages before it stops are:

2019-11-28 09:49:59.717067: W tensorflow/core/framework/allocator.cc:107] Allocation of 41943040 exceeds 10% of system memory.
tcmalloc: large alloc 1262256128 bytes == 0x11ff50000 @ 0x7f3452087b6b 0x7f34520a7379 0x7f340d73b754 0x7f340d6f6c8a 0x7f340d433f11 0x7f340d4415b2 0x7f340d449dda 0x7f3416665097 0x7f3416666bee 0x7f3416666dcd 0x7f3416660a3b 0x7f341660d781 0x7f341660e164 0x7f341651d3b9 0x7f341651e72a 0x7f3416520187 0x7f3416522122 0x7f34165156d1 0x7f341651713c 0x7f34134ad211 0x7f34134af0a6 0x7f34134b0f26 0x7f34134b1654 0x7f3410dd3755 0x7f34134ee7cd 0x7f34134ef505 0x7f3410dd0b58 0x7f3410dd0c9a 0x7f3410d87f8e 0x50a84f 0x50c549
tcmalloc: large alloc 1262256128 bytes == 0x11ff50000 @ 0x7f3452087b6b 0x7f34520a7379 0x7f340d73b754 0x7f340d6f6c8a 0x7f340d433f11 0x7f340d4415b2 0x7f340d449dda 0x7f3416665097 0x7f3416666bee 0x7f3416666dcd 0x7f3416660a3b 0x7f341660d781 0x7f341660e164 0x7f341651d3b9 0x7f341651e72a 0x7f3416520187 0x7f3416522122 0x7f34165156d1 0x7f341651713c 0x7f34134ad211 0x7f34134af0a6 0x7f34134b0f26 0x7f34134b1654 0x7f3410dd3755 0x7f34134ee7cd 0x7f34134ef505 0x7f3410dd0b58 0x7f3410dd0c9a 0x7f3410d87f8e 0x50a84f 0x50c549
2019-11-28 09:50:18.233025: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.

Here is the link to my colab notebook for more info: https://colab.research.google.com/drive/1HZlVxvrH1JbcLKa-Z437MjsmYK9NcLwh

Let me know if you know where that comes from. Thanks a lot in advance :)

liya-gafurova commented 4 years ago

I also have a problem during fine-tuning in Colab, with a Python 2 / GPU runtime (and also with Python 3 / GPU). After I launch the training, it suddenly stops and prints "^C", though I haven't pressed Ctrl-C. The last messages before it stops are:

2019-11-28 08:04:36.283362: I tensorflow/core/common_runtime/placer.cc:54] save/RestoreV2/tensor_names: (Const)/job:localhost/replica:0/task:0/device:CPU:0
save/RestoreV2/shape_and_slices: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2019-11-28 08:04:36.283384: I tensorflow/core/common_runtime/placer.cc:54] save/RestoreV2/shape_and_slices: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2019-11-28 08:04:38.175585: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2019-11-28 08:04:38.441567: W tensorflow/core/framework/allocator.cc:107] Allocation of 1262254080 exceeds 10% of system memory.
2019-11-28 08:04:38.441703: W tensorflow/core/framework/allocator.cc:107] Allocation of 1262254080 exceeds 10% of system memory.
tcmalloc: large alloc 1262256128 bytes == 0x55ee80adc000 @ 0x7f5e1144eb6b 0x7f5e1146e379 0x7f5dce342754 0x7f5dce2fdc8a 0x7f5dce03b76d 0x7f5dce04d7ab 0x7f5dce01ea96 0x7f5dce01ef69 0x7f5dce01f0c5 0x7f5dd2c88ee8 0x7f5dd2c8a8b2 0x7f5dce317cd4 0x7f5dce316b44 0x7f5e0ff809e0 0x7f5e10e306db 0x7f5e1116988f
tcmalloc: large alloc 1262256128 bytes == 0x55ef4592a000 @ 0x7f5e1144eb6b 0x7f5e1146e379 0x7f5dce342754 0x7f5dce2fdc8a 0x7f5dce03b76d 0x7f5dce04d7ab 0x7f5dce01ea96 0x7f5dce01ef69 0x7f5dce01f0c5 0x7f5dd2c88ee8 0x7f5dd2c8a8b2 0x7f5dce317cd4 0x7f5dce316b44 0x7f5e0ff809e0 0x7f5e10e306db 0x7f5e1116988f
2019-11-28 08:05:21.013678: E tensorflow/stream_executor/cuda/cuda_driver.cc:890] failed to alloc 4294967296 bytes on host: CUDA_ERROR_INVALID_VALUE: invalid argument
2019-11-28 08:05:21.016542: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 4294967296
^C

Here is the link to my colab notebook for more info: https://colab.research.google.com/drive/1v9rxj_-BYC6daCpBdTLx5BzgjgZS-SEC

Is it possible to somehow solve this problem on a GPU runtime? Thank you :)

itsuncheng commented 4 years ago

Any update on this problem? I'm having exactly the same issue here.

wingrunr21 commented 4 years ago

The ^C is due to running out of memory.
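To make the out-of-memory reading concrete, here is a minimal sketch that pulls the byte counts out of two log lines quoted earlier in this thread and converts them to GiB. The vocabulary size (246534) and embedding width (1280) are assumptions that are not stated anywhere in this thread; they are included only because their product in float32 happens to equal the 1262254080-byte allocation in the log, which suggests (but does not prove) that this allocation is the token embedding table.

```python
import re

# Two lines copied from the logs posted above in this thread.
LOG_LINES = [
    "2019-11-28 08:04:38.441567: W tensorflow/core/framework/allocator.cc:107] "
    "Allocation of 1262254080 exceeds 10% of system memory.",
    "2019-11-28 08:05:21.013678: E tensorflow/stream_executor/cuda/cuda_driver.cc:890] "
    "failed to alloc 4294967296 bytes on host: CUDA_ERROR_INVALID_VALUE: invalid argument",
]

# Matches the two phrasings that appear in the logs above.
SIZE_PATTERN = re.compile(r"(?:Allocation of|failed to alloc) (\d+)")


def allocation_sizes(lines):
    """Return the byte counts mentioned in allocator warnings and errors."""
    sizes = []
    for line in lines:
        match = SIZE_PATTERN.search(line)
        if match:
            sizes.append(int(match.group(1)))
    return sizes


def to_gib(n_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


sizes = allocation_sizes(LOG_LINES)
for size in sizes:
    print(f"{size} bytes = {to_gib(size):.2f} GiB")

# ASSUMPTION: embedding table shape, not taken from this thread.
ASSUMED_VOCAB_SIZE = 246534
ASSUMED_EMBEDDING_DIM = 1280
BYTES_PER_FLOAT32 = 4
embedding_bytes = ASSUMED_VOCAB_SIZE * ASSUMED_EMBEDDING_DIM * BYTES_PER_FLOAT32
print("matches the logged allocation:", embedding_bytes == sizes[0])
```

The second logged size is exactly 4 GiB of pinned host memory, requested on top of allocations of more than 1 GiB each, so a Colab host with limited RAM can plausibly be exhausted before training even starts.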

I will note that I'm having trouble fine-tuning a model even using Colab Pro with the higher resourcing. I've been playing around with batch sizes and the like and so far it's still blowing through the memory limits.
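A rough back-of-the-envelope estimate may explain why changing the batch size does not help. The sketch below assumes the 1.63 billion parameters reported in the CTRL paper linked at the top of this page, float32 storage, and an optimizer that keeps one extra value per parameter; the 16 GiB accelerator size is a hypothetical figure, not something reported in this thread. Under those assumptions, weights, gradients, and optimizer state alone exceed the assumed GPU memory, and none of those three terms depends on batch size.

```python
GIB = 2**30


def training_state_gib(n_params, bytes_per_value=4, optimizer_slots=1):
    """Estimate memory for weights, gradients, and optimizer state only.

    Activations are excluded; they are the part that scales with batch size.
    """
    copies = 1 + 1 + optimizer_slots  # weights + gradients + optimizer slots
    return n_params * bytes_per_value * copies / GIB


N_PARAMS = 1_630_000_000   # parameter count reported in the CTRL paper
HYPOTHETICAL_GPU_GIB = 16  # ASSUMPTION: accelerator size, not from this thread

weights_only = N_PARAMS * 4 / GIB
needed = training_state_gib(N_PARAMS)

print(f"weights only: {weights_only:.1f} GiB")
print(f"weights + gradients + 1 optimizer slot: {needed:.1f} GiB")
print("fits on the assumed GPU:", needed <= HYPOTHETICAL_GPU_GIB)
```

If this estimate is roughly right, reducing the batch size only shrinks the activation memory, which would be consistent with the memory limits still being exceeded.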

arturogatti commented 4 years ago

Not possible; you must use a developer, which costs around $65k. Stop commenting.