uclnlp / jack

Jack the Reader
MIT License

ResourceExhausted error when training BiDAF model on GPU with longer-context data added to SQuAD #344

Closed maxbartolo closed 6 years ago

maxbartolo commented 6 years ago

Basically, when training on a 12GB Tesla K80 GPU on Google Cloud (you have to manually install tensorflow-gpu, otherwise it defaults to CPU), I get a ResourceExhausted error, seemingly during:

UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

For the regular SQuAD data, the GPU is only consuming 5GB of memory, but on the added data (with longer context lengths), it's at max memory until exhaustion. The FastQA model trains fine on this data.

maxbartolo commented 6 years ago

Also, a note that the new data has roughly 1.3 times the number of samples of SQuAD (i.e. SQuAD plus additional data).

riedelcastro commented 6 years ago

@TimDettmers @dirkweissenborn what is the state of BiDaf, actually? Is it reproducing results yet?

dirkweissenborn commented 6 years ago

@maxbartolo try reducing the batch size by passing "batch_size=32" as an additional training argument. If contexts get too long, you can set the "max_support_size" argument to something small, like 400. However, reducing the batch size should solve the issue. BTW, what is this additional data and what does it look like?

@riedelcastro It is not really tested yet. I will train a model next week and implement a new state of the art model.

dirkweissenborn commented 6 years ago

... @maxbartolo the warning message is not the problem

maxbartolo commented 6 years ago

@dirkweissenborn thanks for the quick reply! Yeah, the warning message seems unrelated. I tried batch size 32; it got further, started training and reached 100 iterations, but then stopped (ResourceExhausted again). I normally train a BiDAF model (outside Jack) with batch size 60 on this data.

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[5716,32,100]
         [[Node: gradients/zeros_like_1 = ZerosLike[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bidaf_reader/bidaf/end_encoder/rnn/FW/BlockLSTM:1)]]
         [[Node: bidaf_reader/cond/Max_1/_257 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1563_bidaf_reader/cond/Max_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Currently training on the original data to see how far it gets (currently 800 iterations into epoch 1, and it's only using 4.8GB of GPU memory, so it seems like it will be fine). Then I'll try reducing "max_support_size".

New state of the art model sounds really exciting, just in case it helps, BiDAF + self-attention + ELMo would be really cool!

dirkweissenborn commented 6 years ago

Well, we don't have ELMo, but there will be a decent GloVe-based model next week that's comparable to the rest. Are you working on the latest version of Jack? The old BiDAF implementation was really suboptimal; the new one is much better. Can you try it? Can you also explain in more detail how your data differs from the original SQuAD?

maxbartolo commented 6 years ago

What is old vs new? I'm running a version I pulled from the master branch on Thursday. The additional data is very similar to SQuAD, cleaned and with any special characters converted. The only difference is that the contexts of the additional data are a bit longer than SQuAD's (in terms of number of words). I can find the histograms tomorrow if they would help.

dirkweissenborn commented 6 years ago

I think your data is not being read properly. From your error I can see that some contexts have 5716 tokens. This is huge! Can you check your data, please?
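
A quick way to check is to count whitespace tokens per context (a minimal sketch, assuming the data follows the standard SQuAD v1.1 JSON layout; the file name is hypothetical):

import json

# hypothetical path; replace with your own training file
with open("train-extended.json") as f:
    data = json.load(f)
# whitespace token count per context in the SQuAD-style
# {"data": [{"paragraphs": [{"context": ...}]}]} nesting
lengths = [len(p["context"].split())
           for article in data["data"]
           for p in article["paragraphs"]]
print("contexts:", len(lengths))
print("max tokens:", max(lengths))
print("contexts over 1000 tokens:", sum(l > 1000 for l in lengths))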

maxbartolo commented 6 years ago

Yes, there are some very long contexts in there. Will setting max_support_size effectively restrict the context size, or should I just exclude any contexts longer than, say, 1k words?

dirkweissenborn commented 6 years ago

... if it is indeed the case that some of your contexts are that large, there are two options: 1) if your dataset is in Jack format, you can split the supporting texts into multiple texts (Jack can handle multiple supports), or 2) set max_support_size=400 or something like that. This will truncate your support during training. However, after training you would have to override this flag again when using the model on larger contexts, or else the model will only consider the first 400 tokens. Here is how you can override the flag after training, given that your model is saved to MODEL_DIR.

from jack.readers.implementations import reader_from_file
# reload the trained model, overriding max_support_size (-1 disables truncation)
r = reader_from_file(MODEL_DIR, {"max_support_size": -1})
# save your model with the updated flag if you want to
r.store(MODEL_DIR)

dirkweissenborn commented 6 years ago

... note that option one is preferred since it will consider the entire context.

dirkweissenborn commented 6 years ago

... truncation is also a bit smarter than just retaining the first N tokens. It basically always tries to cut a window out of the support that contains at least one answer.
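
For illustration, answer-aware truncation is roughly the following (a sketch of the idea only, not Jack's actual implementation; answer spans are given as token indices):

import random

def truncate_support(tokens, answer_spans, max_size):
    # keep the support as-is if it already fits
    if len(tokens) <= max_size:
        return tokens, answer_spans
    # pick one answer span and centre a max_size window around it
    start, end = random.choice(answer_spans)
    slack = max_size - (end - start + 1)
    window_start = max(0, min(start - slack // 2, len(tokens) - max_size))
    window_end = window_start + max_size
    # re-index the answer spans that survive the cut
    kept = [(s - window_start, e - window_start)
            for s, e in answer_spans
            if s >= window_start and e < window_end]
    return tokens[window_start:window_end], kept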

dirkweissenborn commented 6 years ago

another option is not to use BiDAF, because it is a very resource-intensive model ;)

riedelcastro commented 6 years ago

That said, I think the AllenAI model scales to this data as far as I know.

dirkweissenborn commented 6 years ago

Well, then this has something to do either with TF being resource-unfriendly or with them doing something like truncation as well. The initial BiDAF implementation was very resource-unfriendly, but that was fixed. What we can do, though, is set swap_memory to true when running our models; this will use RAM when unrolling RNNs.
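
swap_memory is the standard option of TensorFlow's dynamic RNN unrolling; schematically (a minimal TF 1.x sketch, not Jack's actual model code):

import tensorflow as tf

# [batch, time, features] inputs and per-example sequence lengths
inputs = tf.placeholder(tf.float32, [None, None, 100])
lengths = tf.placeholder(tf.int32, [None])
cell = tf.nn.rnn_cell.LSTMCell(100)
# swap_memory=True lets TF move activations kept for backprop
# from GPU memory to host RAM while unrolling long sequences
outputs, state = tf.nn.dynamic_rnn(
    cell, inputs, sequence_length=lengths,
    dtype=tf.float32, swap_memory=True)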

dirkweissenborn commented 6 years ago

In any case, the best solution is to split the supporting context into paragraphs and supply a list of strings as support. This will make everything much faster as well. BiDAF and similar models were just not meant to process entire documents sequentially.
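
A minimal sketch of such a split (assuming plain-text contexts with blank-line paragraph breaks; the helper name is made up, and feeding the resulting list into Jack's data format is not shown):

def split_support(context, max_tokens=400):
    # split on blank lines first, then break up paragraphs that
    # are still longer than max_tokens whitespace tokens
    paragraphs = [p.strip() for p in context.split("\n\n") if p.strip()]
    supports = []
    for p in paragraphs:
        tokens = p.split()
        for i in range(0, len(tokens), max_tokens):
            supports.append(" ".join(tokens[i:i + max_tokens]))
    return supports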

dirkweissenborn commented 6 years ago

@riedelcastro and even if the AllenAI model scales to that data (without automagically splitting the context internally, as I was suggesting), the model will perform more poorly the larger the contexts get, because of attention.

riedelcastro commented 6 years ago

Agreed!

dirkweissenborn commented 6 years ago

BTW, I am working on a modular QA model where you can stick your model together in your YAML config. It will support many of the current SotA models out of the box, without having to write new code. It will also make it easy to experiment with, for instance, convolutional encoders vs RNNs. I will create a PR tomorrow.

riedelcastro commented 6 years ago

Sounds great. Will it also be easy to do this via the API?

dirkweissenborn commented 6 years ago

Yes, no changes needed

dirkweissenborn commented 6 years ago

@maxbartolo After splitting the context into paragraphs, you can additionally set the flag max_num_support=2 (or some other value). This will select only the top-k paragraphs after scoring them with tf-idf against the question. This saves additional resources, and in most cases you will still select the correct paragraphs.
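
The selection this describes is essentially tf-idf ranking of the support paragraphs against the question; roughly (a sketch using scikit-learn for illustration, not Jack's internal implementation):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_supports(question, supports, k=2):
    # fit tf-idf on the paragraphs plus the question, then keep the
    # k paragraphs most similar to the question (in original order)
    matrix = TfidfVectorizer().fit_transform(supports + [question])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    ranked = scores.argsort()[::-1][:k]
    return [supports[i] for i in sorted(ranked)]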

maxbartolo commented 6 years ago

Thanks Dirk, will take a look at the multiple supporting paragraphs, that seems very interesting! Another small note: BiDAF hit a max of around 70% F1 score (I'd expect around 77%). Is it only being evaluated against the first annotation in the data, or are there further improvements to the model expected in the coming weeks?

dirkweissenborn commented 6 years ago

I will get it to work this week with better results. Note that the internal F1 scores will be about 1-2% below the actual F1 score from the official scoring script, which is a bit more lenient.
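
For reference, the official SQuAD script computes a token-overlap F1 per answer and takes the maximum over the annotated answers; a simplified sketch (without the script's full answer normalization):

from collections import Counter

def f1_score(prediction, ground_truth):
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)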

maxbartolo commented 6 years ago

Sounds good, thanks!

dirkweissenborn commented 6 years ago

In the modular_qa branch you can find updated implementations of bidaf and jack_qa[_light] (our novel architecture). You can use them for your experiments if you like. They will soon be in master as well.

I recommend jack_qa_light, because it is quite fast (twice as fast as bidaf) while at the same time being a good model (definitely better than bidaf). However, I will evaluate it in more detail over the next couple of days.

dirkweissenborn commented 6 years ago

Hi @maxbartolo, finally some good news. The models in the modular_qa branch work properly now, so you can go ahead and use them. Check out the results for BiDAF and JackQA Light here. The other results are on the way.

maxbartolo commented 6 years ago

Hey @dirkweissenborn, thanks! That looks great! Will try out those models in the next few days as I have some new data being processed which I want to add. Will let you know how it goes! I'm guessing those F1 scores are from the official script not the stricter version right?

dirkweissenborn commented 6 years ago

@maxbartolo yes, they are. Internal metrics were about 3% lower actually ;)