Also, a note that the new data has roughly 1.3 times as many samples as SQuAD (i.e. SQuAD plus additional data).
@TimDettmers @dirkweissenborn what is the state of BiDAF, actually? Is it reproducing results yet?
@maxbartolo try reducing the batch size by putting "batch_size=32" as an additional training argument. If contexts get too large, you can set the "max_support_size" argument to something small, like 400. However, reducing the batch size should solve the issue. BTW, what is this additional data and how does it look?
@riedelcastro It is not really tested yet. I will train a model next week and implement a new state-of-the-art model.
... @maxbartolo the warning message is not the problem.
@dirkweissenborn thanks for the quick reply! Yeah, the warning message seems unrelated. I tried batch size 32; it got further, started training, and reached 100 iterations, but then stopped (ResourceExhausted again). I normally train BiDAF (outside Jack) with batch size 60 on this data.
```
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[5716,32,100]
[[Node: gradients/zeros_like_1 = ZerosLike[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bidaf_reader/bidaf/end_encoder/rnn/FW/BlockLSTM:1)]]
[[Node: bidaf_reader/cond/Max_1/_257 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1563_bidaf_reader/cond/Max_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
```
Currently training on the original data to see how far it gets (at 800 iterations in epoch 1 it's only using 4.8GB of GPU memory, so it seems like it will be fine). Then I'll try reducing "max_support_size".
A new state-of-the-art model sounds really exciting. Just in case it helps: BiDAF + self-attention + ELMo would be really cool!
Well, we don't have ELMo, but there will be a decent model with GloVe, comparable to the rest, next week. Are you working on the latest version of Jack? The old BiDAF implementation was really suboptimal; the new one is much better. Can you try it? Can you also explain in more detail how your data differs from the original SQuAD?
What is old vs. new? I'm running a version I pulled from the master branch on Thursday. The additional data is very similar to SQuAD, cleaned, with any special characters converted. The only difference is that the contexts of the additional data are a bit longer than SQuAD's (in terms of number of words). I can find the histograms tomorrow if they would help.
I think your data is not read properly. From your error I can see that some contexts have 5716 tokens. This is huge! Can you check your data please?
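Some rough arithmetic shows why contexts of that length blow up. Reading the shape [5716, 32, 100] in the traceback as [support_length, batch_size, hidden_size] is an assumption, as is the count of intermediate tensors:

```python
# Back-of-the-envelope estimate for the failing allocation above.
support_length, batch_size, hidden_size = 5716, 32, 100
bytes_per_float32 = 4

one_tensor = support_length * batch_size * hidden_size * bytes_per_float32
print(f"one activation tensor: {one_tensor / 2**20:.1f} MiB")   # ~69.8 MiB

# Backprop through an unrolled RNN keeps activations for every timestep of
# every layer, plus matching gradient buffers, so on the order of a hundred
# such tensors (a guess, not measured) already exceeds a 12 GB K80:
print(f"150 such tensors: {150 * one_tensor / 2**30:.1f} GiB")  # ~10.2 GiB
```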
Yes, there are some very long contexts in there. Will setting max_support_size effectively restrict the context size, or should I just exclude any contexts longer than, say, 1k words?
... if it is the case, however, that some of your contexts are indeed so large, there are two options: 1) if your dataset is in Jack format, you can split the supporting texts into multiple texts (Jack can handle multiple supports), or 2) use max_context_size=400 or something like that. This will truncate your support during training. However, after training you would have to override this flag again when using the model on larger contexts, or else the model will only consider the first 400 tokens. Here is how you can override the flag after training, given that your model is saved to MODEL_DIR:
```python
from jack.readers.implementations import reader_from_file

r = reader_from_file(MODEL_DIR, {"max_support_size": -1})
# save your model with the updated flag if you want to
r.store(MODEL_DIR)
```
... note that option one is preferred since it will consider the entire context.
... truncation is also a bit smarter than just retaining the first N tokens. It basically always tries to cut a window out of the support that contains at least one answer.
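Option 1 can also be done as a pre-processing step before handing data to Jack. A minimal sketch under assumptions: `split_support` is a hypothetical helper, not part of Jack, and answer character offsets would need remapping to whichever chunk contains them:

```python
def split_support(text, max_tokens=400):
    """Split one long support string into paragraph-sized chunks so the
    dataset can use Jack's multiple-supports format (hypothetical helper)."""
    chunks, current = [], []
    for paragraph in text.split("\n\n"):  # split on blank lines first
        tokens = paragraph.split()
        if current and len(current) + len(tokens) > max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.extend(tokens)  # a single oversized paragraph stays whole
    if current:
        chunks.append(" ".join(current))
    return chunks

# e.g. instance["support"] = split_support(instance["support"][0])
```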
Another option is not to use BiDAF at all, because it is a very resource-intensive model ;)
That said, the AllenAI model scales to this data, as far as I know.
Well, then this has something to do either with TF being resource-unfriendly or with them doing something like truncation as well. The initial BiDAF implementation was very resource-unfriendly, but that was fixed. What we can do, though, is set swap memory to true when running our models; this will use RAM when unrolling RNNs.
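For reference, that switch is just the `swap_memory` flag on TensorFlow 1.x's `tf.nn.dynamic_rnn` (the fused BlockLSTM kernel in the traceback above takes a different code path); a minimal standalone sketch:

```python
import tensorflow as tf

cell = tf.nn.rnn_cell.LSTMCell(100)
inputs = tf.placeholder(tf.float32, [None, None, 100])   # [batch, time, dim]
lengths = tf.placeholder(tf.int32, [None])

# swap_memory=True lets TF spill forward activations to host RAM during the
# unroll and fetch them back for the backward pass: slower, but it avoids
# GPU OOM when backpropagating through very long sequences.
outputs, state = tf.nn.dynamic_rnn(
    cell, inputs, sequence_length=lengths, swap_memory=True, dtype=tf.float32)
```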
In any case, the best solution is to split the supporting context into paragraphs and supply a list of strings as support. This will make everything much faster as well. BiDAF and similar models were just not meant to process entire documents sequentially.
@riedelcastro and even if AllenAI scales to that data (without automagically splitting the context internally, as I was suggesting), the model will perform more poorly the larger the contexts get, because of attention.
Agreed!
BTW, I am working on a modular QA model where you can stick your model together in your YAML config. It will support many of the current SotA models out of the box, without having to write new code. It will also make it easy to experiment with, for instance, convolutional encoders vs. RNNs. I will create a PR tomorrow.
Sounds great. Will it also be easy to do this via the API?
Yes, no changes needed
@maxbartolo After splitting the context into paragraphs, you can additionally set the flag max_num_support=2 (or some other value). This will select only the top-k paragraphs after scoring them with tf-idf against the question. This saves additional resources, and in most cases you will still select the correct paragraphs.
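The idea behind that tf-idf selection can be sketched independently of Jack; this is a rough approximation using scikit-learn, not Jack's actual implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_paragraphs(question, paragraphs, k=2):
    """Rank paragraphs by tf-idf cosine similarity to the question
    and return the k best, in document order."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([question] + paragraphs)
    scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
    best = scores.argsort()[::-1][:k]
    return [paragraphs[i] for i in sorted(best)]
```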
Thanks Dirk, I'll take a look at the multiple supporting paragraphs; that seems very interesting! Another small note: BiDAF hit a max of around 70% F1 score (I'd expect around 77%). Is it only being evaluated against the first annotation in the data, or are there further improvements to the model expected in the coming weeks?
I will get it working this week with better results. Note that the internal F1 scores will be about 1-2% below the actual F1 from the official scoring script, which is a bit more lenient.
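For anyone comparing numbers: the official SQuAD script computes token-overlap F1 against every annotated answer and takes the maximum, which is why it is more lenient than evaluating against a single annotation. The gist of that computation, with the answer-normalization step omitted:

```python
from collections import Counter

def token_f1(prediction, ground_truth):
    """Token-overlap F1 between two answer strings (normalization omitted)."""
    pred_tokens, gt_tokens = prediction.split(), ground_truth.split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

# The official score takes the max over all annotated answers:
# f1 = max(token_f1(prediction, gt) for gt in gold_answers)
```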
Sounds good, thanks!
In branch `modular_qa` you can find updated implementations of `bidaf` and `jack_qa[_light]` (our novel architecture). You can use them for your experiments if you like. They will soon be in `master` as well. I recommend `jack_qa_light`, because it is quite fast (twice as fast as `bidaf`) while at the same time being a good model (definitely better than `bidaf`). However, I will evaluate it in more detail over the next couple of days.
Hi @maxbartolo, finally some good news. The models in the `modular_qa` branch work properly now, so you can go ahead and use them. Check out the results for BiDAF and JackQA Light here. The other results are on the way.
Hey @dirkweissenborn, thanks! That looks great! I'll try out those models in the next few days, as I have some new data being processed which I want to add. I'll let you know how it goes! I'm guessing those F1 scores are from the official script, not the stricter version, right?
@maxbartolo yes, they are. Internal metrics were about 3% lower actually ;)
Basically, when training on a 12GB Tesla K80 GPU on Google Cloud (you have to manually install tensorflow-gpu, otherwise it defaults to CPU), I get a ResourceExhausted error, seemingly during:
For the regular SQuAD data, the GPU consumes only 5GB of memory, but on the added data (with longer context lengths), it runs at max memory until exhaustion. The FastQA model trains fine on this data.