hipster-philology / pandora

A Tagger-Lemmatizer for Natural Languages
MIT License

Epoch evaluation takes an unusually long time #55

Open PonteIneptique opened 6 years ago

PonteIneptique commented 6 years ago

Compared to the previous Keras implementation, evaluation takes a really long time (on GPU). It takes a few minutes to evaluate, while it takes ~30-35 seconds to train...

This could be related to #54. Could it be that things are run twice?

PonteIneptique commented 6 years ago

So I can definitely confirm there is a huge performance issue here: it usually took me 30 to 40 minutes on GPU to train a whole network for medieval French (200k tokens, 150 epochs). The PyTorch model ran for the whole night (9 PM to 7:18 AM) and I am only at epoch 88.

PonteIneptique commented 6 years ago

Additional note: the overall epoch fitting takes a little more time, 27s instead of 17s, but I don't think this explains the whole efficiency drop.

PonteIneptique commented 6 years ago

Configuration, for more details:

# Configuration file for the Pandora system
[global]
nb_encoding_layers = 2
nb_dense_dims = 1000
batch_size = 100
nb_left_tokens = 2
nb_right_tokens = 1
nb_embedding_dims = 100
model_dir = models/chrestien
postcorrect = False
include_token = True
include_context = True
include_lemma = label
include_pos = True
include_morph = False
include_dev = True
include_test = True
nb_filters = 150
min_token_freq_emb = 5
filter_length = 3
focus_repr = convolutions
dropout_level = 0.15
nb_epochs = 150
halve_lr_at = 75
max_token_len = 20
min_lem_cnt = 1
model = PyTorch
max_lemma_len = 32

PonteIneptique commented 6 years ago

It takes more or less 20 minutes to eval the scores.

Ideas of where we might be losing performance:

emanjavacas commented 6 years ago

I will try to reproduce this. I haven't encountered the issue (same with #54) when running train.py. Could you try to debug a bit, starting from there?

With respect to your ideas, I have already referred to a) and c) somewhere else. Basically, during inference you want as high a batch_size as you can afford, since it has no effect on the output (tagging); the same doesn't apply during training (see the sketch below). b) shouldn't be an issue per se.

One bottleneck is that the entire pipeline still feels too handcrafted with the Keras model in mind. The PyTorch model could benefit (in terms of speed and performance) from changes in the way the data is loaded, but that would require a considerable amount of refactoring in the client code.
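
A minimal sketch of the inference batch-size point above, in generic PyTorch (the model and inputs here are placeholders, not pandora's actual API): in eval mode, with gradients disabled, the batch size only determines how many rows go through each forward pass, so a larger value speeds tagging up without changing the predicted labels.

import torch

def predict_in_batches(model, inputs, batch_size=1000):
    # Tag `inputs` (a 2-D tensor of token indices) batch by batch.
    # The batch size only controls how many rows go through each forward
    # pass; the predicted labels are identical whatever value is used.
    model.eval()
    predictions = []
    with torch.no_grad():  # no gradients are needed at inference time
        for start in range(0, inputs.size(0), batch_size):
            scores = model(inputs[start:start + batch_size])
            predictions.append(scores.argmax(dim=-1))
    return torch.cat(predictions)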

PonteIneptique commented 6 years ago

If you have not encountered this issue, I would think it comes from the test corpora: if you were using train.py, the test corpora were not used...

PonteIneptique commented 6 years ago

I added a branch to keep track of what's going on with predict: predict.txt. I force batches of size 1000 on purpose. Predict apparently takes gradually more time and stabilizes around 0.40s, which amounts to more or less 69s.

I guess having batches of size 100 is pretty bad for predict... Maybe we should introduce a batch_size_predict argument? The same computation stabilizes around 0.20s for 100-sized batches at 1k batches (and still grows after that).
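
For reference, a hedged sketch of how such per-batch prediction timings can be collected (generic PyTorch; `model` and `batches` are illustrative placeholders, not pandora's real objects):

import time
import torch

def time_prediction(model, batches):
    # Log how long each prediction batch takes, to spot the kind of
    # gradual per-batch slowdown described above.
    model.eval()
    with torch.no_grad():
        for i, batch in enumerate(batches):
            start = time.time()
            model(batch)
            if torch.cuda.is_available():
                torch.cuda.synchronize()  # wait for the GPU before reading the clock
            print('batch %d: %.2fs' % (i, time.time() - start))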

PonteIneptique commented 6 years ago

Code for eval : https://github.com/hipster-philology/pandora/blob/eval-drop-perfomance/pandora/impl/pytorch/model.py#L368-L370

PonteIneptique commented 6 years ago

Note: I moved some issues into #64 and #65.

Please make sure to open a new issue when a separate bug arises. This keeps the discussion clean and understandable.

mikekestemont commented 6 years ago

@emanjavacas I can reproduce your error and will look into it now.

PonteIneptique commented 6 years ago

@mikekestemont Could it be that this comment is about #65?

mikekestemont commented 6 years ago

Yes, I'll take it there.

PonteIneptique commented 6 years ago

To add a little more background on this issue:

PonteIneptique commented 6 years ago

Regarding this, I think the best thing to do is to add a new parameter, test_batch_size, to speed up evaluation without affecting the training batch_size (sketched below).
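
A sketch of how such a parameter could behave. The settings dictionary and key names are assumptions for illustration, not pandora's actual configuration API:

def eval_batch_size(settings):
    # Hypothetical helper: prefer `test_batch_size` for evaluation when it
    # is present in the configuration, otherwise fall back to the training
    # `batch_size`. Neither name is guaranteed to match pandora's settings.
    return settings.get('test_batch_size') or settings.get('batch_size', 100)

# With the configuration shown earlier this could look like:
#   batch_size = 100        (used for training)
#   test_batch_size = 1000  (used only for evaluation/prediction)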

emanjavacas commented 6 years ago

So I gather the issue was that evaluation was done with a very small batch size?

PonteIneptique commented 6 years ago

It was definitely a question of batch size. The weird thing is that this was much much more efficient in Keras for some reason.

emanjavacas commented 6 years ago

Mhh, could you elaborate? I still can't see why a small batch size would lead to exponentially increasing running time during evaluation (as shown in your file predict.txt).

PonteIneptique commented 6 years ago

Keras somehow did not take as much time to eval with the same batch size. I have absolutely no idea what the cause is.

The one thing I did not check is whether the network for eval was on CPU or GPU. But even then... I don't see how that would create that much of a difference...

I know it's one way to fix this time-consuming issue, but I don't know where it comes from, mostly because the training time is about the same...
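
One quick way to rule out the CPU/GPU question is to inspect where the model's parameters live; a minimal sketch in generic PyTorch (`model` is a placeholder, not pandora-specific):

import torch

def report_device(model):
    # Is the model actually sitting on the GPU when evaluation runs?
    on_gpu = next(model.parameters()).is_cuda
    print('model on GPU:', on_gpu)
    print('CUDA available:', torch.cuda.is_available())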

emanjavacas commented 6 years ago

So, but then the running time wasn't growing exponentially, it was just slower? If you were running on the CPU, one reason might be that PyTorch (especially older versions) has been shown to be slower than other engines...

PonteIneptique commented 6 years ago

I now understand your question. It was both slower and growing. I did not check whether this growing eval time happens with the new test batch size. I am pretty sure it might, but it might not be as important because of the number of batches (?)

I am running everything on GPU though. Gotta use the best available part of the PC :)

emanjavacas commented 6 years ago

Ok, then if the running time is growing exponentially, we definitely need to debug it. That should not happen: there is no reason why two consecutive batches of the same size should take a different amount of time. Could you check if the memory usage is also increasing?
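
One way to do that check, sketched with generic PyTorch utilities (this is an assumption about how the check could be performed, not part of pandora):

import torch

def log_gpu_memory(tag=''):
    # Print current and peak GPU memory, to see whether evaluation leaks
    # memory as the batches slow down.
    if torch.cuda.is_available():
        print('%s allocated: %.1f MB, peak: %.1f MB' % (
            tag,
            torch.cuda.memory_allocated() / 1e6,
            torch.cuda.max_memory_allocated() / 1e6))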

emanjavacas commented 6 years ago

I've traced back an issue with PyTorch training, which might be the culprit for what you were seeing. It affects training and not evaluation, though. It is related to Adam, and I can't see right now why this is happening. Basically, after a number of epochs there is a sudden explosion in the size of the gradient, and training slows down by a factor of 5. I've opened an issue on the PyTorch discussion forum to see if somebody can shed light on it.

https://discuss.pytorch.org/t/considerable-slowdown-in-adam-step-after-a-number-of-epochs/9185

For now, switching to Adagrad solves it.
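
A minimal sketch of that switch (the function name, flag, and learning rate below are illustrative assumptions, not pandora's actual code):

import torch.optim as optim

def build_optimizer(model, lr=0.001, use_adagrad=True):
    # Workaround mentioned above: construct Adagrad instead of Adam
    # until the Adam slowdown is understood.
    if use_adagrad:
        return optim.Adagrad(model.parameters(), lr=lr)
    return optim.Adam(model.parameters(), lr=lr)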