ragulpr / wtte-rnn

WTTE-RNN a framework for churn and time to event prediction
MIT License

batch size > 1 for varying sequence lengths #16

Closed TeaPearce closed 7 years ago

TeaPearce commented 7 years ago

I'm experimenting with using the wtte model on the CMAPSS dataset. I note you recommend using batch size = 1:

Loss was calculated as mean over timesteps. With batch-size 1 this means that each individual training sequence was given equal weight regardless of length.

However, this results in very slow training (around 200 seconds per epoch, even with a GPU).

Are there any workarounds for this? I've tried modifying the loss function so it's divided by the number of timesteps of interest (which I'd hoped would normalise it), but this doesn't produce good results.
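
Roughly what I mean, as a simplified sketch (a placeholder per-timestep loss stands in for the actual WTTE log-likelihood, and it assumes padded timesteps are all-zero in y_true):

```python
from keras import backend as K

def per_timestep_loss(y_true, y_pred):
    # Placeholder: in practice this would be the per-timestep Weibull
    # negative log-likelihood from the wtte package.
    return K.mean(K.square(y_true - y_pred), axis=-1)   # (batch, timesteps)

def masked_mean_loss(y_true, y_pred):
    loss = per_timestep_loss(y_true, y_pred)
    # 1 for real timesteps, 0 where every target value equals the pad value (0).
    mask = K.cast(K.any(K.not_equal(y_true, 0.), axis=-1), K.floatx())
    # Normalise each sequence by its own number of valid timesteps,
    # then average over the batch.
    seq_loss = K.sum(loss * mask, axis=-1) / K.maximum(K.sum(mask, axis=-1), 1.)
    return K.mean(seq_loss)
```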

ragulpr commented 7 years ago

I'm with you. batch_size > 1 is inshallah coming tomorrow in the updated data_pipeline. If you can't wait, the tests give you a hint: https://github.com/ragulpr/wtte-rnn/blob/master/python/tests/test_keras.py#L93

There's already support for it. You need to do masking and use sample weights 👌
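
A minimal sketch of that pattern, assuming zero-padded inputs/targets and the output_lambda / loss helpers used in the examples and tests (the exact argument names here are assumptions about your wtte version):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Masking, GRU, Dense, Lambda
import wtte.wtte as wtte

n_features, max_timesteps, mask_value = 5, 100, 0.

model = Sequential()
# Padded timesteps (all features == mask_value) are skipped by downstream layers.
model.add(Masking(mask_value=mask_value,
                  input_shape=(max_timesteps, n_features)))
model.add(GRU(20, activation='tanh', return_sequences=True))
model.add(Dense(2))
# Map the two outputs to Weibull (alpha, beta), as in the repo's examples.
model.add(Lambda(wtte.output_lambda,
                 arguments={'init_alpha': 1.0, 'max_beta_value': 4.0}))

# reduce_loss=False keeps the loss per timestep so masking/sample weights apply.
loss_fun = wtte.loss(kind='discrete', reduce_loss=False).loss_function
# 'temporal' tells Keras to expect one sample weight per (sequence, timestep).
model.compile(loss=loss_fun, optimizer='adam', sample_weight_mode='temporal')

# x: (n_seq, max_timesteps, n_features), y: (n_seq, max_timesteps, 2), both padded;
# sample_weights: (n_seq, max_timesteps) with 0 on padded steps.
# model.fit(x, y, sample_weight=sample_weights, batch_size=32, epochs=10)
```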

TeaPearce commented 7 years ago

Thanks! I think I got it.

ragulpr commented 7 years ago

Note that if you use a zero mask, you may not learn anything from the non-events, since the input vector may be all zeros at those steps. Ping @daynebatten

daynebatten commented 7 years ago

Not sure I'm following, @ragulpr. Can you give a little more detail as to when this would occur?

ragulpr commented 7 years ago

@daynebatten I actually just noticed that you use model.add(Masking(mask_value=0., input_shape=(max_time, 24))) link

But it was there all along, sorry. It actually shouldn't be a problem for the Turbofan dataset, since the sensor readings are essentially never exactly zero. The problem would occur in the [data_pipeline](https://github.com/ragulpr/wtte-rnn/blob/master/examples/data_pipeline/data_pipeline.ipynb) example, where we have long periods of nothingness but still want to propagate state. There, the input data could legitimately be all zeros, so with mask_value=0 we wouldn't predict or learn anything for those steps.

If you want to apply this to other datasets, I suggest using some highly unlikely mask value instead.
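
For example, something like this (the exact mask value is arbitrary; just pick something your normalized features can never take):

```python
import numpy as np
from keras.layers import Masking

mask_value = -1.3371337          # arbitrary, but impossible for normalized features
n_features, max_len = 3, 8

# Variable-length sequences, each of shape (timesteps_i, n_features).
seqs = [np.random.randn(t, n_features) for t in (5, 8, 3)]

# Right-pad every sequence to max_len with the mask value.
x = np.full((len(seqs), max_len, n_features), mask_value, dtype='float32')
for i, s in enumerate(seqs):
    x[i, :len(s)] = s

# Masking skips only timesteps where *every* feature equals mask_value, so
# genuine all-zero inputs ("nothing happened") are still predicted and learned from.
masking_layer = Masking(mask_value=mask_value, input_shape=(max_len, n_features))
```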

Also, we never said hi. Awesome work, @daynebatten! 👏🙌

daynebatten commented 7 years ago

That's a great point. It's probably a best practice to use a very large mask value. If you've normalized your data, the mask value should never come up by accident, and certainly not for all variables.

And I think most of the thanks go to you for doing all the heavy lifting here!

Manelmc commented 6 years ago

Hi Egil, I'm not sure I understand why you suggest using sample weights if we are already masking. From what I read on Stack Overflow:

If there's a mask in your model, it'll be propagated layer-by-layer and eventually applied to the loss. So if you're padding and masking the sequences in a correct way, the loss on the padding placeholders would be ignored.

If the loss on the padding placeholders is ignored, why do we need sample weights?

ragulpr commented 6 years ago

@Manelmc Good catch. While one would expect the mask to propagate to the loss function, there were some inconsistencies in Keras at the time of implementation w.r.t. how _weighted_masked_objective handled custom loss functions. That may have been fixed by now; if so, sample_weights are not necessary when you want equal weights.

I like to use sample_weights anyway since I usually don't use equal weights for every sample.
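
For example, a sketch of building the temporal weight matrix: zero on padded steps, and optionally 1/length on real steps so that every sequence gets the same total weight regardless of its length (similar to the batch-size-1 behaviour discussed above):

```python
import numpy as np

def make_sample_weights(seq_lengths, max_len, equal_per_sequence=True):
    """Temporal sample weights for padded sequences.

    seq_lengths: true (unpadded) length of each sequence.
    Returns shape (n_sequences, max_len), to pass as
    model.fit(..., sample_weight=...) with sample_weight_mode='temporal'.
    """
    weights = np.zeros((len(seq_lengths), max_len), dtype='float32')
    for i, n in enumerate(seq_lengths):
        # Padded steps keep weight 0; real steps share a total weight of 1
        # per sequence, or get weight 1 each if equal_per_sequence is False.
        weights[i, :n] = 1.0 / n if equal_per_sequence else 1.0
    return weights

# Example: three sequences of lengths 5, 8 and 3, padded to 8 timesteps.
sample_weights = make_sample_weights([5, 8, 3], max_len=8)
```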