It could be that `mark_nll` already contains the negative log-likelihood, so you don't need to negate it, as you do with `log_prob` (I'm not 100% certain though).
If this doesn't solve the problem, here are a few follow-up questions: do the NaNs appear in the `log_prob` term or in `mark_nll`? Does your data contain duplicate timestamps (i.e. zero inter-event times)?
Thanks for the pointers - the data did indeed have duplicate timestamps - I've cleaned those up now.
My new problem, which I'd also appreciate any insight on, is that the `log_prob` terms run very negative very quickly (the values in the tensor are all around -7), with the aggregated loss subsequently going negative quite quickly as well. My dataset is perhaps a little denser in time than the reddit one (surely just a scaling issue?), but with many fewer mark classes. I don't suppose you would know what feature of the dataset would cause this?
I should note that my `mark_nll` terms are still positive.
In general, there is nothing wrong with having very negative values. As you said, this just reflects the scaling of the data. Simply rescaling the arrival (or inter-event) times should fix this. I guess a good idea is to rescale the times such that the average inter-event time is equal to one. It's important, though, that you scale all the sequences by the same factor.
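A minimal sketch of that rescaling, assuming the data is stored as a list of 1-D arrays of arrival times (the variable names here are illustrative, not from the repo):

```python
import numpy as np

# `sequences` is assumed to be a list of 1-D numpy arrays of arrival times, one per sequence.
all_deltas = np.concatenate([np.diff(seq) for seq in sequences])
scale = all_deltas.mean()                      # average inter-event time over the whole dataset

# Divide every sequence by the same global factor, so the average delta becomes 1.
rescaled = [seq / scale for seq in sequences]
```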
A simple example to demonstrate the above point: imagine having a uniform distribution `p(x) = Uniform([0, 1])`; the log-density of any sample `x` in `[0, 1]` is `log p(x) = 0`. However, if you simply rescale all the samples, `y = x * 1000`, then `p(y) = Uniform([0, 1000])` with `log p(y) = -log(1000)` for any `y` in `[0, 1000]`. A very similar thing happens with TPP densities, but the scaling is not as straightforward, as it also depends on the number of events.
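A quick numeric check of this effect (my own illustration, not from the thread):

```python
import torch
from torch.distributions import Uniform

x = torch.rand(5)          # samples in [0, 1)
y = x * 1000               # the same samples rescaled

print(Uniform(0.0, 1.0).log_prob(x))       # all zeros: log 1 = 0
print(Uniform(0.0, 1000.0).log_prob(y))    # all -log(1000) ≈ -6.91
```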
Do you still get NaNs now after removing the duplicates?
Yes, removing the duplicates got rid of the NaNs. Thanks much
I've rescaled such that the average delta is 1, although I do still see negative loss - some of my data points, even after scaling, are still very close together. The distribution of inter-event times in my dataset is very bimodal, which may pose a problem - I'll reduce the number of mixture components as per https://github.com/shchur/ifl-tpp/issues/5#issuecomment-667594957.
It's mentioned in the paper that you normalise the loss by subtracting the score of LogNormMix - is this already done in the code you have provided here? I see that `model.log_prob` ultimately ends up calling `self.decoder.log_prob` (where `decoder = LogNormMix`), so I guess it is, or is there something else required?
Lastly, I'm wondering if you had at some point implemented simulation/sampling with marks as well? With reference to your response https://github.com/shchur/ifl-tpp/issues/6#issuecomment-679233400, I guess it would need to draw from `model.mark_layer`. Is a separate decoder for marks required?
Thank you very much again for your time!
Subtracting the loss of LogNormMix is done only for visualization in Figure 3. As I said before, we could arbitrarily shift the loss values of all models by the same amount by rescaling the inter-event times, so the absolute value of the loss for each model is irrelevant - only the differences between the models matter (e.g. if two models have losses 200.1 and 200.5, we could change them to 0.1 and 0.5 by simple rescaling).
In the case of marks, you would need to create a categorical distribution to sample the marks from:

```python
x = self.mark_layer(h)                   # project the history embedding to one score per mark class
x = F.log_softmax(x, dim=-1)             # normalized log-probabilities (F = torch.nn.functional)
mark_distribution = torch.distributions.Categorical(logits=x)
```
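As a usage note (mirroring the sampling loop later in the thread, so the names below are taken from there), you could then draw the next mark and look up its embedding:

```python
next_mark = mark_distribution.sample()               # integer mark index
next_mark_emb = model.rnn.mark_embedding(next_mark)  # embedding fed back into the RNN
```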
Ah okay. Thanks for clarifying.
Since it relates to learning and simulation specifically in the case of marks, I'll mention it here: I was able to use the code you provided in the other issue for simulation without marks, but I ran into some errors, which I'm not entirely sure how to correct, when trying to sample from a model that has been trained with marks.
Notably:
`RuntimeError: input.size(-1) must be equal to input_size. Expected <history_size+1>, got 1`
with respect to `next_in_time = torch.zeros(1, 1, 1)`.
Naively changing the last dimension to the expected size then produces:
`RuntimeError: Expected hidden[0] size (1, 1, history_size), got (1, history_size)`
I don't think I'm resolving that correctly - any guidance is appreciated.
Here is the code that should work:

```python
from torch.distributions import Categorical

next_in_time = torch.zeros(1, 1, 1)                                    # previous inter-event time feature, initialized to zero
next_mark_emb = torch.zeros(1, 1, general_config.mark_embedding_size)  # embedding of the previous mark
h = torch.zeros(1, 1, history_size)                                    # RNN hidden state
inter_times = []
marks = []
t_max = 1000

with torch.no_grad():
    while sum(inter_times) < t_max:
        # Update the history embedding with the previous event.
        rnn_input = torch.cat([next_in_time, next_mark_emb], dim=-1)
        _, h = model.rnn.step(rnn_input, h)

        # Sample the next inter-event time from the decoder.
        tau = model.decoder.sample(1, h)
        inter_times.append(tau.item())
        next_in_time = ((tau + 1e-8).log() - mean_in_train) / std_in_train

        # Sample the next mark from a categorical distribution over mark classes.
        mark_logits = model.mark_layer(h)
        mark_dist = Categorical(logits=mark_logits)
        next_in_mark = mark_dist.sample()
        marks.append(next_in_mark.item())
        next_mark_emb = model.rnn.mark_embedding(next_in_mark)
```
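A possible follow-up step (my addition, not part of the original reply): convert the sampled inter-event times into arrival times and pair them with the sampled marks.

```python
import numpy as np

arrival_times = np.cumsum(inter_times)        # t_i = sum of the first i sampled inter-event times
sampled_events = list(zip(arrival_times.tolist(), marks))
```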
Great! Thanks.
I've managed to modify it slightly so that it works with an LSTM, although that raised one additional question, since the LSTM hidden state output is a tuple.
For `_, h = model.rnn.step(rnn_input, h)`, I'm not sure why we are passing the hidden state, rather than the output encoding, to the decoder. Should it be `h, _ = ...` to retrieve the encoding of the history for the decoder? Apologies if there is some fundamental misunderstanding!
It's up to you to decide whether to use the hidden state or the output of the LSTM to obtain the conditional distribution. I don't have a strong intuition here. Probably, both versions should work equally well.
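For reference, here is a minimal sketch with a plain `torch.nn.LSTM` (the sizes are made up for illustration, not taken from the repo). For a single-layer LSTM processing a single step, the last-layer output and the hidden state `h_n` coincide, which is consistent with both choices working equally well:

```python
import torch
import torch.nn as nn

history_size = 64                    # assumed hidden size
input_size = 1 + 32                  # assumed: time feature + mark embedding size

lstm = nn.LSTM(input_size=input_size, hidden_size=history_size, batch_first=True)

rnn_input = torch.zeros(1, 1, input_size)             # (batch, seq_len=1, features)
state = (torch.zeros(1, 1, history_size),             # h_0: (num_layers, batch, hidden)
         torch.zeros(1, 1, history_size))             # c_0: (num_layers, batch, hidden)

output, (h_n, c_n) = lstm(rnn_input, state)
# For a single-layer, single-step LSTM the two candidate context vectors are identical:
assert torch.allclose(output[:, -1], h_n[-1])
context = output[:, -1]              # could equally use h_n[-1] as the decoder input
```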
Thanks for the release of your paper and code. In trying to implement learning with marks with the provided interactive notebook, adapting the remarks in the paper, I'm also running into some trouble. Based on appendix F.2, I assume it's a case of just adding the terms?
`model.log_prob` in this case returns `(time_log_prob, mark_nll, accuracy)` - so, adapting the training loop, is it as simple as changing the lines as below?
As a side problem: when doing the above with my custom dataset (which conforms to the same formatting as the example datasets, so `arrival_times` and `marks`), all loss terms are NaN. I'm wondering if you might have some insight as to why this might be occurring! When using the reddit dataset with the above modifications, I get non-zero loss terms for both `log_prob` and `mark_nll`.
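Pulling together this question and the reply at the top of the thread (that `mark_nll` is already a negative log-likelihood and should not be negated), a hedged sketch of the modified training objective could look like the following; the call signature, the `batch` variable, and the optimizer `opt` are assumptions for illustration, not the repo's verbatim code:

```python
# Assumed: model.log_prob(batch) returns per-event (time_log_prob, mark_nll, accuracy).
time_log_prob, mark_nll, accuracy = model.log_prob(batch)

# Negate the time log-likelihood, but do NOT negate mark_nll (it is already a NLL),
# then add the two terms as suggested by Appendix F.2.
loss = (-time_log_prob + mark_nll).mean()

opt.zero_grad()
loss.backward()
opt.step()
```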