rstriv / Know-Evolve

Implementation code for the ICML '17 paper "Know-Evolve: Deep Temporal Reasoning for Dynamic Knowledge Graphs"

Stability-related and MAR score/conditional density calculation issues #5

Open sumitpai opened 5 years ago

sumitpai commented 5 years ago

Dear Authors,

I was reading your paper and trying to implement it in TensorFlow. However, while doing so I came across a few things that I couldn't understand. First, I will post the queries I have about the paper.

I also got your C++ code running on my system. I noticed a few weird things related to the computation of the MAR score, which in turn affects the reported results. I will post those queries in the second part.

Queries about the paper:

  1. Equation 3 for the conditional intensity (lambda) is not bounded in the positive direction, so during optimisation it can explode depending on the gradient it receives (as lambda = exp(g)). How do you prevent this from happening? (A toy sketch of this and of query 2 follows this list.)

  2. In Equation 2, to compute the conditional density of an event, we have f = lambda * S. Aren't these two terms competing against each other? As lambda increases, S decreases. How do we ensure stability while training?

  3. In Section 4, you mention a trainable parameter W_e, but it is not used in any of the equations in the paper. What is it related to?

  4. It is also mentioned that V_e is a trainable parameter. Do you mean that it is trainable because it evolves over time, or is it actually a trainable node (i.e. apart from temporal evolution, it gets updated during backprop)?
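
To make queries 1 and 2 concrete, here is a toy program. It is not the authors' code; it simplifies the model to a constant intensity lambda = exp(g) with survival S = exp(-lambda * dt), just to show how the unbounded exponential overflows and how lambda and S pull against each other numerically:

    // Toy sketch, NOT the authors' code. Simplifies the model to a constant
    // intensity lambda = exp(g) with survival S = exp(-lambda * dt) to show
    // the overflow/underflow behaviour behind queries 1 and 2.
    #include <cmath>
    #include <cstdio>

    int main()
    {
        double dt = 2.0; // time since the last event for this (s, r, o)
        for (double g = 0.0; g <= 700.0; g += 100.0)
        {
            double lambda = std::exp(g);              // unbounded above; overflows for g > ~709
            double survival = std::exp(-lambda * dt); // underflows to 0 as lambda grows
            double density = lambda * survival;       // the two terms pull in opposite directions
            // Working in log-space avoids the overflow: log f = g - lambda * dt
            std::printf("g=%5.1f  lambda=%-12g  S=%-12g  f=%-12g  log f=%g\n",
                        g, lambda, survival, density, g - lambda * dt);
        }
        return 0;
    }

In log-space the density stays finite (log f = g - lambda * dt), which suggests stability is a numerical-representation question rather than a modelling contradiction.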

Queries related to the code base:

I cloned the repo and got it running on my system. While trying to understand the code, I noticed something weird. It is related to the way you compute the results (the MAR score) in the SparseOnedimRankCriterionLayer class.

I ran the code as it is, with the same set of hyperparameters you have given and without any code change; I just added a few missing includes to get it to compile on my system (see my forked repo).

While training, I noticed that the MAR score for the train samples was 1 right from batch 1. So I debugged further and noticed the following:

  1. In the SparseOnedimRankCriterionLayer class, the following lines of code compute the MAR score:

        this->loss = 1.0;  // rank of the true triple starts at 1
        ...
        // log conditional density of the true (subject, relation, object) triple
        sim = LogLL(sim, this->event_t - this->cur_time.GetCurTime(subject, object), true);
        //#pragma omp parallel for
        for (size_t i = 0; i < bg->entity_list.size(); ++i)
        {
            ...

            // log conditional density of the corruption with the object replaced
            auto& other_feat = operands[bg->entity_idx[pred_object]]->state->DenseDerived();
            D.GeMM(B, other_feat, Trans::N, Trans::T, 1.0, 0.0);
            Dtype cur_sim = D.data[0];
            cur_sim = LogLL(cur_sim, this->event_t - this->cur_time.GetCurTime(subject, object), true);
            std::cerr << cur_sim << " " << sim << std::endl;
            if (cur_sim > sim && order == RankOrder::DESC)
            {
                // a corruption outscored the true triple: worsen its rank
                this->loss++;
                std::cerr << "---Loss Increased" << std::endl;
            }
            ...

        }

In the above snippet, the variable this->loss holds the MAR score (the rank of the true triple) at the end of the loop. You compute the conditional density of the original triple, and in the for loop you compute the same for the corruptions. If a corruption has a higher conditional density, you increment the rank. This is absolutely fine.

However, when I debugged the code further, I noticed that the sim and cur_sim values for most of the original samples and for most of the corruptions come out as -inf. So the comparison if (cur_sim > sim && order == RankOrder::DESC) yields false and this->loss never changes; in other words, the MAR score stays pinned near 1. I am attaching screenshots of training and test; I printed the NLL and MAR score for every 10th batch while training. (Again, this relates to my questions 1 and 2 in the first section: the intensity explodes and -log(intensity) becomes -inf.) There are very few instances while training where the average rank is not 1 (the cases where LogLL is not -inf).
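
To illustrate the comparison semantics, here is a tiny standalone demo (mine, not from the repo) showing that once both log-likelihoods are -inf, the strict comparison in the ranking loop can never fire:

    // Tiny demo, not from the repo: when both log-likelihoods underflow to
    // -inf, the strict comparison used in the ranking loop is always false,
    // so this->loss stays at its initial value of 1.
    #include <cstdio>
    #include <limits>

    int main()
    {
        const double inf = std::numeric_limits<double>::infinity();
        double sim = -inf;     // log-density of the true triple
        double cur_sim = -inf; // log-density of a corruption
        double loss = 1.0;     // rank starts at 1
        if (cur_sim > sim)     // -inf > -inf is false
            loss++;
        std::printf("loss = %g\n", loss); // prints: loss = 1
        return 0;
    }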

I have been trying to figure this out for weeks. Am I doing something wrong? I have not modified your code base beyond adding a few print statements. Can you please help me with this? What is even weirder is that the results on the test set are similar to the ones you reported in the paper. I have attached those results as well. Hits@10 is >90% and avg_rank is close to 200, but this is due to the incorrect way of evaluating it, i.e. the -inf comparison keeps the MAR score around 1 in most cases. I couldn't get exactly the values you mention in the paper, but they are close (as you are not setting the random seed).

I am sure you would not have gotten -inf as the conditional density (the LogLL value for sim and cur_sim) or avg_rank == 1 right from batch 1. So I think you may have uploaded an incorrect/older dev version of the codebase. Can you please update the codebase if something was mismatched while uploading? Are these the exact hyperparameters (in the config file of run_large.sh) that you used for reporting the results in the paper? If not, can you please share the exact hyperparameter file? Could you also seed the random number generator so that I can reproduce your results?

This is the link to the forked repo. The only changes are that I added the missing includes for cppformat and vector.h and removed the parallel for, as it was crashing for me.

https://github.com/sumitpai/Know-Evolve

Below are the screenshots of training and test: [screenshots attached: hyperparams, train_1, train_2, test]

I look forward to your response.

jiwenfei commented 5 years ago

Hi, sorry, I have a question about understanding the loss function in the paper. I don't understand why its survival term needs to sum over all subject entities, all object entities, and all relations.

sumitpai commented 5 years ago

As far as I understand, the survival term captures the fact that no events happen during the time interval between two observed events. Hence you minimise the probability of the corruptions occurring during that interval.

sumitpai commented 5 years ago

@Authors I played with various hyperparameters and I always get similar output. I guess you are reusing the framework for your other papers as well and have evaluated the KE framework on other datasets. Can you please verify this issue and share the right set of hyperparameters, one for which the LogLL doesn't go to infinity?

Currently I am getting excellent results on ICEWS-full; however, that is because of the issue I mentioned above, in which case the results would not be correct (the evaluation doesn't rank the positive triple against any corruption and holds the MAR score at 1, i.e. the if condition never fires because of the -inf comparison).

rstriv commented 5 years ago

Queries about the paper:

  1. Equation 3 for the conditional intensity (lambda) is not bounded in the positive direction, so during optimisation it can explode depending on the gradient it receives (as lambda = exp(g)). How do you prevent this from happening?
    • While you train over the whole dataset as a single sequence (rather than breaking the sequence up as in a conventional RNN), you do truncated backpropagation and sliding-window training using a minibatch. The batch size helps to deal with the issue of vanishing and exploding gradients. The Global BPTT algorithm in the paper has more details on the training.
  2. In Equation 2, to compute the conditional density of an event, we have f = lambda * S. Aren't these two terms competing against each other? As lambda increases, S decreases. How do we ensure stability while training?
    • This is not true. lambda is the probability that a single event happens in an instantaneous time window [t, t + delta t]. S is the probability that this event survives (hence S is called the survival term), that is, up to any given future time point t' (t' > t), no other event occurs. These two terms are not competing with each other; the term f is a conditional probability density function, hence there is no issue of stability in this case. (See the identity written out after this list.)
  3. In Section 4, you mention a trainable parameter W_e, but it is not used in any of the equations in the paper. What is it related to?
    • Both W_e and W_r refer to the initial projection parameters that are used to project the entity and relation features (e.g. one-hot vectors) to low-dimensional embeddings.
  4. It is also mentioned that V_e is a trainable parameter. Do you mean that it is trainable because it evolves over time, or is it actually a trainable node (i.e. apart from temporal evolution, it gets updated during backprop)?
    • V_e is the embedding holder that evolves over time, as you mentioned, but it is not trained through backprop.
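
For readers following along, the point process identity behind answer 2 can be written out explicitly. This is my paraphrase of the standard survival-analysis relation the paper builds on, not a quote of its Equation 2:

    f(t \mid \bar{t}) = \lambda(t \mid \bar{t}) \, S(t \mid \bar{t}),
    \qquad
    S(t \mid \bar{t}) = \exp\!\left( -\int_{\bar{t}}^{t} \lambda(\tau \mid \bar{t}) \, d\tau \right)

Since f = -dS/dt, f integrates to at most 1 over t > t-bar: a larger lambda is exactly offset by a faster-decaying S, so the product is a well-defined density rather than two terms fighting each other.
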
rstriv commented 5 years ago

Hi, sorry, I have a question about understanding the loss function in the paper. I don't understand why its survival term needs to sum over all subject entities, all object entities, and all relations.

Consider that lambda signifies the probability of an event happening in an instantaneous interval [t, t + dt], and the survival term signifies no further event happening up to some future time point t'. However, in this case, the happening or non-happening of an event is qualified by a tensor: an event happens between a subject s and an object o under a relationship r. Hence the probability of events not happening should also be computed across all the combinations possible in this tensor. This is why the sum over all three components is required.
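
In symbols, the survival probability being described, i.e. that no event of any type occurs between the previous event time t_prev and t, sums the integrated intensity over every (subject, object, relation) combination. This is my paraphrase of the discussion above, not a quote of the paper's loss:

    S(t) = \exp\!\left( -\sum_{s} \sum_{o} \sum_{r} \int_{t_{\mathrm{prev}}}^{t} \lambda^{s,o}_{r}(\tau) \, d\tau \right)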

sumitpai commented 5 years ago

Hi Rakshit, any updates on this issue? Have you looked at the rank computation error I mentioned above?

woojeongjin commented 5 years ago

I also faced the same issue. Most sim and cur_sim values are -inf. I think this is because this->event_t - this->cur_time.GetCurTime(subject, object) = 0 in most cases.
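
This reading is consistent with a Rayleigh-style intensity of the form lambda = exp(g) * (t - t'), which is how I understand the paper's Equation 3 (an assumption on my part): the log-intensity then contains a log(t - t') term, which is -inf whenever the elapsed time is zero.

    // Tiny demo, not from the repo: if the intensity carries a (t - t') factor,
    // its log contains log(t - t'), which is -inf whenever t - t' == 0.
    #include <cmath>
    #include <cstdio>

    int main()
    {
        double g = 0.7;  // raw compatibility score (made-up value)
        double dt = 0.0; // same entity already updated at this timestamp
        double log_lambda = g + std::log(dt); // log(exp(g) * dt) = g + log(0) = -inf
        std::printf("log lambda = %g\n", log_lambda); // prints: log lambda = -inf
        return 0;
    }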

rstriv commented 5 years ago

Hi Everyone,

Thank you for your queries and sorry for latency on my side.

We are aware of this issue and are working to address it. It is not due to a code problem but is an artifact of the dataset. In general, the model assumes continuous-time data; however, these relational datasets also contain several events (involving the same subject or object) at a single time point. This leads to a situation where you get this->event_t - this->cur_time.GetCurTime(subject, object) = 0 in several samples.

This is not a problem during training, as the rank is an auxiliary report and is not used anywhere. During evaluation, however, the following happens. We update the embeddings of entities after each test point. In this particular case of multiple events at a single time point, for a given test triple (sim), while computing the rank against other objects (cur_sim), the rank should not be affected by object embeddings that underwent an update at the same time point; it should only consider object embeddings that were updated at a previous time point. Currently, when cur_sim is computed for an object that had an update at the same time point, it gives -inf (due to the t - t' term) and thereby ignores that object, essentially ranking only against objects that were updated previously.

I am currently working on a better way to handle this case without causing infinity. Note that the infinity issue won't occur if you don't have the same entity participating in multiple events at the same time point. I also found a couple of other minor fixes that need to be included. I will post new updates here in the next 2-3 weeks. Please stay tuned.
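
One way to make the intent above explicit, rather than relying on -inf to silently drop those objects, would be to skip same-timestamp objects before scoring them. The following is only an illustrative sketch of that idea, with made-up names and values, not a patch to the repo:

    // Illustrative sketch only (all names and values made up, not repo code):
    // explicitly skip objects whose embedding was already updated at the
    // current event's timestamp, instead of letting a log(t - t') = log(0)
    // term produce -inf and silently exclude them from the ranking.
    #include <cmath>
    #include <cstdio>
    #include <utility>
    #include <vector>

    int main()
    {
        double event_t = 10.0;
        double true_loglik = -3.0; // log-density of the true triple
        // (raw score, last update time) for each candidate object
        std::vector<std::pair<double, double>> candidates = {
            {1.5, 10.0}, // updated at this same time point -> skip
            {0.2,  9.0},
            {2.0,  8.5},
        };

        double rank = 1.0;
        for (const auto& c : candidates)
        {
            if (c.second >= event_t)
                continue; // same-timestamp update: exclude from ranking
            double dt = event_t - c.second;         // strictly positive here
            double loglik = c.first + std::log(dt); // toy stand-in for LogLL
            if (loglik > true_loglik)
                rank++;
        }
        std::printf("rank = %g\n", rank); // prints: rank = 3
        return 0;
    }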

Thanks