rstriv / Know-Evolve

Implementation code for ICML '17 paper "Deep Temporal Reasoning for Dynamic Knowledge Graphs"
107 stars · 22 forks

Program suddenly exits in first iteration #6

Closed Toni-Chan closed 5 years ago

Toni-Chan commented 5 years ago

In the first iteration of training, the program suddenly exits with no error message, mid-way through the feed-forward computation. I was running run_small.sh after compiling with make, using the datasets supplied in the package. The program stops here: main.cpp, mainloop(), lines 672-685:

        inputs.clear();
        GetMiniBatch_SEQ(e, event_mini_batch);

        int train_in_batch = BuildTrainNet(T_begin, event_mini_batch, inputs, 
                            lookup_entity_onehot, lookup_rel_onehot,
                            lookup_entity_init, lookup_rel_init);

        **gnn.FeedForward(inputs, TRAIN);**
        auto loss_map = gnn.GetLoss();
        if (cfg::iter % cfg::report_interval == 0)
        {
            Dtype nll = 0.0, avg_rank = 0.0, mae = 0.0, rmse = 0.0;
            for (auto it = loss_map.begin(); it != loss_map.end(); ++it)
            {

nngraph.cpp (in graphnnbase): void NNGraph<mode, Dtype>::FeedForward(std::map<std::string, IMatrix<mode, Dtype>* > input, Phase phase), lines 23-50:

    for (size_t i = 0; i < ordered_layers.size(); ++i)
    {
        std::cerr << "Running batch " << i << " of " << ordered_layers.size() << "\n";
        assert(layer_dict.count(ordered_layers[i].first));
        auto* cur_layer = layer_dict[ordered_layers[i].first];
        auto& operands = ordered_layers[i].second;
        assert(name_idx_map.count(cur_layer->name));
        if (operands.size() == 0 && ! hash[name_idx_map[cur_layer->name]])
            continue;

        bool ready = true;
        for (auto* layer : operands)
        {
            if (static_layer_dict.count(layer->name))
                continue;
            assert(name_idx_map.count(layer->name));
            auto idx = name_idx_map[layer->name];
            ready &= hash[idx];
        }
        hash[name_idx_map[cur_layer->name]] = ready;
        if (ready)
            **cur_layer->UpdateOutput(operands, phase);**
        else if (phase != TEST)
            throw std::runtime_error("wrong computation flow");
    }

param_layer.h (in graphnnbase): class ParamLayer, virtual void UpdateOutput(std::vector< ILayer<mode, Dtype>* >& operands, Phase phase):

    virtual void UpdateOutput(std::vector< ILayer<mode, Dtype>* >& operands, Phase phase) override
    {
        //**THE PROGRAM SUDDENLY STOPS HERE WITH RETURN CODE 0**
        assert(operands.size() == params.size());
        auto& cur_output = this->state->DenseDerived();
        for (size_t i = 0; i < operands.size(); ++i)
        {
            if (i == 0)
                params[i]->ResetOutput(operands[i]->state, &cur_output); 
            params[i]->UpdateOutput(operands[i]->state, &cur_output, i == 0 ? 0.0 : 1.0, phase);
        }
    }

Symptoms (screenshots attached): the exit happens on the very first iteration executed.

I have tracked the number of iterations that have run, and the program stops in the middle. The "Running Here +id" output is my own tracing of where the program stops; it stops at the same point every time. It does not seem to be caused by the asserts. For dependencies I am using Intel MKL with compilers and libraries at version 2019.1.144, and Parallel Studio XE at 2019.1.053.

shellshock1911 commented 5 years ago

The problem is a multithreading bug in Intel MKL that originates in SparseSurvivalNllLayer::UpdateOutput(). To fix it, comment out all #pragma omp parallel for lines in sparse_survival_nll_layer.cpp. The model will then be able to run through testing to completion.

Toni-Chan commented 5 years ago

Thanks. Removing all the #pragma omp parallel for directives solves this problem.