shogun-toolbox / shogun

Shōgun
http://shogun-toolbox.org
BSD 3-Clause "New" or "Revised" License

Assertion error from Eigen with feature pruning in CPruneVarSubMean #3952

Open lacava opened 6 years ago

lacava commented 6 years ago

I'm trying to run simple least angle regression on data that commonly has zero-variance features that need to be pruned. After running an instance of CPruneVarSubMean and CNormOne, I get an assertion error from the lars->train() step whenever the preprocessing step prunes the feature set. The assertion comes from Eigen's DenseBase.h and says

Assertion `rows == this->rows() && cols == this->cols() && "DenseBase::resize() does not actually allow to resize."' failed.
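
As far as I can tell, that assertion fires whenever resize() is called on an Eigen::Map with a shape that differs from the mapped buffer: a Map is only a view over existing memory, so it cannot actually be resized. Here is a standalone snippet that reproduces just the assertion (my guess at what happens internally once pruning shrinks the feature matrix; this is not Shogun code, and it needs assertions enabled, i.e. built without -DNDEBUG):

#include <Eigen/Dense>

int main()
{
    double data[12] = {0};
    // A Map is a view over existing memory; its size is fixed.
    Eigen::Map<Eigen::MatrixXd> m(data, 3, 4);
    // Resizing to anything other than the current 3x4 shape fires:
    // "DenseBase::resize() does not actually allow to resize."
    m.resize(2, 4);
    return 0;
}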

I have written a minimal working example below, adapted from the example least_angle_regression.cpp:

#include <shogun/base/init.h>
#include <shogun/base/some.h>
#include <shogun/labels/RegressionLabels.h>
#include <shogun/lib/SGVector.h>
#include <shogun/preprocessor/NormOne.h>
#include <shogun/regression/LeastAngleRegression.h>
#include <shogun/features/DenseFeatures.h>
#include <shogun/preprocessor/PruneVarSubMean.h>
#include <Eigen/Dense>
#include <iostream>

using namespace shogun;

int main(int, char*[])
{
    init_shogun_with_defaults();

    // input data: 3 features x 4 examples, with a constant (zero-variance) first row
    SGMatrix<float64_t> X(3, 4);
    X(0,0) = 1; X(0,1) = 1; X(0,2) = 1; X(0,3) = 1;
    X(1,0) = 1; X(1,1) = 2; X(1,2) = 3; X(1,3) = 4;
    X(2,0) = 5; X(2,1) = 7; X(2,2) = 6; X(2,3) = 8;
    // response
    SGVector<float64_t> Y(4);
    Y[0] = 1; Y[1] = 2; Y[2] = 3; Y[3] = 4;

    // print matrices
    std::cout << "X: ";
    X.display_matrix();
    std::cout << "Y: ";
    Y.display_vector();

    //![create_features]
    auto features_train = some<CDenseFeatures<float64_t>>(X);
    auto labels_train = some<CRegressionLabels>(Y);
    std::cout << "initial features_train: "
              << features_train->get_num_vectors() << "x"
              << features_train->get_num_features() << "\n";

    //![preprocess_features]
    auto SubMean = some<CPruneVarSubMean>();
    auto Normalize = some<CNormOne>();
    SubMean->init(features_train);
    SubMean->apply_to_feature_matrix(features_train);
    Normalize->init(features_train);
    Normalize->apply_to_feature_matrix(features_train);
    //![preprocess_features]
    std::cout << "pruned features_train:"
              << features_train->get_num_vectors() << "x"
              << features_train->get_num_features() << "\n";

    //![create_instance]
    auto lambda1 = 0.01;
    auto lars = some<CLeastAngleRegression>(false);
    lars->set_features(features_train);
    lars->set_labels(labels_train);
    lars->set_max_l1_norm(lambda1);
    //![create_instance]

    //![train_and_apply]
    lars->train();

    exit_shogun();
    return 0;
}

If I change the first feature of X to

X(0,0) = 1; X(0,1) = -1; X(0,2) = 1; X(0,3)=-1;

I get normal output:

X: matrix=[
[   1,  -1, 1,  -1],
[   1,  2,  3,  4],
[   5,  7,  6,  8]
]
Y: vector=[1,2,3,4]
initial features_train: 4x3
pruned features_train:4x3
lacava commented 6 years ago

I'm not sure whether I haven't gotten a response because this is a fairly obvious mistake on my part in handling the feature inputs. But I tried something simpler that makes me think it really is a bug in CPruneVarSubMean.

From the example data, I edited regression_1d_linear_features_train.dat / test.dat to add a zero-variance column, i.e.

$ head regression_2d_linear_features_train.dat
2.930347576400018639e+00,1.0
9.818686505333067416e+00,1.0
9.770739605572543951e+00,1.0
6.873307411926963262e+00,1.0
4.524304590676347715e+00,1.0
2.687774873766835881e+00,1.0
2.286970915807879923e+00,1.0
6.850411193297126999e+00,1.0
5.066365065222648845e+00,1.0
3.221632613529651579e+00,1.0

Then I edited least_angle_regression.cpp to use this dataset, i.e.

auto f_feats_train = some<CCSVFile>("../../data/regression_2d_linear_features_train.dat");
auto f_feats_test = some<CCSVFile>("../../data/regression_2d_linear_features_test.dat");

and I get the same error:

$ g++ least_angle_regression.cpp -lshogun -L/usr/lib/libshogun.* -O0 -std=c++11 -ggdb -o least_angle_regression
$ ./least_angle_regression
initial features_train: 30x2
pruned features_train:30x1
least_angle_regression: /usr/include/eigen3/Eigen/src/Core/DenseBase.h:261: void Eigen::DenseBase<Derived>::resize(Eigen::Index, Eigen::Index) [with Derived = Eigen::Map<Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0, Eigen::Stride<0, 0> >; Eigen::Index = long int]: Assertion `rows == this->rows() && cols == this->cols() && "DenseBase::resize() does not actually allow to resize."' failed.
Aborted (core dumped)

If I comment out the CPruneVarSubMean transformation in least_angle_regression.cpp (in least_angle_regression_noprune.cpp), the error goes away.

//auto SubMean = some<CPruneVarSubMean>();
auto Normalize = some<CNormOne>();
//SubMean->init(features_train);
//SubMean->apply_to_feature_matrix(features_train);
//SubMean->apply_to_feature_matrix(features_test);
$./least_angle_regression_noprune
initial features_train: 30x2
pruned features_train:30x2
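
As a stopgap workaround, it seems to work to drop zero-variance features manually before constructing the CDenseFeatures object, so that CPruneVarSubMean never has to shrink the matrix. A rough sketch in plain Eigen, using the num_features x num_vectors layout from the MWE above (prune_constant_rows is a hypothetical helper of mine, not a Shogun API):

#include <Eigen/Dense>
#include <iostream>
#include <vector>

// Hypothetical helper (not part of Shogun): drop zero-variance rows from a
// feature matrix stored as num_features x num_vectors.
static Eigen::MatrixXd prune_constant_rows(const Eigen::MatrixXd& X, double eps = 1e-12)
{
    std::vector<int> keep;
    for (int i = 0; i < X.rows(); ++i)
    {
        double mean = X.row(i).mean();
        double var = (X.row(i).array() - mean).square().sum() / X.cols();
        if (var > eps)
            keep.push_back(i); // keep only rows with non-negligible variance
    }
    Eigen::MatrixXd out(static_cast<Eigen::Index>(keep.size()), X.cols());
    for (std::size_t k = 0; k < keep.size(); ++k)
        out.row(static_cast<Eigen::Index>(k)) = X.row(keep[k]);
    return out;
}

int main()
{
    Eigen::MatrixXd X(3, 4);
    X << 1, 1, 1, 1,   // constant feature: zero variance, gets dropped
         1, 2, 3, 4,
         5, 7, 6, 8;
    std::cout << prune_constant_rows(X) << "\n"; // prints the remaining 2x4 block
    return 0;
}

The pruned matrix could then be copied into an SGMatrix<float64_t> before building the features, avoiding the in-place pruning path entirely.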

Any ideas? I also asked a question on Stack Overflow, and a comment pointed me to the need to use C++ placement new syntax to re-map Eigen matrices to the underlying data (see here). I'm not sure how this is handled internally.
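
For reference, this is the Eigen idiom that comment was pointing at: since a Map cannot be resized, re-binding it to different dimensions over the same buffer requires placement new. A minimal standalone sketch (again, not taken from Shogun's internals):

#include <Eigen/Dense>
#include <iostream>
#include <new> // placement new

int main()
{
    double data[6] = {1, 2, 3, 4, 5, 6};

    // View the buffer as a 2x3 matrix (column-major by default).
    Eigen::Map<Eigen::MatrixXd> m(data, 2, 3);

    // m.resize(2, 2); // would fire the same assertion as above

    // Placement new re-binds the Map to the same buffer with new dimensions.
    new (&m) Eigen::Map<Eigen::MatrixXd>(data, 2, 2);

    std::cout << m << "\n"; // now a 2x2 view of the same data
    return 0;
}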

EDIT: attaching the code & data for reference: least_angle_regression.cpp

#include <shogun/base/init.h>
#include <shogun/base/some.h>
#include <shogun/evaluation/MeanSquaredError.h>
#include <shogun/labels/RegressionLabels.h>
#include <shogun/lib/SGVector.h>
#include <shogun/io/CSVFile.h>
#include <shogun/preprocessor/NormOne.h>
#include <shogun/regression/LeastAngleRegression.h>
#include <shogun/features/DenseFeatures.h>
#include <shogun/preprocessor/PruneVarSubMean.h>
#include <iostream>
using namespace shogun;

int main(int, char*[])
{
    init_shogun_with_defaults();

    auto f_feats_train = some<CCSVFile>("../../data/regression_2d_linear_features_train.dat");
    auto f_feats_test = some<CCSVFile>("../../data/regression_2d_linear_features_test.dat");
    auto f_labels_train = some<CCSVFile>("../../data/regression_1d_linear_labels_train.dat");
    auto f_labels_test = some<CCSVFile>("../../data/regression_1d_linear_labels_test.dat");

    //![create_features]
    auto features_train = some<CDenseFeatures<float64_t>>(f_feats_train);
    auto features_test = some<CDenseFeatures<float64_t>>(f_feats_test);
    auto labels_train = some<CRegressionLabels>(f_labels_train);
    auto labels_test = some<CRegressionLabels>(f_labels_test);
    //![create_features]
    std::cout << "initial features_train: "
              << features_train->get_num_vectors() << "x"
              << features_train->get_num_features() << "\n";

    //![preprocess_features]
    auto SubMean = some<CPruneVarSubMean>();
    auto Normalize = some<CNormOne>();
    SubMean->init(features_train);
    SubMean->apply_to_feature_matrix(features_train);
    SubMean->apply_to_feature_matrix(features_test);
    Normalize->init(features_train);
    Normalize->apply_to_feature_matrix(features_train);
    Normalize->apply_to_feature_matrix(features_test);
    //![preprocess_features]

    //![create_instance]
    auto lambda1 = 0.01;
    auto lars = some<CLeastAngleRegression>(false);
    lars->set_features(features_train);
    lars->set_labels(labels_train);
    lars->set_max_l1_norm(lambda1);
    //![create_instance]
    std::cout << "pruned features_train:"
              << features_train->get_num_vectors() << "x"
              << features_train->get_num_features() << "\n";

    //![train_and_apply]
    lars->train();
    auto labels_predict = lars->apply_regression(features_test);

    //![extract_w]
    auto weights = lars->get_w();
    //![extract_w]

    //![evaluate_error]
    auto eval = some<CMeanSquaredError>();
    auto mse = eval->evaluate(labels_predict, labels_test);
    //![evaluate_error]

    // integration testing variables
    auto output = labels_test->get_labels();

    exit_shogun();
    return 0;
}

regression_2d_linear_features_train.dat:

2.930347576400018639e+00,1.0
9.818686505333067416e+00,1.0
9.770739605572543951e+00,1.0
6.873307411926963262e+00,1.0
4.524304590676347715e+00,1.0
2.687774873766835881e+00,1.0
2.286970915807879923e+00,1.0
6.850411193297126999e+00,1.0
5.066365065222648845e+00,1.0
3.221632613529651579e+00,1.0
5.415783724177677172e+00,1.0
8.268298303193043708e+00,1.0
1.318617492920814982e-01,1.0
6.433781860748214676e+00,1.0
2.544354433563237095e+00,1.0
5.784357316781001401e+00,1.0
6.004094745835546476e+00,1.0
2.088352529087369458e+00,1.0
9.089899026693386119e+00,1.0
4.770065714788831457e+00,1.0
6.536621824367120581e+00,1.0
8.065220778914559574e+00,1.0
9.340945531926289291e+00,1.0
3.725428382722462128e-01,1.0
6.491440262005463424e+00,1.0
8.970417958620853227e+00,1.0
6.136289194149116888e+00,1.0
7.101150726097365862e+00,1.0
2.281715214043989803e+00,1.0
6.095042052142956912e+00,1.0

regression_2d_linear_features_test.dat:

6.501454885037460052e+00,1.0
5.711892098470410239e+00,1.0
8.506213700179728221e+00,1.0
4.418215228904445624e+00,1.0
4.995308144087388769e+00,1.0
vigsterkr commented 6 years ago

@lacava it's not trivial at all, it's just that all of us are a bit busy. :( Sorry that nobody has looked into this yet... I'll try to look into it and see what could be done, or at least what the main issue is. Sorry again, and thanks heaps for this awesome bug report!