Open lacava opened 6 years ago
I'm not sure whether the lack of a response means this is a fairly obvious mistake on my part in handling the feature inputs. But I tried something simpler that makes me think it really is a bug in PruneVarSubMean.
From the example data, I edited regression_1d_linear_features_train.dat / test.dat to add a zero-variance column (saving the result as regression_2d_linear_features_train.dat / test.dat), i.e.
$ head regression_2d_linear_features_train.dat
2.930347576400018639e+00,1.0
9.818686505333067416e+00,1.0
9.770739605572543951e+00,1.0
6.873307411926963262e+00,1.0
4.524304590676347715e+00,1.0
2.687774873766835881e+00,1.0
2.286970915807879923e+00,1.0
6.850411193297126999e+00,1.0
5.066365065222648845e+00,1.0
3.221632613529651579e+00,1.0
then edited least_angle_regression.cpp to use this dataset, i.e.
auto f_feats_train = some<CCSVFile>("../../data/regression_2d_linear_features_train.dat");
auto f_feats_test = some<CCSVFile>("../../data/regression_2d_linear_features_test.dat");
and I get the same error:
$ g++ least_angle_regression.cpp -lshogun -L/usr/lib/libshogun.* -O0 -std=c++11 -ggdb -o least_angle_regression
$ ./least_angle_regression
initial features_train: 30x2
pruned features_train:30x1
least_angle_regression: /usr/include/eigen3/Eigen/src/Core/DenseBase.h:261: void Eigen::DenseBase<Derived>::resize(Eigen::Index, Eigen::Index) [with Derived = Eigen::Map<Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0, Eigen::Stride<0, 0> >; Eigen::Index = long int]: Assertion `rows == this->rows() && cols == this->cols() && "DenseBase::resize() does not actually allow to resize."' failed.
Aborted (core dumped)
If I comment out the CPruneVarSubMean transformation in least_angle_regression.cpp (saved as least_angle_regression_noprune.cpp), the error goes away:
//auto SubMean = some<CPruneVarSubMean>();
auto Normalize = some<CNormOne>();
//SubMean->init(features_train);
//SubMean->apply_to_feature_matrix(features_train);
//SubMean->apply_to_feature_matrix(features_test);
$ ./least_angle_regression_noprune
initial features_train: 30x2
pruned features_train:30x2
Any ideas? I also asked a question on Stack Overflow, and a comment pointed me to the need to use C++ placement new syntax to re-map Eigen matrices to the underlying data (see here). I'm not sure how this is handled internally.
EDIT: attaching the code & data for reference: least_angle_regression.cpp
#include <shogun/base/init.h>
#include <shogun/base/some.h>
#include <shogun/evaluation/MeanSquaredError.h>
#include <shogun/labels/RegressionLabels.h>
#include <shogun/lib/SGVector.h>
#include <shogun/io/CSVFile.h>
#include <shogun/preprocessor/NormOne.h>
#include <shogun/regression/LeastAngleRegression.h>
#include <shogun/features/DenseFeatures.h>
#include <shogun/preprocessor/PruneVarSubMean.h>
#include <iostream>
using namespace shogun;
int main(int, char*[])
{
init_shogun_with_defaults();
auto f_feats_train = some<CCSVFile>("../../data/regression_2d_linear_features_train.dat");
auto f_feats_test = some<CCSVFile>("../../data/regression_2d_linear_features_test.dat");
auto f_labels_train = some<CCSVFile>("../../data/regression_1d_linear_labels_train.dat");
auto f_labels_test = some<CCSVFile>("../../data/regression_1d_linear_labels_test.dat");
//![create_features]
auto features_train = some<CDenseFeatures<float64_t>>(f_feats_train);
auto features_test = some<CDenseFeatures<float64_t>>(f_feats_test);
auto labels_train = some<CRegressionLabels>(f_labels_train);
auto labels_test = some<CRegressionLabels>(f_labels_test);
//![create_features]
std::cout << "initial features_train: "
<< (*features_train).get_num_vectors()
<< "x" << (*features_train).get_num_features() << "\n";
//![preprocess_features]
auto SubMean = some<CPruneVarSubMean>();
auto Normalize = some<CNormOne>();
SubMean->init(features_train);
SubMean->apply_to_feature_matrix(features_train);
SubMean->apply_to_feature_matrix(features_test);
Normalize->init(features_train);
Normalize->apply_to_feature_matrix(features_train);
Normalize->apply_to_feature_matrix(features_test);
//![preprocess_features]
//![create_instance]
auto lambda1 = 0.01;
auto lars = some<CLeastAngleRegression>(false);
lars->set_features(features_train);
lars->set_labels(labels_train);
lars->set_max_l1_norm(lambda1);
//![create_instance]
std::cout << "pruned features_train:"
<< (*features_train).get_num_vectors()
<< "x" << (*features_train).get_num_features() << "\n";
//![train_and_apply]
lars->train();
auto labels_predict = lars->apply_regression(features_test);
//![extract_w]
auto weights = lars->get_w();
//![extract_w]
//![evaluate_error]
auto eval = some<CMeanSquaredError>();
auto mse = eval->evaluate(labels_predict, labels_test);
//![evaluate_error]
// integration testing variables
auto output = labels_test->get_labels();
exit_shogun();
return 0;
}
regression_2d_linear_features_train.dat:
2.930347576400018639e+00,1.0
9.818686505333067416e+00,1.0
9.770739605572543951e+00,1.0
6.873307411926963262e+00,1.0
4.524304590676347715e+00,1.0
2.687774873766835881e+00,1.0
2.286970915807879923e+00,1.0
6.850411193297126999e+00,1.0
5.066365065222648845e+00,1.0
3.221632613529651579e+00,1.0
5.415783724177677172e+00,1.0
8.268298303193043708e+00,1.0
1.318617492920814982e-01,1.0
6.433781860748214676e+00,1.0
2.544354433563237095e+00,1.0
5.784357316781001401e+00,1.0
6.004094745835546476e+00,1.0
2.088352529087369458e+00,1.0
9.089899026693386119e+00,1.0
4.770065714788831457e+00,1.0
6.536621824367120581e+00,1.0
8.065220778914559574e+00,1.0
9.340945531926289291e+00,1.0
3.725428382722462128e-01,1.0
6.491440262005463424e+00,1.0
8.970417958620853227e+00,1.0
6.136289194149116888e+00,1.0
7.101150726097365862e+00,1.0
2.281715214043989803e+00,1.0
6.095042052142956912e+00,1.0
regression_2d_linear_features_test.dat:
6.501454885037460052e+00,1.0
5.711892098470410239e+00,1.0
8.506213700179728221e+00,1.0
4.418215228904445624e+00,1.0
4.995308144087388769e+00,1.0
@lacava It's not trivial at all, it's just that all of us are a bit busy. :( Sorry that nobody has looked into this yet... I'll try to look into it and see what can be done, or at least what the main issue is. Sorry again, and thanks heaps for this awesome bug report!
I'm trying to do simple least angle regression on data that commonly has zero-variance features that need to be pruned. After running an instance of CPruneVarSubMean and CNormOne, I get an assertion error from the lars->train() step when the preprocessing step prunes the feature set. The assertion error comes from Eigen's DenseBase.h and says "DenseBase::resize() does not actually allow to resize." I have written a minimum working example here, adapted from the example least_angle_regression.cpp:
If I change the first feature of X to
I get a normal output: