usc-isi-i2 / dsbox-ta2

The DSBox TA2 component
MIT License

feature selection template breaks on some datasets #158

Closed serbanstan closed 6 years ago

serbanstan commented 6 years ago

Running python ta2-search /nas/home/stan/dsbox/runs2/config-ll0/LL0_uci_facebook_metrics_config.json fails with the output below:

(dsbox-devel-710) [stan@dsbox01 python]$ python ta2-search /nas/home/stan/dsbox/runs2/config-ll0/LL0_uci_facebook_metrics_config.json 
Namespace(configuration_file='/nas/home/stan/dsbox/runs2/config-ll0/LL0_uci_facebook_metrics_config.json', cpus=-1, debug=False, output_prefix=None, timeout=-1)
Using configuation:
{'cpus': '10',
 'dataset_schema': '/nfs1/dsbox-repo/data/datasets-v31/training_datasets/LL0/LL0_uci_facebook_metrics/LL0_uci_facebook_metrics_dataset/datasetDoc.json',
 'executables_root': '/nas/home/stan/dsbox/runs2/output-ll0/LL0_uci_facebook_metrics/executables',
 'pipeline_logs_root': '/nas/home/stan/dsbox/runs2/output-ll0/LL0_uci_facebook_metrics/logs',
 'problem_root': '/nfs1/dsbox-repo/data/datasets-v31/training_datasets/LL0/LL0_uci_facebook_metrics/LL0_uci_facebook_metrics_problem',
 'problem_schema': '/nfs1/dsbox-repo/data/datasets-v31/training_datasets/LL0/LL0_uci_facebook_metrics/LL0_uci_facebook_metrics_problem/problemDoc.json',
 'ram': '10Gi',
 'temp_storage_root': '/nas/home/stan/dsbox/runs2/output-ll0/LL0_uci_facebook_metrics/temp',
 'timeout': 19,
 'training_data_root': '/nfs1/dsbox-repo/data/datasets-v31/training_datasets/LL0/LL0_uci_facebook_metrics/LL0_uci_facebook_metrics_dataset'}
[INFO] No test data config found! Will split the data.
[INFO] - dsbox.controller.controller - Top level output directory: /nas/home/stan/dsbox/runs2/output-ll0/LL0_uci_facebook_metrics
[INFO] Template choices:
Template ' SRI_Mean_Baseline_Template ' has been added to template base.
Template ' regression_with_feature_selection ' has been added to template base.
Template ' default_regression_template ' has been added to template base.
Template ' default_text_regression_template ' has been added to template base.
Template ' UU3_Test_Template ' has been added to template base.
Template ' Default_timeseries_regression_template ' has been added to template base.
Template ' dsbox_regression_template ' has been added to template base.
[INFO] - dsbox.controller.controller - [INFO] Template 0:SRI_Mean_Baseline_Template Selected. UCT:[None, None, None, None, None, None, None]
[INFO] - dsbox.controller.controller - Searching template SRI_Mean_Baseline_Template
[INFO] - dsbox.controller.controller - cache size = 0
[INFO] Using Global Cache
[INFO] Worker started, id: <_MainProcess(MainProcess, started)> , True
/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
[INFO] Will use normal train-test mode ( n = 1 ) to choose best primitives.
shape N/A
[INFO] Push@cache: ('d3m.primitives.sri.baseline.MeanBaseline', -4855218410049917836)
[INFO] Testing finish.!!!
[INFO] Now in normal mode, will add extra train with train_dataset1
shape N/A
[INFO] Push@cache: ('d3m.primitives.sri.baseline.MeanBaseline', 7661347959384097990)
[INFO] Now are training the pipeline with all dataset and saving the pipeline.
shape N/A
[INFO] Push@cache: ('d3m.primitives.sri.baseline.MeanBaseline', -7966692816938464637)
!!!!!! TEST_DATASET1
{'cross_validation_metrics': [],
 'fitted_pipeline': <dsbox.pipeline.fitted_pipeline.FittedPipeline object at 0x7f307bc5e8d0>,
 'test_metrics': [{'column_name': 'Page_total_likes_target',
                   'metric': 'meanSquaredError',
                   'value': 324835865.8209}],
 'total_runtime': 29.75954556465149,
 'training_metrics': [{'column_name': 'Page_total_likes_target',
                       'metric': 'meanSquaredError',
                       'value': 248196403.78777778}]}
!!!!
[INFO] push@Candidate: (-8339109057533818947,a7f969b5-b62c-4177-86c6-6c6a0171b670)
[INFO] - dsbox.controller.controller - ******************
[INFO] Writing results
{'cross_validation_metrics': [],
 'fitted_pipeline': <dsbox.pipeline.fitted_pipeline.FittedPipeline object at 0x7f307bc5e8d0>,
 'test_metrics': [{'column_name': 'Page_total_likes_target',
                   'metric': 'meanSquaredError',
                   'value': 324835865.8209}],
 'total_runtime': 29.75954556465149,
 'training_metrics': [{'column_name': 'Page_total_likes_target',
                       'metric': 'meanSquaredError',
                       'value': 248196403.78777778}]}
[INFO] - dsbox.controller.controller - {'fitted_pipeline': <dsbox.pipeline.fitted_pipeline.FittedPipeline object at 0x7f307bc5e8d0>, 'training_metrics': [{'column_name': 'Page_total_likes_target', 'metric': 'meanSquaredError', 'value': 248196403.78777778}], 'cross_validation_metrics': [], 'test_metrics': [{'column_name': 'Page_total_likes_target', 'metric': 'meanSquaredError', 'value': 324835865.8209}], 'total_runtime': 29.75954556465149} 324835865.8209
[INFO] - dsbox.controller.controller - Training meanSquaredError = 248196403.78777778
[INFO] - dsbox.controller.controller - Validation meanSquaredError = 324835865.8209
[INFO] - dsbox.controller.controller - [INFO] report: 324835865.8209
[INFO] - dsbox.controller.controller - [INFO] UCT updated: [10.37492709536642, 111.44668586067014, 111.44668586067014, 111.44668586067014, 111.44668586067014, 111.44668586067014, 111.44668586067014]
[INFO] - dsbox.controller.controller - [INFO] cache size: 3, candidates: 1
[INFO] - dsbox.controller.controller - [INFO] New Best Value: 324835865.8209
[INFO] - dsbox.controller.controller - ******************
[INFO] Saving training results in /nas/home/stan/dsbox/runs2/output-ll0/LL0_uci_facebook_metrics.txt
[INFO] - dsbox.controller.controller - [INFO] Template 1:regression_with_feature_selection Selected. UCT:[10.37492709536642, 111.44668586067014, 111.44668586067014, 111.44668586067014, 111.44668586067014, 111.44668586067014, 111.44668586067014]
[INFO] - dsbox.controller.controller - Searching template regression_with_feature_selection
[INFO] - dsbox.controller.controller - cache size = 3
[INFO] Using Global Cache
[INFO] Worker started, id: <_MainProcess(MainProcess, started)> , True
[INFO] Will use normal train-test mode ( n = 1 ) to choose best primitives.
shape N/A
[INFO] Push@cache: ('d3m.primitives.dsbox.Denormalize', -4855218410049917836)
/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
shape N/A
[INFO] Push@cache: ('d3m.primitives.datasets.DatasetToDataFrame', -3113511419700388912)
2 (360, 20)
[INFO] Push@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', -6681729926407891142)
3 (360, 20)
[INFO] Push@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', -6605552540401501271)
4 (360, 17)
[INFO] Push@cache: ('d3m.primitives.dsbox.Profiler', -1856769840982695180)
5 (360, 17)
[INFO] Push@cache: ('d3m.primitives.dsbox.CleaningFeaturizer', 1974041053453766318)
6 (360, 17)
[INFO] Push@cache: ('d3m.primitives.dsbox.Encoder', 7040280347136615399)
7 (360, 45)
[INFO] Push@cache: ('d3m.primitives.dsbox.MeanImputation', -5235507793592112049)
8 (360, 45)
[INFO] Push@cache: ('d3m.primitives.sklearn_wrap.SKLasso', 3083280521510108968)
/nfs1/dsbox-repo/stan/sklearn-wrap/sklearn_wrap/SKLasso.py:192: UserWarning: With alpha=0, this algorithm does not converge well. You are advised to use the LinearRegression estimator
  self._clf.fit(self._training_inputs, sk_training_output)
/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/linear_model/coordinate_descent.py:477: UserWarning: Coordinate descent with no regularization may lead to unexpected results and is discouraged.
  positive)
/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/linear_model/coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)
Traceback (most recent call last):
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 552, in evaluate_pipeline
    evaluation_result = self._evaluate(configuration, cache, dump2disk)
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 608, in _evaluate
    fitted_pipeline.fit(cache=cache, inputs=[self.train_dataset2[each_repeat]])
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/pipeline/fitted_pipeline.py", line 94, in fit
    self.runtime.fit(**arguments)
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/runtime.py", line 218, in fit
    primitives_outputs[n_step].copy(), model)
  File "<string>", line 2, in __setitem__
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/multiprocessing/managers.py", line 756, in _callmethod
    conn.send((self._id, methodname, args, kwds))
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/nfs1/dsbox-repo/stan/d3m/d3m/primitive_interfaces/base.py", line 710, in __getstate__
    'params': self.get_params(),
  File "/nfs1/dsbox-repo/stan/sklearn-wrap/sklearn_wrap/SKLasso.py", line 224, in get_params
    target_names_=self._target_names
  File "/nfs1/dsbox-repo/stan/d3m/d3m/metadata/params.py", line 81, in __init__
    raise exceptions.InvalidArgumentTypeError("Value '{value}' for parameter '{name}' is not an instance of the type: {value_type}".format(value=value, name=name, value_type=value_type))
d3m.exceptions.InvalidArgumentTypeError: Value '[100, 100]' for parameter 'n_iter_' is not an instance of the type: typing.Union[int, NoneType]
[INFO] push@Candidate: (-3907683042653612828,None)
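
The final error in the traceback can be illustrated in isolation. The sketch below is not the d3m implementation: `check_param` is a hypothetical stand-in, assuming the validation in `d3m/metadata/params.py` amounts to an instance check against `typing.Union[int, NoneType]`. scikit-learn's Lasso reports `n_iter_` as a list with one entry per target when it is fitted on more than one target column, which would explain the value `[100, 100]` if two target columns reached the estimator.

```python
# Types accepted for n_iter_ according to the error message: int or None.
INT_OR_NONE = (int, type(None))


def check_param(name, value, allowed_types):
    """Hypothetical stand-in for the params type check (not d3m's code)."""
    if not isinstance(value, allowed_types):
        raise TypeError(
            "Value '{}' for parameter '{}' is not an instance of: {}".format(
                value, name, allowed_types))
    return value


def describe(n_iter):
    """Return 'accepted' or 'rejected' for a candidate n_iter_ value."""
    try:
        check_param("n_iter_", n_iter, INT_OR_NONE)
        return "accepted"
    except TypeError:
        return "rejected"


# A single-target fit yields a plain int; a two-target fit yields a list.
single_target = describe(100)
multi_target = describe([100, 100])
print(single_target, multi_target)
```

Under these assumptions the single-target value passes and the two-element list is rejected, matching the shape of the failure above.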

liangmuxin commented 6 years ago

I am now removing Lasso, so the only remaining feature selection option is select-by-percentile. I also found that this dataset has two suggested target columns, but the problem uses only one of them. This issue can also happen with other datasets.

Solution: always extract the target this way:

{
    "name": "extract_target_step",
    "primitives": [{
        "primitive": "d3m.primitives.data.ExtractColumnsBySemanticTypes",
        "hyperparameters": {
            'semantic_types': ('https://metadata.datadrivendiscovery.org/types/TrueTarget',),
            'use_columns': (),
            'exclude_columns': ()
        }
    }],
    "inputs": ["to_dataframe_step"]
},
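
To illustrate why filtering on TrueTarget helps, here is a minimal sketch in plain Python. It does not use the d3m API. The column metadata is hypothetical: only `Page_total_likes_target` appears in the log above, the second column name is invented, and the SuggestedTarget URI is assumed to be the type that marks both suggested columns.

```python
TRUE_TARGET = "https://metadata.datadrivendiscovery.org/types/TrueTarget"
# Assumption: suggested columns carry this semantic type.
SUGGESTED_TARGET = "https://metadata.datadrivendiscovery.org/types/SuggestedTarget"

# Hypothetical metadata: two suggested targets, only one used by the problem.
columns = {
    "Page_total_likes_target": {SUGGESTED_TARGET, TRUE_TARGET},
    "other_suggested_column": {SUGGESTED_TARGET},  # invented name
    "some_attribute": set(),                       # invented name
}


def columns_with_type(column_types, semantic_type):
    """Return the names of columns tagged with the given semantic type."""
    return [name for name, types in column_types.items()
            if semantic_type in types]


# Selecting on the broader type returns both columns, i.e. two targets.
suggested = columns_with_type(columns, SUGGESTED_TARGET)
# Selecting on TrueTarget returns only the column the problem actually uses.
true_targets = columns_with_type(columns, TRUE_TARGET)
print(len(suggested), true_targets)
```

With a single extracted target, the estimator would be fitted on one column only.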

liangmuxin commented 6 years ago

I will close this and open a new issue for this problem.