usc-isi-i2 / dsbox-ta2

The DSBox TA2 component
MIT License

Error in cast_to_type : float after using PCA #157

Closed serbanstan closed 6 years ago

serbanstan commented 6 years ago

Adding PCA with a fixed number of components to our pipeline makes our cast_to_type primitive break.

(dsbox-devel-710) [stan@dsbox01 python]$ python ta2-search /nas/home/stan/dsbox/runs2/config-ll0/LL0_uci_facebook_metrics_config.json 
Namespace(configuration_file='/nas/home/stan/dsbox/runs2/config-ll0/LL0_uci_facebook_metrics_config.json', cpus=-1, debug=False, output_prefix=None, timeout=-1)
Using configuation:
{'cpus': '10',
 'dataset_schema': '/nfs1/dsbox-repo/data/datasets-v31/training_datasets/LL0/LL0_uci_facebook_metrics/LL0_uci_facebook_metrics_dataset/datasetDoc.json',
 'executables_root': '/nas/home/stan/dsbox/runs2/output-ll0/LL0_uci_facebook_metrics/executables',
 'pipeline_logs_root': '/nas/home/stan/dsbox/runs2/output-ll0/LL0_uci_facebook_metrics/logs',
 'problem_root': '/nfs1/dsbox-repo/data/datasets-v31/training_datasets/LL0/LL0_uci_facebook_metrics/LL0_uci_facebook_metrics_problem',
 'problem_schema': '/nfs1/dsbox-repo/data/datasets-v31/training_datasets/LL0/LL0_uci_facebook_metrics/LL0_uci_facebook_metrics_problem/problemDoc.json',
 'ram': '10Gi',
 'temp_storage_root': '/nas/home/stan/dsbox/runs2/output-ll0/LL0_uci_facebook_metrics/temp',
 'timeout': 19,
 'training_data_root': '/nfs1/dsbox-repo/data/datasets-v31/training_datasets/LL0/LL0_uci_facebook_metrics/LL0_uci_facebook_metrics_dataset'}
[INFO] No test data config found! Will split the data.
[INFO] - dsbox.controller.controller - Top level output directory: /nas/home/stan/dsbox/runs2/output-ll0/LL0_uci_facebook_metrics
[INFO] Template choices:
Template ' SRI_Mean_Baseline_Template ' has been added to template base.
Template ' default_regression_template ' has been added to template base.
Template ' default_text_regression_template ' has been added to template base.
Template ' UU3_Test_Template ' has been added to template base.
Template ' Default_timeseries_regression_template ' has been added to template base.
Template ' regression_with_feature_selection ' has been added to template base.
Template ' dsbox_regression_template ' has been added to template base.
[INFO] - dsbox.controller.controller - [INFO] Template 0:SRI_Mean_Baseline_Template Selected. UCT:[None, None, None, None, None, None, None]
[INFO] - dsbox.controller.controller - Searching template SRI_Mean_Baseline_Template
[INFO] - dsbox.controller.controller - cache size = 0
[INFO] Using Global Cache
[INFO] Worker started, id: <_MainProcess(MainProcess, started)> , True
/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
[INFO] Will use normal train-test mode ( n = 1 ) to choose best primitives.
shape N/A
[INFO] Push@cache: ('d3m.primitives.sri.baseline.MeanBaseline', 523735248979574436)
[INFO] Testing finish.!!!
[INFO] Now in normal mode, will add extra train with train_dataset1
shape N/A
[INFO] Push@cache: ('d3m.primitives.sri.baseline.MeanBaseline', -8667387790502098007)
[INFO] Now are training the pipeline with all dataset and saving the pipeline.
shape N/A
[INFO] Push@cache: ('d3m.primitives.sri.baseline.MeanBaseline', -6906265235819350)
!!!!!! TEST_DATASET1
{'cross_validation_metrics': [],
 'fitted_pipeline': <dsbox.pipeline.fitted_pipeline.FittedPipeline object at 0x7f305a7b0390>,
 'test_metrics': [{'column_name': 'Page_total_likes_target',
                   'metric': 'meanSquaredError',
                   'value': 324835865.8209}],
 'total_runtime': 29.177137851715088,
 'training_metrics': [{'column_name': 'Page_total_likes_target',
                       'metric': 'meanSquaredError',
                       'value': 248196403.78777778}]}
!!!!
[INFO] push@Candidate: (-2274051079072489220,f81dabd3-b61a-4687-a121-2d4546d2139b)
[INFO] - dsbox.controller.controller - ******************
[INFO] Writing results
{'cross_validation_metrics': [],
 'fitted_pipeline': <dsbox.pipeline.fitted_pipeline.FittedPipeline object at 0x7f305a7b0390>,
 'test_metrics': [{'column_name': 'Page_total_likes_target',
                   'metric': 'meanSquaredError',
                   'value': 324835865.8209}],
 'total_runtime': 29.177137851715088,
 'training_metrics': [{'column_name': 'Page_total_likes_target',
                       'metric': 'meanSquaredError',
                       'value': 248196403.78777778}]}
[INFO] - dsbox.controller.controller - {'fitted_pipeline': <dsbox.pipeline.fitted_pipeline.FittedPipeline object at 0x7f305a7b0390>, 'training_metrics': [{'column_name': 'Page_total_likes_target', 'metric': 'meanSquaredError', 'value': 248196403.78777778}], 'cross_validation_metrics': [], 'test_metrics': [{'column_name': 'Page_total_likes_target', 'metric': 'meanSquaredError', 'value': 324835865.8209}], 'total_runtime': 29.177137851715088} 324835865.8209
[INFO] - dsbox.controller.controller - Training meanSquaredError = 248196403.78777778
[INFO] - dsbox.controller.controller - Validation meanSquaredError = 324835865.8209
[INFO] - dsbox.controller.controller - [INFO] report: 324835865.8209
[INFO] - dsbox.controller.controller - [INFO] UCT updated: [10.348094163295801, 111.17835653996396, 111.17835653996396, 111.17835653996396, 111.17835653996396, 111.17835653996396, 111.17835653996396]
[INFO] - dsbox.controller.controller - [INFO] cache size: 3, candidates: 1
[INFO] - dsbox.controller.controller - [INFO] New Best Value: 324835865.8209
[INFO] - dsbox.controller.controller - ******************
[INFO] Saving training results in /nas/home/stan/dsbox/runs2/output-ll0/LL0_uci_facebook_metrics.txt
[INFO] - dsbox.controller.controller - [INFO] Template 1:default_regression_template Selected. UCT:[10.348094163295801, 111.17835653996396, 111.17835653996396, 111.17835653996396, 111.17835653996396, 111.17835653996396, 111.17835653996396]
[INFO] - dsbox.controller.controller - Searching template default_regression_template
[INFO] - dsbox.controller.controller - cache size = 3
[INFO] Using Global Cache
[INFO] Worker started, id: <_MainProcess(MainProcess, started)> , True
[INFO] Will use cross validation( n = 10 ) to choose best primitives.
shape N/A
[INFO] Push@cache: ('d3m.primitives.dsbox.Denormalize', -8667387790502098007)
/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
shape N/A
[INFO] Push@cache: ('d3m.primitives.datasets.DatasetToDataFrame', -6784794392826866445)
(400, 20)
[INFO] Push@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', -5361314593299247151)
(400, 20)
[INFO] Push@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', -8917049308239181810)
(400, 17)
[INFO] Push@cache: ('d3m.primitives.dsbox.Profiler', -1730383627972381479)
(400, 17)
[INFO] Push@cache: ('d3m.primitives.dsbox.CleaningFeaturizer', 5634579432155763573)
(400, 17)
[INFO] Push@cache: ('d3m.primitives.dsbox.CorexText', 2857920993194579986)
(400, 17)
[INFO] Push@cache: ('d3m.primitives.dsbox.Encoder', 7949591534899160329)
(400, 45)
[INFO] Push@cache: ('d3m.primitives.dsbox.MeanImputation', 4704891850192963082)
(400, 45)
[INFO] Push@cache: ('d3m.primitives.sklearn_wrap.SKMaxAbsScaler', -5540225622402709454)
(400, 45)
[INFO] Push@cache: ('d3m.primitives.sklearn_wrap.SKPCA', -8561610737237944930)
(400, 5)
[INFO] Push@cache: ('d3m.primitives.data.CastToType', -1444505970168401487)
Traceback (most recent call last):
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 552, in evaluate_pipeline
    evaluation_result = self._evaluate(configuration, cache, dump2disk)
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 575, in _evaluate
    fitted_pipeline.fit(cache=cache, inputs=[self.train_dataset1])
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/pipeline/fitted_pipeline.py", line 94, in fit
    self.runtime.fit(**arguments)
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/runtime.py", line 210, in fit
    primitive_arguments
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/runtime.py", line 304, in _primitive_step_fit
    produce_result = model.produce(**produce_params)
  File "/nfs1/dsbox-repo/stan/common-primitives/common_primitives/cast_to_type.py", line 80, in produce
    outputs = inputs.iloc[:, columns_to_use].astype(type_to_cast)
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/pandas/core/indexing.py", line 1367, in __getitem__
    return self._getitem_tuple(key)
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/pandas/core/indexing.py", line 1737, in _getitem_tuple
    self._has_valid_tuple(tup)
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/pandas/core/indexing.py", line 204, in _has_valid_tuple
    if not self._has_valid_type(k, i):
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/pandas/core/indexing.py", line 1674, in _has_valid_type
    return self._is_valid_list_like(key, axis)
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/pandas/core/indexing.py", line 1731, in _is_valid_list_like
    raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds
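
One plausible reading of the log, not confirmed anywhere in this thread: the frame shrinks from (400, 45) to (400, 5) after SKPCA, and the traceback fails at `inputs.iloc[:, columns_to_use]`, so `columns_to_use` may still hold indices from the wider frame. The sketch below reproduces the same pandas error under that assumption. The helper `cast_columns`, the random data, and the stale index list are all hypothetical stand-ins, not the actual common-primitives code.

```python
import numpy as np
import pandas as pd

# Stand-in for the 5-column frame that comes out of PCA in the log above.
reduced = pd.DataFrame(np.random.rand(400, 5))

# ASSUMPTION: column indices computed against the 45-column frame
# that existed before PCA (hypothetical, for illustration only).
columns_to_use = list(range(45))


def cast_columns(frame, columns, dtype):
    """Hypothetical helper mirroring the failing line in the traceback."""
    return frame.iloc[:, columns].astype(dtype)


# With stale indices, pandas rejects the positional lookup.
try:
    cast_columns(reduced, columns_to_use, float)
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)

# Restricting the indices to columns that actually exist avoids the error.
valid_columns = [c for c in columns_to_use if c < reduced.shape[1]]
safe = cast_columns(reduced, valid_columns, float)
print(safe.shape)
```

If the cause is indeed stale column indices, filtering them as in the last step only hides the symptom; the more robust fix would be to recompute the column selection from the frame that PCA actually produced.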

serbanstan commented 6 years ago

Seems to have been solved. This dataset is still a problem, however; see https://github.com/usc-isi-i2/dsbox-ta2/issues/159 . Closing.