usc-isi-i2 / dsbox-ta2

The DSBox TA2 component

LL0_6332_cylinder_bands, LL0_301_ozone_level break our pipeline structure #155

Closed serbanstan closed 6 years ago

serbanstan commented 6 years ago

We are unable to find successful pipelines on this dataset. However, size doesn't seem to be the problem.

(dsbox-devel-710) [stan@dsbox01 python]$ python ta2-search /nas/home/stan/dsbox/runs2/config-ll0/LL0_6332_cylinder_bands_config.json 
Namespace(configuration_file='/nas/home/stan/dsbox/runs2/config-ll0/LL0_6332_cylinder_bands_config.json', cpus=-1, debug=False, output_prefix=None, timeout=-1)
Using configuation:
{'cpus': '10',
 'dataset_schema': '/nfs1/dsbox-repo/data/datasets-v31/training_datasets/LL0/LL0_6332_cylinder_bands/LL0_6332_cylinder_bands_dataset/datasetDoc.json',
 'executables_root': '/nas/home/stan/dsbox/runs2/output-ll0/LL0_6332_cylinder_bands/executables',
 'pipeline_logs_root': '/nas/home/stan/dsbox/runs2/output-ll0/LL0_6332_cylinder_bands/logs',
 'problem_root': '/nfs1/dsbox-repo/data/datasets-v31/training_datasets/LL0/LL0_6332_cylinder_bands/LL0_6332_cylinder_bands_problem',
 'problem_schema': '/nfs1/dsbox-repo/data/datasets-v31/training_datasets/LL0/LL0_6332_cylinder_bands/LL0_6332_cylinder_bands_problem/problemDoc.json',
 'ram': '10Gi',
 'temp_storage_root': '/nas/home/stan/dsbox/runs2/output-ll0/LL0_6332_cylinder_bands/temp',
 'timeout': 19,
 'training_data_root': '/nfs1/dsbox-repo/data/datasets-v31/training_datasets/LL0/LL0_6332_cylinder_bands/LL0_6332_cylinder_bands_dataset'}
[INFO] No test data config found! Will split the data.
[INFO] Template choices:
Template ' SRI_Mean_Baseline_Template ' has been added to template base.
Template ' default_classification_template ' has been added to template base.
Template ' default_text_classification_template ' has been added to template base.
Template ' Default_timeseries_collection_template ' has been added to template base.
Template ' dsbox_classification_template ' has been added to template base.
[INFO] Using Global Cache
[INFO] Worker started, id: <_MainProcess(MainProcess, started)> , True
/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
[INFO] Will use normal train-test mode ( n = 1 ) to choose best primitives.
[INFO] Push@cache: ('d3m.primitives.sri.baseline.MeanBaseline', 8862230842380113799)
/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
[INFO] Testing finish.!!!
[INFO] Now in normal mode, will add extra train with train_dataset1
[INFO] Push@cache: ('d3m.primitives.sri.baseline.MeanBaseline', 5692052847194071262)
[INFO] Now are training the pipeline with all dataset and saving the pipeline.
[INFO] Push@cache: ('d3m.primitives.sri.baseline.MeanBaseline', 6387489125683967746)
!!!!!! TEST_DATASET1
{'cross_validation_metrics': [],
 'fitted_pipeline': <dsbox.pipeline.fitted_pipeline.FittedPipeline object at 0x7f057e47f518>,
 'test_metrics': [{'column_name': 'band_type',
                   'metric': 'f1Macro',
                   'value': 0.3647058823529412}],
 'total_runtime': 29.0382297039032,
 'training_metrics': [{'column_name': 'band_type',
                       'metric': 'f1Macro',
                       'value': 0.367047308319739}]}
!!!!
[INFO] push@Candidate: (3779757731712582950,1dfbd95a-a531-4f1b-9ad4-ac27fcf8a983)
{'cross_validation_metrics': [],
 'fitted_pipeline': <dsbox.pipeline.fitted_pipeline.FittedPipeline object at 0x7f057e47f518>,
 'test_metrics': [{'column_name': 'band_type',
                   'metric': 'f1Macro',
                   'value': 0.3647058823529412}],
 'total_runtime': 29.0382297039032,
 'training_metrics': [{'column_name': 'band_type',
                       'metric': 'f1Macro',
                       'value': 0.367047308319739}]}
[INFO] Using Global Cache
[INFO] Worker started, id: <_MainProcess(MainProcess, started)> , True
[INFO] Will use cross validation( n = 10 ) to choose best primitives.
[INFO] Push@cache: ('d3m.primitives.dsbox.Denormalize', 5692052847194071262)
/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
[INFO] Push@cache: ('d3m.primitives.datasets.DatasetToDataFrame', -3469096916287557651)
[INFO] Push@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', 9066572204459050990)
[INFO] Push@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', -1874491454481230817)
[INFO] Push@cache: ('d3m.primitives.dsbox.Profiler', 2413907835440300144)
/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/pandas/core/indexing.py:621: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item_labels[indexer[info_axis]]] = value
[INFO] Push@cache: ('d3m.primitives.dsbox.CleaningFeaturizer', 6258074368309404413)
/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/pandas/core/indexing.py:537: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
[INFO] Push@cache: ('d3m.primitives.dsbox.CorexText', -8302676664513276927)
[INFO] Push@cache: ('d3m.primitives.dsbox.Encoder', 5382648654500800146)
[INFO] Push@cache: ('d3m.primitives.dsbox.MeanImputation', 5037988628383687394)
[INFO] Push@cache: ('d3m.primitives.sklearn_wrap.SKPCA', 8482212790481700249)
[INFO] Push@cache: ('d3m.primitives.sklearn_wrap.SKMaxAbsScaler', -7052326104389439606)
[INFO] Push@cache: ('d3m.primitives.data.CastToType', 5090693617328421985)
[INFO] Push@cache: ('d3m.primitives.sklearn_wrap.SKRandomForestClassifier', 3112002833650061931)
[INFO] Testing finish.!!!
[INFO] Now in normal mode, will add extra train with train_dataset1
[INFO] Hit@cache: ('d3m.primitives.dsbox.Denormalize', 5692052847194071262)
[INFO] Hit@cache: ('d3m.primitives.datasets.DatasetToDataFrame', -3469096916287557651)
[INFO] Hit@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', 9066572204459050990)
[INFO] Hit@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', -1874491454481230817)
[INFO] Hit@cache: ('d3m.primitives.dsbox.Profiler', 2413907835440300144)
[INFO] Hit@cache: ('d3m.primitives.dsbox.CleaningFeaturizer', 6258074368309404413)
[INFO] Hit@cache: ('d3m.primitives.dsbox.CorexText', -8302676664513276927)
[INFO] Hit@cache: ('d3m.primitives.dsbox.Encoder', 5382648654500800146)
[INFO] Hit@cache: ('d3m.primitives.dsbox.MeanImputation', 5037988628383687394)
[INFO] Hit@cache: ('d3m.primitives.sklearn_wrap.SKPCA', 8482212790481700249)
[INFO] Hit@cache: ('d3m.primitives.sklearn_wrap.SKMaxAbsScaler', -7052326104389439606)
[INFO] Hit@cache: ('d3m.primitives.data.CastToType', 5090693617328421985)
[INFO] Hit@cache: ('d3m.primitives.sklearn_wrap.SKRandomForestClassifier', 3112002833650061931)
/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/pandas/core/indexing.py:621: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item_labels[indexer[info_axis]]] = value
/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/pandas/core/indexing.py:537: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
[INFO] Now are training the pipeline with all dataset and saving the pipeline.
[INFO] Push@cache: ('d3m.primitives.dsbox.Denormalize', 6387489125683967746)
[INFO] Push@cache: ('d3m.primitives.datasets.DatasetToDataFrame', -4520796202379798502)
[INFO] Push@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', 7270534236577981231)
[INFO] Push@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', 9017461908300853567)
[INFO] Push@cache: ('d3m.primitives.dsbox.Profiler', -7577823149435098899)
[INFO] Push@cache: ('d3m.primitives.dsbox.CleaningFeaturizer', -8988122307913698267)
[INFO] Push@cache: ('d3m.primitives.dsbox.CorexText', 7687565503992275606)
[INFO] Push@cache: ('d3m.primitives.dsbox.Encoder', 2990934636288289629)
[INFO] Push@cache: ('d3m.primitives.dsbox.MeanImputation', -8724358386827322438)
[INFO] Push@cache: ('d3m.primitives.sklearn_wrap.SKPCA', -4399869666562517803)
Traceback (most recent call last):
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 546, in evaluate_pipeline
    evaluation_result = self._evaluate(configuration, cache, dump2disk)
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 756, in _evaluate
    fitted_pipeline2.fit(cache=cache, inputs=[self.all_dataset])
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/pipeline/fitted_pipeline.py", line 94, in fit
    self.runtime.fit(**arguments)
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/runtime.py", line 204, in fit
    primitive_arguments
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/runtime.py", line 295, in _primitive_step_fit
    model.fit()
  File "/nfs1/dsbox-repo/stan/sklearn-wrap/sklearn_wrap/SKPCA.py", line 164, in fit
    self._clf.fit(self._training_inputs)
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/decomposition/pca.py", line 329, in fit
    self._fit(X)
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/decomposition/pca.py", line 370, in _fit
    copy=self.copy)
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
    _assert_all_finite(array)
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
[INFO] push@Candidate: (8575703548964151367,None)
Traceback (most recent call last):
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 401, in setup_initial_candidate
    candidate.data.update(result)
TypeError: 'NoneType' object is not iterable
[ERROR] Initial Pipeline failed, Trying a random pipeline ...
{'cast_1_step': {'hyperparameters': {'type_to_cast': 'float'},
                 'primitive': 'd3m.primitives.data.CastToType'},
 'clean_step': {'hyperparameters': {},
                'primitive': 'd3m.primitives.dsbox.CleaningFeaturizer'},
 'corex_step': {'hyperparameters': {},
                'primitive': 'd3m.primitives.dsbox.CorexText'},
 'denormalize_step': {'hyperparameters': {},
                      'primitive': 'd3m.primitives.dsbox.Denormalize'},
 'dim_red_step': {'hyperparameters': {},
                  'primitive': 'd3m.primitives.sklearn_wrap.SKPCA'},
 'encoder_step': {'hyperparameters': {},
                  'primitive': 'd3m.primitives.dsbox.Encoder'},
 'extract_attribute_step': {'hyperparameters': {'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Attribute',)},
                            'primitive': 'd3m.primitives.data.ExtractColumnsBySemanticTypes'},
 'extract_target_step': {'hyperparameters': {'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Target',
                                                                'https://metadata.datadrivendiscovery.org/types/SuggestedTarget')},
                         'primitive': 'd3m.primitives.data.ExtractColumnsBySemanticTypes'},
 'impute_step': {'hyperparameters': {},
                 'primitive': 'd3m.primitives.dsbox.MeanImputation'},
 'model_step': {'hyperparameters': {'bootstrap': True,
                                    'max_depth': 15,
                                    'max_features': 'auto',
                                    'min_samples_leaf': 1,
                                    'min_samples_split': 2,
                                    'n_estimators': 10},
                'primitive': 'd3m.primitives.sklearn_wrap.SKRandomForestClassifier'},
 'profiler_step': {'hyperparameters': {},
                   'primitive': 'd3m.primitives.dsbox.Profiler'},
 'scaler_step': {'hyperparameters': {},
                 'primitive': 'd3m.primitives.sklearn_wrap.SKMaxAbsScaler'},
 'to_dataframe_step': {'hyperparameters': {},
                       'primitive': 'd3m.primitives.datasets.DatasetToDataFrame'}}
--------------------
[INFO] Worker started, id: <_MainProcess(MainProcess, started)> , True
[INFO] Will use cross validation( n = 10 ) to choose best primitives.
[INFO] Hit@cache: ('d3m.primitives.dsbox.Denormalize', 5692052847194071262)
[INFO] Hit@cache: ('d3m.primitives.datasets.DatasetToDataFrame', -3469096916287557651)
[INFO] Hit@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', 9066572204459050990)
[INFO] Hit@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', -1874491454481230817)
[INFO] Hit@cache: ('d3m.primitives.dsbox.Profiler', 2413907835440300144)
[INFO] Hit@cache: ('d3m.primitives.dsbox.CleaningFeaturizer', 6258074368309404413)
[INFO] Hit@cache: ('d3m.primitives.dsbox.CorexText', -8302676664513276927)
[INFO] Hit@cache: ('d3m.primitives.dsbox.Encoder', 5382648654500800146)
[INFO] Hit@cache: ('d3m.primitives.dsbox.MeanImputation', 5037988628383687394)
[INFO] Push@cache: ('d3m.primitives.dsbox.DoNothing', 8482212790481700249)
[INFO] Push@cache: ('d3m.primitives.sklearn_wrap.SKMaxAbsScaler', -1847985700521651967)
[INFO] Push@cache: ('d3m.primitives.dsbox.DoNothing', 5671574285360898223)
[INFO] Push@cache: ('d3m.primitives.sklearn_wrap.SKRandomForestClassifier', -5659209142612764594)
[INFO] Testing finish.!!!
[INFO] Now in normal mode, will add extra train with train_dataset1
[INFO] Hit@cache: ('d3m.primitives.dsbox.Denormalize', 5692052847194071262)
[INFO] Hit@cache: ('d3m.primitives.datasets.DatasetToDataFrame', -3469096916287557651)
[INFO] Hit@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', 9066572204459050990)
[INFO] Hit@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', -1874491454481230817)
[INFO] Hit@cache: ('d3m.primitives.dsbox.Profiler', 2413907835440300144)
[INFO] Hit@cache: ('d3m.primitives.dsbox.CleaningFeaturizer', 6258074368309404413)
[INFO] Hit@cache: ('d3m.primitives.dsbox.CorexText', -8302676664513276927)
[INFO] Hit@cache: ('d3m.primitives.dsbox.Encoder', 5382648654500800146)
[INFO] Hit@cache: ('d3m.primitives.dsbox.MeanImputation', 5037988628383687394)
[INFO] Hit@cache: ('d3m.primitives.dsbox.DoNothing', 8482212790481700249)
[INFO] Hit@cache: ('d3m.primitives.sklearn_wrap.SKMaxAbsScaler', -1847985700521651967)
[INFO] Hit@cache: ('d3m.primitives.dsbox.DoNothing', 5671574285360898223)
[INFO] Hit@cache: ('d3m.primitives.sklearn_wrap.SKRandomForestClassifier', -5659209142612764594)
/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/pandas/core/indexing.py:621: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item_labels[indexer[info_axis]]] = value
/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/pandas/core/indexing.py:537: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
[INFO] Now are training the pipeline with all dataset and saving the pipeline.
[INFO] Hit@cache: ('d3m.primitives.dsbox.Denormalize', 6387489125683967746)
[INFO] Hit@cache: ('d3m.primitives.datasets.DatasetToDataFrame', -4520796202379798502)
[INFO] Hit@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', 7270534236577981231)
[INFO] Hit@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', 9017461908300853567)
[INFO] Hit@cache: ('d3m.primitives.dsbox.Profiler', -7577823149435098899)
[INFO] Hit@cache: ('d3m.primitives.dsbox.CleaningFeaturizer', -8988122307913698267)
[INFO] Hit@cache: ('d3m.primitives.dsbox.CorexText', 7687565503992275606)
[INFO] Hit@cache: ('d3m.primitives.dsbox.Encoder', 2990934636288289629)
[INFO] Hit@cache: ('d3m.primitives.dsbox.MeanImputation', -8724358386827322438)
[INFO] Push@cache: ('d3m.primitives.dsbox.DoNothing', -4399869666562517803)
[INFO] Push@cache: ('d3m.primitives.sklearn_wrap.SKMaxAbsScaler', -1024010931849144111)
Traceback (most recent call last):
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 546, in evaluate_pipeline
    evaluation_result = self._evaluate(configuration, cache, dump2disk)
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 756, in _evaluate
    fitted_pipeline2.fit(cache=cache, inputs=[self.all_dataset])
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/pipeline/fitted_pipeline.py", line 94, in fit
    self.runtime.fit(**arguments)
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/runtime.py", line 204, in fit
    primitive_arguments
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/runtime.py", line 298, in _primitive_step_fit
    produce_result = model.produce(**produce_params)
  File "/nfs1/dsbox-repo/stan/sklearn-wrap/sklearn_wrap/SKMaxAbsScaler.py", line 118, in produce
    output = clf.fit_transform(sk_inputs)
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/base.py", line 517, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 810, in fit
    return self.partial_fit(X, y)
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 827, in partial_fit
    estimator=self, dtype=FLOAT_DTYPES)
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
    _assert_all_finite(array)
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
[INFO] push@Candidate: (7460577459725286717,None)
Traceback (most recent call last):
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 401, in setup_initial_candidate
    candidate.data.update(result)
TypeError: 'NoneType' object is not iterable
[ERROR] Initial Pipeline failed, Trying a random pipeline ...
kyao commented 6 years ago

A quick fix is to move the CastToType(float) step to before SKPCA, since SKPCA requires all-numerical columns. A sketch of the reordering is below.

But there is a subtle bug that needs to be addressed in the future. Some runs of the same pipeline succeed while others fail: on some runs the Encoder encodes a particular column, and on other runs it does not. I think the difference is that different runs use different subsets of the dataset (or the entire dataset), and those particular columns fall right on the boundary of the rule for being categorical.
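For concreteness, a minimal sketch of the proposed reordering, using the step names from the random-pipeline dump above (the real ordering lives in the DSBox template definitions, so treat this as illustrative only):

    step_order = [
        'denormalize_step',
        'to_dataframe_step',
        'extract_attribute_step',
        'extract_target_step',
        'profiler_step',
        'clean_step',
        'corex_step',
        'encoder_step',
        'impute_step',
        'cast_1_step',    # CastToType(float) now runs before PCA ...
        'dim_red_step',   # ... so SKPCA only ever sees numeric columns
        'scaler_step',
        'model_step',
    ]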

serbanstan commented 6 years ago

I'm guessing your advice is for https://github.com/usc-isi-i2/dsbox-ta2/issues/157

kyao commented 6 years ago

I guess it is for both. The reason all of our pipelines fail for LL0_6332 is that dsbox_generic_steps allows str columns to reach SKPCA.
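For illustration, a minimal standalone reproduction of why a str column breaks PCA (plain scikit-learn; per the traceback above, SKPCA just forwards its inputs to sklearn's PCA.fit):

    import numpy as np
    from sklearn.decomposition import PCA

    # One numeric column plus one string column, as happens when casting is skipped.
    X = np.array([[1.0, 'band'], [2.0, 'noband'], [3.0, 'band']], dtype=object)
    PCA(n_components=1).fit(X)  # raises ValueError: could not convert string to float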

serbanstan commented 6 years ago

Unfortunately, swapping PCA and cast_to_type doesn't solve this issue: SKMaxAbsScaler sometimes breaks with a similar error.

[INFO] Push@cache: ('d3m.primitives.dsbox.Profiler', 2889272456499411198)
shape:  (540, 39)
[INFO] Push@cache: ('d3m.primitives.dsbox.CleaningFeaturizer', -6995438698222877734)
shape:  (540, 40)
[INFO] Push@cache: ('d3m.primitives.dsbox.CorexText', 5801274934117132541)
shape:  (540, 49)
[INFO] Push@cache: ('d3m.primitives.dsbox.Encoder', 2401217275340853352)
shape:  (540, 138)
[INFO] Push@cache: ('d3m.primitives.dsbox.MeanImputation', 655403304700757436)
shape:  (540, 138)
[INFO] Push@cache: ('d3m.primitives.sklearn_wrap.SKMaxAbsScaler', 2018003916215478763)
Traceback (most recent call last):
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 552, in evaluate_pipeline
    evaluation_result = self._evaluate(configuration, cache, dump2disk)
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 762, in _evaluate
    fitted_pipeline2.fit(cache=cache, inputs=[self.all_dataset])
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/pipeline/fitted_pipeline.py", line 94, in fit
    self.runtime.fit(**arguments)
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/runtime.py", line 209, in fit
    primitive_arguments
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/runtime.py", line 303, in _primitive_step_fit
    produce_result = model.produce(**produce_params)
  File "/nfs1/dsbox-repo/stan/sklearn-wrap/sklearn_wrap/SKMaxAbsScaler.py", line 118, in produce
    output = clf.fit_transform(sk_inputs)
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/base.py", line 517, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 810, in fit
    return self.partial_fit(X, y)
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 827, in partial_fit
    estimator=self, dtype=FLOAT_DTYPES)
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
    _assert_all_finite(array)
  File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
kyao commented 6 years ago

Can you print out the metadata associated with the dataframe? You can print the metadata by using dataframe.metadata.pretty_print().

serbanstan commented 6 years ago

I'm attaching a log file with the output from the run. It's too long to post here.

log.txt

This output can be reproduced by adding the following code at line 182 in runtime.py

                try:
                    # Log the step index and the shape of the step's input.
                    print("step: ", i)
                    print("shape: ", primitive_arguments['inputs'].shape)
                    print(primitive_arguments['inputs'].metadata.pretty_print())
                except Exception:
                    # Early steps take a Dataset rather than a DataFrame,
                    # so there is no shape to report.
                    print("shape: N/A")
serbanstan commented 6 years ago

It seems the error stems from the following behavior. Profiler adds NaNs, and no primitive coming after it in the pipeline is able to eliminate them.

[INFO] Now are training the pipeline with all dataset and saving the pipeline.
step:  0
shape: N/A
[INFO] Push@cache: ('d3m.primitives.dsbox.Denormalize', 367880002551074194)
step:  1
shape: N/A
[INFO] Push@cache: ('d3m.primitives.datasets.DatasetToDataFrame', 2027443791237794836)
step:  2
shape:  (540, 41)
nans in df:  False
[INFO] Push@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', 998975731857451977)
step:  3
shape:  (540, 41)
nans in df:  False
[INFO] Push@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', 8174239061367614036)
step:  4
shape:  (540, 39)
nans in df:  False
[INFO] Push@cache: ('d3m.primitives.dsbox.Profiler', -4156860004676617922)
step:  5
shape:  (540, 39)
nans in df:  True
[INFO] Push@cache: ('d3m.primitives.dsbox.CleaningFeaturizer', -286274654852111623)
step:  6
shape:  (540, 40)
nans in df:  True
[INFO] Push@cache: ('d3m.primitives.dsbox.CorexText', 5354787078332680397)
step:  7
shape:  (540, 49)
nans in df:  True
[INFO] Push@cache: ('d3m.primitives.dsbox.Encoder', -5935975047104356091)
step:  8
shape:  (540, 138)
nans in df:  True
[INFO] Push@cache: ('d3m.primitives.dsbox.MeanImputation', 1477686608655536678)
step:  9
shape:  (540, 138)
nans in df:  True
[INFO] Push@cache: ('d3m.primitives.sklearn_wrap.SKMaxAbsScaler', 8867867557532755693)
Traceback (most recent call last):
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 552, in evaluate_pipeline
    evaluation_result = self._evaluate(configuration, cache, dump2disk)
kyao commented 6 years ago

Can you point me to your output/log directory?

serbanstan commented 6 years ago

/nas/home/stan/dsbox/runs2/output-ll0/LL0_6332_cylinder_bands

kyao commented 6 years ago

It's a bug in the profiler. It determines that the caliper column is a float and sets the Float semantic type, but in the process it removes the column's Attribute semantic type:

   (__ALL_ELEMENTS__, 22)
  Metadata:
   {
    "name": "caliper",
    "structural_type": "float",
    "semantic_types": [
     "http://schema.org/Float"
    ],
   }

Without the Attribute semantic type, the imputer will not operate on that column. It's too late to change the profiler, but a workaround is to modify runtime.py: whenever we see the Float semantic type, make sure the column also has the Attribute semantic type:

    "semantic_types": [
     "http://schema.org/Float",
     "https://metadata.datadrivendiscovery.org/types/Attribute"
    ]
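A minimal sketch of that runtime workaround (assuming the d3m metadata query/update API; the actual patch belongs in runtime.py):

    from d3m.metadata import base as metadata_base

    FLOAT = 'http://schema.org/Float'
    ATTRIBUTE = 'https://metadata.datadrivendiscovery.org/types/Attribute'

    def ensure_float_columns_are_attributes(df):
        # For every column typed Float but missing Attribute, add Attribute
        # back so downstream primitives (e.g. the imputer) will process it.
        for col in range(df.shape[1]):
            selector = (metadata_base.ALL_ELEMENTS, col)
            types = df.metadata.query(selector).get('semantic_types', ())
            if FLOAT in types and ATTRIBUTE not in types:
                df.metadata = df.metadata.update(
                    selector, {'semantic_types': types + (ATTRIBUTE,)})
        return df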
RqS commented 6 years ago

That is my bug in the profiler. Fixed in the profiler: https://github.com/usc-isi-i2/dsbox-cleaning/blob/961d92886916dfbc0a0e1bfd2a51e9c4677301f7/dsbox/datapreprocessing/cleaner/data_profile.py#L345

I am working on adding the workaround.

RqS commented 6 years ago

Some columns fall right on the boundary of the rule for detecting categorical data; we need to revisit that threshold later.
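To illustrate the boundary effect (hypothetical rule and threshold; the real check lives in the dsbox-cleaning profiler):

    def looks_categorical(column, threshold=0.1):
        # Hypothetical rule: few distinct values relative to the number of
        # rows -> categorical. A column whose ratio sits near the threshold
        # flips between categorical and numeric across dataset subsets,
        # which is why the Encoder behaves differently from run to run.
        return column.nunique() / len(column) < threshold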

For the bug in the profiler, I added a workaround in the runtime.