usc-isi-i2 / dsbox-ta2

The DSBox TA2 component
MIT License

Profiler breaks on LL0_690 #83

Closed by serbanstan 6 years ago

serbanstan commented 6 years ago

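Running python ta2-search /nas/home/stan/dsbox/runs2/config-ll0/LL0_690_visualizing_galaxy_config.json breaks the profiler: it raises TypeError: must be str, not int in date_featurizer_org.py, and the search then aborts with ValueError: Invalid initial candidate. The chosen template is Default_regression_template.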

(dsbox-devel-710) [stan@dsbox01 python]$ python ta2-search /nas/home/stan/dsbox/runs2/config-ll0/LL0_690_visualizing_galaxy_config.json 
Namespace(configuration_file='/nas/home/stan/dsbox/runs2/config-ll0/LL0_690_visualizing_galaxy_config.json', cpus=-1, debug=False, output_prefix=None, timeout=-1)
Using configuation:
{'cpus': '10',
 'dataset_schema': '/nfs1/dsbox-repo/data/datasets/training_datasets/LL0/LL0_690_visualizing_galaxy/LL0_690_visualizing_galaxy_dataset/datasetDoc.json',
 'executables_root': '/nfs1/dsbox-repo/stan/dsbox-ta2/python/output/LL0_690_visualizing_galaxy/executables',
 'pipeline_logs_root': '/nfs1/dsbox-repo/stan/dsbox-ta2/python/output/LL0_690_visualizing_galaxy/logs',
 'problem_root': '/nfs1/dsbox-repo/data/datasets/training_datasets/LL0/LL0_690_visualizing_galaxy/LL0_690_visualizing_galaxy_problem',
 'problem_schema': '/nfs1/dsbox-repo/data/datasets/training_datasets/LL0/LL0_690_visualizing_galaxy/LL0_690_visualizing_galaxy_problem/problemDoc.json',
 'ram': '10Gi',
 'saved_pipeline_ID': '',
 'saving_folder_loc': '/nfs1/dsbox-repo/stan/dsbox-ta2/python/output/LL0_690_visualizing_galaxy',
 'temp_storage_root': '/nfs1/dsbox-repo/stan/dsbox-ta2/python/output/LL0_690_visualizing_galaxy/temp',
 'timeout': 9,
 'training_data_root': '/nfs1/dsbox-repo/data/datasets/training_datasets/LL0/LL0_690_visualizing_galaxy/LL0_690_visualizing_galaxy_dataset'}
[INFO] No test data config found! Will split the data.
[INFO] - dsbox.controller.controller - Top level output directory: /nfs1/dsbox-repo/stan/dsbox-ta2/python/output/LL0_690_visualizing_galaxy
[INFO] Succesfully parsed test data
{'structural_type': <class 'd3m.container.pandas.DataFrame'>, 'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Table', 'https://metadata.datadrivendiscovery.org/types/DatasetEntryPoint'), 'dimension': {'name': 'rows', 'semantic_types': ('https://metadata.datadrivendiscovery.org/types/TabularRow',), 'length': 223}}
{'dimension': <FrozenOrderedDict OrderedDict([('name', 'rows'), ('semantic_types', ('https://metadata.datadrivendiscovery.org/types/TabularRow',)), ('length', 223)])>,
 'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Table',
                    'https://metadata.datadrivendiscovery.org/types/DatasetEntryPoint'),
 'structural_type': <class 'd3m.container.pandas.DataFrame'>}
{'structural_type': <class 'd3m.container.pandas.DataFrame'>, 'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Table', 'https://metadata.datadrivendiscovery.org/types/DatasetEntryPoint'), 'dimension': {'name': 'rows', 'semantic_types': ('https://metadata.datadrivendiscovery.org/types/TabularRow',), 'length': 100}}
{'dimension': <FrozenOrderedDict OrderedDict([('name', 'rows'), ('semantic_types', ('https://metadata.datadrivendiscovery.org/types/TabularRow',)), ('length', 100)])>,
 'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Table',
                    'https://metadata.datadrivendiscovery.org/types/DatasetEntryPoint'),
 'structural_type': <class 'd3m.container.pandas.DataFrame'>}
[INFO] Template choices:
Template ' Default_regression_template ' has been added to template base.
[INFO] Template 0:Default_regression_template Selected. UCT:[100.0]
[INFO] Worker started, id: <_MainProcess(MainProcess, started)>
/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
[INFO] Push@cache: ('d3m.primitives.dsbox.Denormalize', 1691920072713186883)
/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
[INFO] Push@cache: ('d3m.primitives.datasets.DatasetToDataFrame', 1691920072713186883)
[INFO] Push@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', 4720874274637968185)
[INFO] Push@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', -4444265286283118903)
[INFO] Push@cache: ('d3m.primitives.dsbox.Profiler', 7282301522344053085)
/nfs1/dsbox-repo/stan/dsbox-profiling/dsbox/datapreprocessing/profiler/dependencies/date_extractor.py:408: UserWarning: DateExtractor: Failed to set timezone as America/Los_Angeles. Catch offset must be a timedelta representing a whole number of minutes, not datetime.timedelta(-1, 58022).
  warn('DateExtractor: Failed to set timezone as ' + str(self.default_tz) + '. Catch ' + str(e))
Traceback (most recent call last):
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 420, in evaluate_pipeline
    evaluation_result = self._evaluate(configuration, cache)
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 439, in _evaluate
    fitted_pipeline.fit(cache=cache, inputs=[self.train_dataset])
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/pipeline/fitted_pipeline.py", line 92, in fit
    self.runtime.fit(**arguments)
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/runtime.py", line 193, in fit
    primitive_arguments
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/runtime.py", line 281, in _primitive_step_fit
    produce_result = model.produce(**produce_params)
  File "/nfs1/dsbox-repo/stan/dsbox-profiling/dsbox/datapreprocessing/profiler/data_profile.py", line 175, in produce
    cols = self._DateFeaturizer.detect_date_columns(self._sample_df)
  File "/nfs1/dsbox-repo/stan/dsbox-profiling/dsbox/datapreprocessing/profiler/date_featurizer_org.py", line 99, in detect_date_columns
    if self._parse_column(sampled_df, idx) is not None:
  File "/nfs1/dsbox-repo/stan/dsbox-profiling/dsbox/datapreprocessing/profiler/date_featurizer_org.py", line 302, in _parse_column
    warn("Warning: multiple dates detected in column: " + idx)
TypeError: must be str, not int
Traceback (most recent call last):
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 310, in setup_initial_candidate
    candidate.data.update(result)
TypeError: 'NoneType' object is not iterable
[ERROR] Initial Pipeline failed, Trying a random pipeline ...
{'clean_step': {'hyperparameters': {},
                'primitive': 'd3m.primitives.dsbox.CleaningFeaturizer'},
 'corex_step': {'hyperparameters': {},
                'primitive': 'd3m.primitives.dsbox.CorexText'},
 'denormalize_step': {'hyperparameters': {},
                      'primitive': 'd3m.primitives.dsbox.Denormalize'},
 'encoder_step': {'hyperparameters': {},
                  'primitive': 'd3m.primitives.dsbox.Encoder'},
 'extract_attribute_step': {'hyperparameters': {'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Attribute',)},
                            'primitive': 'd3m.primitives.data.ExtractColumnsBySemanticTypes'},
 'extract_target_step': {'hyperparameters': {'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Target',
                                                                'https://metadata.datadrivendiscovery.org/types/SuggestedTarget')},
                         'primitive': 'd3m.primitives.data.ExtractColumnsBySemanticTypes'},
 'impute_step': {'hyperparameters': {},
                 'primitive': 'd3m.primitives.sklearn_wrap.SKImputer'},
 'model_step': {'hyperparameters': {},
                'primitive': 'd3m.primitives.sklearn_wrap.SKRidge'},
 'profiler_step': {'hyperparameters': {},
                   'primitive': 'd3m.primitives.dsbox.Profiler'},
 'to_dataframe_step': {'hyperparameters': {},
                       'primitive': 'd3m.primitives.datasets.DatasetToDataFrame'}}
--------------------
[INFO] Worker started, id: <_MainProcess(MainProcess, started)>
[INFO] Hit@cache: ('d3m.primitives.dsbox.Denormalize', 1691920072713186883)
[INFO] Hit@cache: ('d3m.primitives.datasets.DatasetToDataFrame', 1691920072713186883)
[INFO] Hit@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', 4720874274637968185)
[INFO] Hit@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', -4444265286283118903)
[INFO] Push@cache: ('d3m.primitives.dsbox.Profiler', 7282301522344053085)
Traceback (most recent call last):
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 420, in evaluate_pipeline
    evaluation_result = self._evaluate(configuration, cache)
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 439, in _evaluate
    fitted_pipeline.fit(cache=cache, inputs=[self.train_dataset])
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/pipeline/fitted_pipeline.py", line 92, in fit
    self.runtime.fit(**arguments)
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/runtime.py", line 193, in fit
    primitive_arguments
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/runtime.py", line 281, in _primitive_step_fit
    produce_result = model.produce(**produce_params)
  File "/nfs1/dsbox-repo/stan/dsbox-profiling/dsbox/datapreprocessing/profiler/data_profile.py", line 175, in produce
    cols = self._DateFeaturizer.detect_date_columns(self._sample_df)
  File "/nfs1/dsbox-repo/stan/dsbox-profiling/dsbox/datapreprocessing/profiler/date_featurizer_org.py", line 99, in detect_date_columns
    if self._parse_column(sampled_df, idx) is not None:
  File "/nfs1/dsbox-repo/stan/dsbox-profiling/dsbox/datapreprocessing/profiler/date_featurizer_org.py", line 302, in _parse_column
    warn("Warning: multiple dates detected in column: " + idx)
TypeError: must be str, not int
Traceback (most recent call last):
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 310, in setup_initial_candidate
    candidate.data.update(result)
TypeError: 'NoneType' object is not iterable
[ERROR] Initial Pipeline failed, Trying a random pipeline ...
{'clean_step': {'hyperparameters': {},
                'primitive': 'd3m.primitives.dsbox.CleaningFeaturizer'},
 'corex_step': {'hyperparameters': {},
                'primitive': 'd3m.primitives.dsbox.CorexText'},
 'denormalize_step': {'hyperparameters': {},
                      'primitive': 'd3m.primitives.dsbox.Denormalize'},
 'encoder_step': {'hyperparameters': {},
                  'primitive': 'd3m.primitives.dsbox.Encoder'},
 'extract_attribute_step': {'hyperparameters': {'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Attribute',)},
                            'primitive': 'd3m.primitives.data.ExtractColumnsBySemanticTypes'},
 'extract_target_step': {'hyperparameters': {'semantic_types': ('https://metadata.datadrivendiscovery.org/types/Target',
                                                                'https://metadata.datadrivendiscovery.org/types/SuggestedTarget')},
                         'primitive': 'd3m.primitives.data.ExtractColumnsBySemanticTypes'},
 'impute_step': {'hyperparameters': {},
                 'primitive': 'd3m.primitives.sklearn_wrap.SKImputer'},
 'model_step': {'hyperparameters': {},
                'primitive': 'd3m.primitives.sklearn_wrap.SKRidge'},
 'profiler_step': {'hyperparameters': {},
                   'primitive': 'd3m.primitives.dsbox.Profiler'},
 'to_dataframe_step': {'hyperparameters': {},
                       'primitive': 'd3m.primitives.datasets.DatasetToDataFrame'}}
--------------------
Traceback (most recent call last):
  File "ta2-search", line 141, in <module>
    result = main(args)
  File "ta2-search", line 110, in main
    status = controller.train()
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/controller/controller.py", line 535, in train
    template, candidate=self.exec_history.iloc[idx]['candidate'], cache=cache)
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/controller/controller.py", line 371, in search_template
    candidate_in=candidate, cache=cache)
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 145, in search_one_iter
    self.setup_initial_candidate(candidate_in, cache)
  File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 319, in setup_initial_candidate
    raise ValueError("Invalid initial candidate")
ValueError: Invalid initial candidate
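
The second traceback in each attempt (TypeError: 'NoneType' object is not iterable) looks like a follow-on failure rather than a separate bug: candidate.data.update(result) is called with result set to None. The sketch below reproduces that behavior with a plain dict and shows one possible guard. It assumes evaluate_pipeline returns None after it logs the profiler exception, which is an inference from the log and not confirmed here. The function names are hypothetical and are not part of dsbox-ta2.

```python
def unguarded_update_error(candidate_data, result):
    """Return the TypeError message raised by dict.update, or None if it succeeds."""
    try:
        candidate_data.update(result)
    except TypeError as exc:
        return str(exc)
    return None


def update_candidate(candidate_data, result):
    """Hypothetical guard: only merge the result when evaluation produced one."""
    if result is None:
        # Evaluation failed upstream; leave the candidate untouched.
        return False
    candidate_data.update(result)
    return True


if __name__ == "__main__":
    # Passing None to dict.update raises TypeError, as in the log.
    print(unguarded_update_error({}, None))
    # The guarded version reports the failure without raising.
    print(update_candidate({}, None))
```

A guard like this would only make the failure easier to read. The search would still have no valid initial candidate until the profiler error itself is fixed.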

serbanstan commented 6 years ago

Fixed by a new commit to dsbox-cleaning.
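
For reference, the first traceback ends in a string concatenation with an integer: warn("Warning: multiple dates detected in column: " + idx) fails when idx is an int column label. The fixing commit is not shown in this thread, so the following is only a minimal sketch of the failure and one likely remedy. The function names are hypothetical, and the real fix may differ.

```python
import warnings


def warn_multiple_dates_buggy(idx):
    # Mirrors the failing line from the traceback: str + int raises TypeError
    # when idx is an integer column label.
    warnings.warn("Warning: multiple dates detected in column: " + idx)


def warn_multiple_dates_fixed(idx):
    # Assumed fix: format the label so both int and str labels work.
    warnings.warn(f"Warning: multiple dates detected in column: {idx}")


def raises_type_error(func, idx):
    """Return True if calling func(idx) raises TypeError."""
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")  # keep the demo output quiet
            func(idx)
    except TypeError:
        return True
    return False


if __name__ == "__main__":
    print(raises_type_error(warn_multiple_dates_buggy, 3))  # integer label
    print(raises_type_error(warn_multiple_dates_fixed, 3))
```

The exact TypeError wording depends on the Python version. The "must be str, not int" message in the log matches the Python 3.6 environment shown in the paths.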