Corex bug

kyao commented 6 years ago
Namespace(configuration_file='/dsbox_efs/config/seed-41/partition-38/27_wordLevels/search_config.json', cpus=10, debug=False, output_prefix=None, timeout=55)
Using configuation:
{'cpus': 10,
 'dataset_schema': '/dsbox_efs/dataset/seed_datasets_current/27_wordLevels/27_wordLevels_dataset/datasetDoc.json',
 'executables_root': '/dsbox_efs/runs/seed/27_wordLevels/executables',
 'pipeline_logs_root': '/dsbox_efs/runs/seed/27_wordLevels/pipelines',
 'problem_root': '/dsbox_efs/dataset/seed_datasets_current/27_wordLevels/27_wordLevels_problem',
 'problem_schema': '/dsbox_efs/dataset/seed_datasets_current/27_wordLevels/27_wordLevels_problem/problemDoc.json',
 'temp_storage_root': '/dsbox_efs/runs/seed/27_wordLevels/supporting_files',
 'timeout': 55,
 'training_data_root': '/dsbox_efs/dataset/seed_datasets_current/27_wordLevels/27_wordLevels_dataset',
 'user_problems_root': '/dsbox_efs/runs/seed/27_wordLevels/user_problems'}
[INFO] No test data config found! Will split the data.
[INFO] - dsbox.controller.controller - Top level output directory: /dsbox_efs/runs/seed/27_wordLevels
[INFO] Template choices:
Template ' SRI_Mean_Baseline_Template ' has been added to template base.
Template ' Default_classification_template ' has been added to template base.
Template ' random_forest_template ' has been added to template base.
Template ' DSBox_classification_template ' has been added to template base.
Template ' TA1Classification_3 ' has been added to template base.
Template ' MuxinTA1ClassificationTemplate1 ' has been added to template base.
Template ' TA1_classification_template_1 ' has been added to template base.
[INFO] - dsbox.controller.controller - [INFO] Template 0:SRI_Mean_Baseline_Template Selected. UCT:[None, None, None, None, None, None, None]
[INFO] Worker started, id: <_MainProcess(MainProcess, started)> , True
/usr/local/lib/python3.6/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
[INFO] Will use normal train-test mode ( n = 1 ) to choose best primitives.
[INFO] Push@cache: ('d3m.primitives.sri.baseline.MeanBaseline', 330104127224707127)
[INFO] Testing finish.!!!
[INFO] Now in normal mode, will add extra train with train_dataset1
[INFO] Hit@cache: ('d3m.primitives.sri.baseline.MeanBaseline', 330104127224707127)
[INFO] Now are training the pipeline with all dataset and saving the pipeline.
[INFO] Hit@cache: ('d3m.primitives.sri.baseline.MeanBaseline', 330104127224707127)
[INFO] push@Candidate: (-9117092613766624861,f05092f9-dac5-4619-9516-3e3e82793e75)
[INFO] - dsbox.controller.controller - ******************
[INFO] Writing results
{'cross_validation_metrics': [],
 'fitted_pipeline': <dsbox.pipeline.fitted_pipeline.FittedPipeline object at 0x7ff351ee26d8>,
 'test_metrics': [{'metric': 'f1Macro', 'value': 0.056815859390966686}],
 'total_runtime': 222.65693306922913,
 'training_metrics': [{'metric': 'f1Macro', 'value': 0.05663378937321031}]}
[INFO] - dsbox.controller.controller - {'fitted_pipeline': <dsbox.pipeline.fitted_pipeline.FittedPipeline object at 0x7ff351ee26d8>, 'training_metrics': [{'metric': 'f1Macro', 'value': 0.05663378937321031}], 'cross_validation_metrics': [], 'test_metrics': [{'metric': 'f1Macro', 'value': 0.056815859390966686}], 'total_runtime': 222.65693306922913} 0.056815859390966686
[INFO] - dsbox.controller.controller - Training f1Macro = 0.05663378937321031
[INFO] - dsbox.controller.controller - Validation f1Macro = 0.056815859390966686
[INFO] - dsbox.controller.controller - ******************
[INFO] Saving training results in /dsbox_efs/runs/seed/27_wordLevels.txt
[INFO] - dsbox.controller.controller - [INFO] report: 0.056815859390966686
[INFO] - dsbox.controller.controller - [INFO] UCT updated: [35.25872776860654, 122.42876838666079, 122.42876838666079, 122.42876838666079, 122.42876838666079, 122.42876838666079, 122.42876838666079]
[INFO] - dsbox.controller.controller - [INFO] cache size: 1, candidates: 1
[INFO] - dsbox.controller.controller - [INFO] Template 1:Default_classification_template Selected. UCT:[35.25872776860654, 122.42876838666079, 122.42876838666079, 122.42876838666079, 122.42876838666079, 122.42876838666079, 122.42876838666079]
[INFO] Worker started, id: <_MainProcess(MainProcess, started)> , True
[INFO] Will use cross validation( n = 10 ) to choose best primitives.
[INFO] Push@cache: ('d3m.primitives.dsbox.Denormalize', 330104127224707127)
[INFO] Push@cache: ('d3m.primitives.datasets.DatasetToDataFrame', 330104127224707127)
[INFO] Push@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', 3499164951153174769)
[INFO] Push@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', 3499164951153174769)
[INFO] Push@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', 8575630062044293167)
[INFO] Push@cache: ('d3m.primitives.dsbox.Profiler', -2508346656682011289)
[INFO] Push@cache: ('d3m.primitives.dsbox.CleaningFeaturizer', -2508346656682011289)
[INFO] Push@cache: ('d3m.primitives.dsbox.CorexText', 7054889463992538254)
Traceback (most recent call last):
  File "/user_opt/dsbox/dsbox-ta2/python/dsbox/template/search.py", line 523, in evaluate_pipeline
    evaluation_result = self._evaluate(configuration, cache, dump2disk)
  File "/user_opt/dsbox/dsbox-ta2/python/dsbox/template/search.py", line 546, in _evaluate
    fitted_pipeline.fit(cache=cache, inputs=[self.train_dataset1])
  File "/user_opt/dsbox/dsbox-ta2/python/dsbox/pipeline/fitted_pipeline.py", line 94, in fit
    self.runtime.fit(**arguments)
  File "/user_opt/dsbox/dsbox-ta2/python/dsbox/template/runtime.py", line 195, in fit
    primitive_arguments
  File "/user_opt/dsbox/dsbox-ta2/python/dsbox/template/runtime.py", line 281, in _primitive_step_fit
    model.fit()
  File "/src/dsbox-corex/corex_text.py", line 176, in fit
    bow = self.bow.fit_transform(map(self._get_ngrams, concat_cols.ravel()))
  File "/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py", line 1381, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py", line 890, in fit_transform
    max_features)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py", line 771, in _limit_features
    raise ValueError("After pruning, no terms remain. Try a lower"
ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.
serbanstan commented 6 years ago
This error can be avoided by setting the min_df hyperparameter to 0 (the default is 0.02).
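For anyone wondering why the default can remove every term, here is a rough sketch of document-frequency pruning. It is a simplified model written for illustration, not scikit-learn's actual code, and the corpus of 100 distinct one-word documents is a made-up assumption about what the text column might look like. The function name surviving_terms is hypothetical.

```python
from collections import Counter


def surviving_terms(docs, min_df=0.02, max_df=1.0):
    """Simplified model of document-frequency pruning (illustrative only)."""
    n_docs = len(docs)
    # For each term, count how many documents contain it at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    # A float bound is a fraction of the documents; an int is an absolute count.
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)


# Hypothetical corpus: 100 documents, each one a distinct word.
docs = [f"word{i}" for i in range(100)]

# min_df=0.02 means a term must occur in at least 2 of the 100 documents,
# but every term here occurs in exactly 1, so nothing is kept.
default_kept = surviving_terms(docs, min_df=0.02)

# min_df=0.0 removes the lower bound, so every term is kept.
relaxed_kept = surviving_terms(docs, min_df=0.0)

print(len(default_kept), len(relaxed_kept))
```

With an empty result like the first one, there is nothing left to build a vocabulary from, which would match the "After pruning, no terms remain" message in the traceback.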
serbanstan commented 6 years ago
I also added a try/catch in the primitive code and updated the corex repository. Should I create a merge request? If not, feel free to close this issue.
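The actual change to the primitive is not shown in this thread. As a sketch only, a guard of this kind might look like the following, assuming the fallback is to refit with no lower document-frequency bound. The helper name fit_bow_with_fallback and the test corpus are hypothetical; TfidfVectorizer and its min_df parameter are the scikit-learn ones that appear in the traceback.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def fit_bow_with_fallback(docs, min_df=0.02):
    """Fit TF-IDF; if fitting raises ValueError, retry without a lower bound."""
    try:
        vectorizer = TfidfVectorizer(min_df=min_df)
        return vectorizer, vectorizer.fit_transform(docs)
    except ValueError:
        # Assumed fallback: min_df=0.0 disables the lower document-frequency cut.
        # This catches any ValueError from fitting, not only the pruning one;
        # if the retry fails too, its exception propagates to the caller.
        vectorizer = TfidfVectorizer(min_df=0.0)
        return vectorizer, vectorizer.fit_transform(docs)


# Hypothetical corpus in which every term appears in exactly one document,
# so the first attempt with min_df=0.02 prunes the whole vocabulary.
docs = [f"word{i}" for i in range(100)]
vectorizer, bow = fit_bow_with_fallback(docs)
print(bow.shape)
```

Retrying with a relaxed threshold keeps the pipeline running on sparse text columns, at the cost of a larger vocabulary than the default would give.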
serbanstan commented 6 years ago
Adding to primitive repo and closing this issue.
usc-isi-i2 / dsbox-ta2

Corex bug #115