usc-isi-i2 / dsbox-ta2

The DSBox TA2 component
MIT License

Corex problem #125

Closed: kyao closed this issue 6 years ago

kyao commented 6 years ago

See /dsbox_efs/runs/seed-2018-07-26-02:04/uu2_gp_hyperparameter_estimation/supporting_files/logs/out.txt

[INFO] - dsbox.controller.controller - [INFO] report: -inf
[INFO] - dsbox.controller.controller - [INFO] UCT updated: [15.410311194367837, 59.01650930472471, 150.70969363048872, 150.70969363048872]
[INFO] - dsbox.controller.controller - [INFO] cache size: 7, candidates: 2
[INFO] - dsbox.controller.controller - [INFO] Template 2:DSBox_regression_template Selected. UCT:[15.410311194367837, 59.01650930472471, 150.70969363048872, 150.70969363048872]
[INFO] Worker started, id: <_MainProcess(MainProcess, started)> , True
[INFO] Will use cross validation( n = 10 ) to choose best primitives.
[INFO] Hit@cache: ('d3m.primitives.dsbox.Denormalize', -6233746033504171510)
[INFO] Hit@cache: ('d3m.primitives.datasets.DatasetToDataFrame', -6233746033504171510)
[INFO] Hit@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', -5487362993119773801)
[INFO] Hit@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', -7790166322985162097)
[INFO] Hit@cache: ('d3m.primitives.dsbox.Profiler', -442430666729700080)
[INFO] Hit@cache: ('d3m.primitives.dsbox.CleaningFeaturizer', -442430666729700080)
[INFO] Push@cache: ('d3m.primitives.dsbox.CorexText', 7673547741426200079)
Traceback (most recent call last):
  File "/user_opt/dsbox/dsbox-ta2/python/dsbox/template/search.py", line 523, in evaluate_pipeline
    evaluation_result = self._evaluate(configuration, cache, dump2disk)
  File "/user_opt/dsbox/dsbox-ta2/python/dsbox/template/search.py", line 546, in _evaluate
    fitted_pipeline.fit(cache=cache, inputs=[self.train_dataset1])
  File "/user_opt/dsbox/dsbox-ta2/python/dsbox/pipeline/fitted_pipeline.py", line 94, in fit
    self.runtime.fit(**arguments)
  File "/user_opt/dsbox/dsbox-ta2/python/dsbox/template/runtime.py", line 195, in fit
    primitive_arguments
  File "/user_opt/dsbox/dsbox-ta2/python/dsbox/template/runtime.py", line 281, in _primitive_step_fit
    model.fit()
  File "/src/dsbox-corex/corex_text.py", line 176, in fit
    bow = self.bow.fit_transform(map(self._get_ngrams, concat_cols.ravel()))
  File "/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py", line 1381, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py", line 890, in fit_transform
    max_features)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py", line 771, in _limit_features
    raise ValueError("After pruning, no terms remain. Try a lower"
ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.
[INFO] push@Candidate: (-7762801795229629775,None)
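The ValueError above can be reproduced in isolation: when `min_df` pruning removes every term from the vocabulary, scikit-learn's TfidfVectorizer raises exactly this error. A minimal sketch (the documents and `min_df` value here are illustrative, not taken from the uu2 run):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three documents with no shared terms: with min_df=2, every term appears
# in fewer than two documents, so all terms are pruned and fit_transform
# raises the ValueError seen in the log above.
docs = ["alpha beta", "gamma delta", "epsilon zeta"]
vec = TfidfVectorizer(min_df=2)

try:
    vec.fit_transform(docs)
except ValueError as err:
    print(err)
```

This matches the failure mode suggested below: if CorEx is fed punctuation-split filename tokens that are nearly all unique, document-frequency pruning can easily empty the vocabulary.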

And, from the same run:

[INFO] - dsbox.controller.controller - [INFO] report: -inf
[INFO] - dsbox.controller.controller - [INFO] UCT updated: [16.21413539271983, 60.07383045417889, 49.50820764525503, 156.49570458608625]
[INFO] - dsbox.controller.controller - [INFO] cache size: 7, candidates: 4
[INFO] - dsbox.controller.controller - [INFO] Template 3:UU3_Test_Template Selected. UCT:[16.21413539271983, 60.07383045417889, 49.50820764525503, 156.49570458608625]
[INFO] Worker started, id: <_MainProcess(MainProcess, started)> , True
[INFO] Will use normal train-test mode ( n = 1 ) to choose best primitives.
[INFO] Push@cache: ('d3m.primitives.dsbox.MultiTableFeaturization', -6233746033504171510)
Traceback (most recent call last):
  File "/user_opt/dsbox/dsbox-ta2/python/dsbox/template/search.py", line 523, in evaluate_pipeline
    evaluation_result = self._evaluate(configuration, cache, dump2disk)
  File "/user_opt/dsbox/dsbox-ta2/python/dsbox/template/search.py", line 579, in _evaluate
    fitted_pipeline.fit(cache=cache, inputs=[self.train_dataset2[each_repeat]])
  File "/user_opt/dsbox/dsbox-ta2/python/dsbox/pipeline/fitted_pipeline.py", line 94, in fit
    self.runtime.fit(**arguments)
  File "/user_opt/dsbox/dsbox-ta2/python/dsbox/template/runtime.py", line 195, in fit
    primitive_arguments
  File "/user_opt/dsbox/dsbox-ta2/python/dsbox/template/runtime.py", line 284, in _primitive_step_fit
    produce_result = model.produce(**produce_params)
  File "/src/dsbox-featurizer/dsbox/datapreprocessing/featurizer/multiTable/multi_table_featurizer.py", line 67, in produce
    big_table = self._core(inputs)
  File "/src/dsbox-featurizer/dsbox/datapreprocessing/featurizer/multiTable/multi_table_featurizer.py", line 113, in _core
    target_column_name = column_metadata['foreign_key']['resource_id'] + "_" + column_metadata['foreign_key']['column_name']
  File "/usr/local/lib/python3.6/dist-packages/frozendict/__init__.py", line 29, in __getitem__
    return self._dict[key]
KeyError: 'column_name'
[INFO] push@Candidate: (-1346521914341251977,None)
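The KeyError in the second traceback indicates that this column's `foreign_key` metadata has no `'column_name'` entry. A hedged sketch of a more defensive lookup (the helper name is hypothetical, and the fallback to `'column_index'` is an assumption about how the metadata may reference columns by position instead of by name):

```python
# Hypothetical defensive variant of the lookup that failed above. Only
# .get() is used, so it also works on frozendict metadata.
def foreign_key_target(column_metadata):
    fk = column_metadata.get('foreign_key', {})
    resource_id = fk.get('resource_id')
    # Assumption: fall back to 'column_index' when 'column_name' is absent.
    column = fk.get('column_name', fk.get('column_index'))
    if resource_id is None or column is None:
        return None  # incomplete foreign-key metadata; let the caller decide
    return "{}_{}".format(resource_id, column)

print(foreign_key_target({'foreign_key': {'resource_id': '0',
                                          'column_name': 'gpDataFile'}}))
```

Returning `None` rather than raising lets the caller skip or log the column instead of aborting the whole pipeline evaluation.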
serbanstan commented 6 years ago

What dataset is this run on? The log file path seems to be empty.

kyao commented 6 years ago

uu2_gp_hyperparameter_estimation

serbanstan commented 6 years ago

This problem might not stem from CorEx at all.

The uu2 dataset looks something like this:

d3mIndex,gpDataFile,amplitude,lengthscale
0,train_data_934.csv,0.6115757969678771,2.2957860332947786
1,train_data_935.csv,0.026343424234522232,0.6041732289631595
2,train_data_936.csv,0.15260382863242258,1.6483227666863358
3,train_data_937.csv,1.1312855843919003,2.70460765772802
4,train_data_938.csv,1.2752346828569412,0.7611034560553084

Each CSV file looks like this:

x,y
2.0456766623818723,0.6391782096512566
-1.8466392232763873,0.6184222618837352
2.6007827213613983,0.794930515235289
-7.671741163940858,1.5133898945628221
-1.1978984838353632,0.2407517958498579
5.686026761365657,0.1456890019598691
-7.774695501783108,1.6665431334349772

Going through a pipeline that contains

Denormalize → DatasetToDataFrame → ExtractColumnsBySemanticTypes → Profiler → CleaningFeaturizer

CorEx gets an input looking like this:

0    934.csv 719128702812473 7254263258353223 93970...
1    935.csv 967572757398724 3424373641006067 31385...
2    936.csv 655396850273306 9111688949207661 80250...
3    937.csv 937281788646402 101637380872502 624434...
4    938.csv 181595581405846 5033496891137835 70715...

with metadata:

<FrozenOrderedDict OrderedDict([('structural_type', <class 'str'>), ('name', 'filename'), ('location_base_uris', ('file:///nfs1/dsbox-repo/data/datasets/seed_datasets_current/uu2_gp_hyperparameter_estimation/uu2_gp_hyperparameter_estimation_dataset/tables/gp_data_tables/',)), ('media_types', ('text/csv',)), ('semantic_types', ('https://metadata.datadrivendiscovery.org/types/FileName', 'https://metadata.datadrivendiscovery.org/types/Table', 'https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/CanBeSplitByPunctuation')), ('most_common_tokens', (<FrozenOrderedDict OrderedDict([('name', 'train_data_0.csv'), ('count', 1)])>, <FrozenOrderedDict OrderedDict([('name', 'train_data_1.csv'), ('count', 1)])>, <FrozenOrderedDict OrderedDict([('name', 'train_data_10.csv'), ('count', 1)])>, <FrozenOrderedDict OrderedDict([('name', 'train_data_100.csv'), ('count', 1)])>, <FrozenOrderedDict OrderedDict([('name', 'train_data_101.csv'), ('count', 1)])>, <FrozenOrderedDict OrderedDict([('name', 'train_data_102.csv'), ('count', 1)])>, <FrozenOrderedDict OrderedDict([('name', 'train_data_103.csv'), ('count', 1)])>, <FrozenOrderedDict OrderedDict([('name', 'train_data_104.csv'), ('count', 1)])>, <FrozenOrderedDict OrderedDict([('name', 'train_data_105.csv'), ('count', 1)])>, <FrozenOrderedDict OrderedDict([('name', 'train_data_106.csv'), ('count', 1)])>)), ('number_of_tokens_containing_numeric_char', 1000), ('ratio_of_tokens_containing_numeric_char', 1.0), ('number_of_values_containing_numeric_char', 1000), ('ratio_of_values_containing_numeric_char', 1.0)])>
<FrozenOrderedDict OrderedDict([('structural_type', <class 'str'>), ('name', '1_punc_0'), ('semantic_types', ('https://metadata.datadrivendiscovery.org/types/CategoricalData', 'https://metadata.datadrivendiscovery.org/types/Attribute'))])>
<FrozenOrderedDict OrderedDict([('structural_type', <class 'str'>), ('name', '1_punc_1'), ('semantic_types', ('https://metadata.datadrivendiscovery.org/types/CategoricalData', 'https://metadata.datadrivendiscovery.org/types/Attribute'))])>
<FrozenOrderedDict OrderedDict([('structural_type', <class 'str'>), ('name', '1_punc_2'), ('semantic_types', ('http://schema.org/Text', 'https://metadata.datadrivendiscovery.org/types/Attribute'))])>

In other words, our standard template 'cleans' the dataset filenames (splitting them on punctuation) and feeds the resulting tokens into CorEx; the path information is mangled along the way, so the referenced files cannot be read.

And even if we could read them, CorEx shouldn't be processing this kind of data. We would need a primitive that actually uses the numeric data inside the files.
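A primitive along those lines might, as a rough sketch, read each referenced x,y CSV and emit simple per-column summary statistics instead of filename tokens. This is an assumed design for illustration, not an existing DSBox primitive:

```python
import csv
import io
import statistics

# Sketch (assumed design): turn the contents of one small x,y CSV, like the
# train_data_*.csv files above, into numeric summary features per column.
def summarize_csv(text):
    reader = csv.DictReader(io.StringIO(text))
    columns = {}
    for row in reader:
        for name, value in row.items():
            columns.setdefault(name, []).append(float(value))
    features = {}
    for name, values in columns.items():
        features[name + '_mean'] = statistics.mean(values)
        features[name + '_std'] = statistics.stdev(values) if len(values) > 1 else 0.0
    return features

sample = "x,y\n2.0456,0.6391\n-1.8466,0.6184\n2.6007,0.7949\n"
print(summarize_csv(sample))
```

A real primitive would resolve each filename against `location_base_uris` from the metadata and emit one feature row per dataset row, but the core idea is the same: featurize the file contents, not the file names.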

RqS commented 6 years ago

CleaningFeaturizer no longer splits filenames.