Closed kyao closed 6 years ago
What dataset is this run on? The log file path seems to be empty.
uu2_gp_hyperparameter_estimation
This problem might not stem from CorEx at all.
The uu2 dataset's learning data looks something like this:
d3mIndex,gpDataFile,amplitude,lengthscale
0,train_data_934.csv,0.6115757969678771,2.2957860332947786
1,train_data_935.csv,0.026343424234522232,0.6041732289631595
2,train_data_936.csv,0.15260382863242258,1.6483227666863358
3,train_data_937.csv,1.1312855843919003,2.70460765772802
4,train_data_938.csv,1.2752346828569412,0.7611034560553084
where each CSV file looks like this:
x,y
2.0456766623818723,0.6391782096512566
-1.8466392232763873,0.6184222618837352
2.6007827213613983,0.794930515235289
-7.671741163940858,1.5133898945628221
-1.1978984838353632,0.2407517958498579
5.686026761365657,0.1456890019598691
-7.774695501783108,1.6665431334349772
After going through a pipeline consisting of
Denormalize, DatasetToDataFrame, ExtractColumnsBySemanticTypes, Profiler, CleaningFeaturizer
CorEx receives input that looks like this:
0 934.csv 719128702812473 7254263258353223 93970...
1 935.csv 967572757398724 3424373641006067 31385...
2 936.csv 655396850273306 9111688949207661 80250...
3 937.csv 937281788646402 101637380872502 624434...
4 938.csv 181595581405846 5033496891137835 70715...
with the following column metadata:
<FrozenOrderedDict OrderedDict([('structural_type', <class 'str'>), ('name', 'filename'), ('location_base_uris', ('file:///nfs1/dsbox-repo/data/datasets/seed_datasets_current/uu2_gp_hyperparameter_estimation/uu2_gp_hyperparameter_estimation_dataset/tables/gp_data_tables/',)), ('media_types', ('text/csv',)), ('semantic_types', ('https://metadata.datadrivendiscovery.org/types/FileName', 'https://metadata.datadrivendiscovery.org/types/Table', 'https://metadata.datadrivendiscovery.org/types/Attribute', 'https://metadata.datadrivendiscovery.org/types/CanBeSplitByPunctuation')), ('most_common_tokens', (<FrozenOrderedDict OrderedDict([('name', 'train_data_0.csv'), ('count', 1)])>, <FrozenOrderedDict OrderedDict([('name', 'train_data_1.csv'), ('count', 1)])>, <FrozenOrderedDict OrderedDict([('name', 'train_data_10.csv'), ('count', 1)])>, <FrozenOrderedDict OrderedDict([('name', 'train_data_100.csv'), ('count', 1)])>, <FrozenOrderedDict OrderedDict([('name', 'train_data_101.csv'), ('count', 1)])>, <FrozenOrderedDict OrderedDict([('name', 'train_data_102.csv'), ('count', 1)])>, <FrozenOrderedDict OrderedDict([('name', 'train_data_103.csv'), ('count', 1)])>, <FrozenOrderedDict OrderedDict([('name', 'train_data_104.csv'), ('count', 1)])>, <FrozenOrderedDict OrderedDict([('name', 'train_data_105.csv'), ('count', 1)])>, <FrozenOrderedDict OrderedDict([('name', 'train_data_106.csv'), ('count', 1)])>)), ('number_of_tokens_containing_numeric_char', 1000), ('ratio_of_tokens_containing_numeric_char', 1.0), ('number_of_values_containing_numeric_char', 1000), ('ratio_of_values_containing_numeric_char', 1.0)])>
<FrozenOrderedDict OrderedDict([('structural_type', <class 'str'>), ('name', '1_punc_0'), ('semantic_types', ('https://metadata.datadrivendiscovery.org/types/CategoricalData', 'https://metadata.datadrivendiscovery.org/types/Attribute'))])>
<FrozenOrderedDict OrderedDict([('structural_type', <class 'str'>), ('name', '1_punc_1'), ('semantic_types', ('https://metadata.datadrivendiscovery.org/types/CategoricalData', 'https://metadata.datadrivendiscovery.org/types/Attribute'))])>
<FrozenOrderedDict OrderedDict([('structural_type', <class 'str'>), ('name', '1_punc_2'), ('semantic_types', ('http://schema.org/Text', 'https://metadata.datadrivendiscovery.org/types/Attribute'))])>
In other words, our standard template 'cleans' the file names by splitting them on punctuation and feeds the result into CorEx, so the paths are wrong and the files cannot be read.
And, even if we were, CorEx shouldn't process this kind of data. We would need a primitive that actually uses the numeric data in the files.
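To make "a primitive that actually uses the numeric data" more concrete, here is a rough sketch of what such a step might do: resolve each file name against the `location_base_uris` entry from the column metadata, read the `x,y` table, and return a few numeric features. This is not an existing DSBox or D3M primitive. The function names `resolve_table_path` and `summarize_gp_file` and the choice of features are hypothetical, and it assumes POSIX-style `file://` URIs.

```python
import csv
import statistics
from pathlib import Path
from urllib.parse import unquote, urlparse


def resolve_table_path(base_uri: str, filename: str) -> Path:
    """Join a file:// base URI (as in 'location_base_uris') with a file name.

    Hypothetical helper; assumes a POSIX-style file URI.
    """
    parsed = urlparse(base_uri)
    if parsed.scheme != "file":
        raise ValueError(f"unsupported URI scheme: {parsed.scheme!r}")
    return Path(unquote(parsed.path)) / filename


def summarize_gp_file(path: Path) -> dict:
    """Read one x,y table and return a few illustrative numeric features.

    The feature set here is an arbitrary example, not a recommendation.
    """
    with open(path, newline="") as handle:
        rows = [(float(r["x"]), float(r["y"])) for r in csv.DictReader(handle)]
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    return {
        "n_points": len(rows),
        "x_min": min(xs),
        "x_max": max(xs),
        "y_mean": statistics.fmean(ys),
        "y_std": statistics.pstdev(ys),
    }
```

The key point is that the file name has to reach this step unmodified, together with its base URI. Once the name has been split into `1_punc_*` columns, the path can no longer be reconstructed reliably.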
CleaningFeaturizer no longer splits the filename column.
See /dsbox_efs/runs/seed-2018-07-26-02:04/uu2_gp_hyperparameter_estimation/supporting_files/logs/out.txt
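For illustration only, the kind of check that keeps file name columns from being split could look like the sketch below. This is not the actual CleaningFeaturizer code. It works on plain dicts as a simplified stand-in for the per-column metadata printed above, and `columns_to_split` is a hypothetical name.

```python
FILE_NAME_TYPE = "https://metadata.datadrivendiscovery.org/types/FileName"
SPLITTABLE_TYPE = (
    "https://metadata.datadrivendiscovery.org/types/CanBeSplitByPunctuation"
)


def columns_to_split(column_metadata: list) -> list:
    """Return indices of columns that may be split by punctuation.

    A column is selected only if it is tagged as splittable and is not
    tagged as a file name. Each entry of column_metadata is a plain dict
    with an optional 'semantic_types' sequence (a simplification).
    """
    selected = []
    for index, meta in enumerate(column_metadata):
        types = set(meta.get("semantic_types", ()))
        if SPLITTABLE_TYPE in types and FILE_NAME_TYPE not in types:
            selected.append(index)
    return selected
```

With the `filename` column from this dataset, which carries both the `FileName` and `CanBeSplitByPunctuation` types, the function would skip it and leave the name intact.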