serbanstan closed this issue 6 years ago.
A quick fix is to move the CastToType (float) step before SKPCA, since SKPCA requires all columns to be numerical.
But there is a subtle bug that needs to be addressed in the future. Some runs of the same pipeline succeed while others fail: on some runs the Encoder encodes a particular column, while on other runs it does not. I think the difference is that some runs use a subset of the dataset while others use a different subset or the entire dataset, and those particular columns fall right on the boundary of the rule for deciding whether a column is categorical (a sketch of this boundary effect follows below).
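For illustration, here is a minimal sketch of how a unique-value-ratio rule can flip between a subset and the full dataset. The looks_categorical helper and the 0.1 threshold are made up for illustration; the profiler's actual rule and threshold differ:

import pandas as pd

def looks_categorical(col: pd.Series, ratio_threshold: float = 0.1) -> bool:
    # Treat a column as categorical when the fraction of distinct values is small.
    return col.nunique() / len(col) <= ratio_threshold

full = pd.Series(list(range(54)) * 10)  # 54 distinct values over 540 rows
subset = full.iloc[:270]                # the same 54 values over 270 rows

print(looks_categorical(full))    # True:  54/540 = 0.10, exactly on the boundary
print(looks_categorical(subset))  # False: 54/270 = 0.20, over the threshold

The same column is encoded on one run and passed through as-is on another, which matches the intermittent failures described above.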
I'm guessing your advice is for https://github.com/usc-isi-i2/dsbox-ta2/issues/157
I guess it is for both. The reason all of our pipelines fail for LL0_6332 is that dsbox_generic_steps allows str columns to reach SKPCA.
Unfortunately, this issue isn't solved by swapping PCA and cast_to_type: sometimes SKMaxAbsScaler also breaks with a similar error.
[INFO] Push@cache: ('d3m.primitives.dsbox.Profiler', 2889272456499411198)
shape: (540, 39)
[INFO] Push@cache: ('d3m.primitives.dsbox.CleaningFeaturizer', -6995438698222877734)
shape: (540, 40)
[INFO] Push@cache: ('d3m.primitives.dsbox.CorexText', 5801274934117132541)
shape: (540, 49)
[INFO] Push@cache: ('d3m.primitives.dsbox.Encoder', 2401217275340853352)
shape: (540, 138)
[INFO] Push@cache: ('d3m.primitives.dsbox.MeanImputation', 655403304700757436)
shape: (540, 138)
[INFO] Push@cache: ('d3m.primitives.sklearn_wrap.SKMaxAbsScaler', 2018003916215478763)
Traceback (most recent call last):
File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 552, in evaluate_pipeline
evaluation_result = self._evaluate(configuration, cache, dump2disk)
File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 762, in _evaluate
fitted_pipeline2.fit(cache=cache, inputs=[self.all_dataset])
File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/pipeline/fitted_pipeline.py", line 94, in fit
self.runtime.fit(**arguments)
File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/runtime.py", line 209, in fit
primitive_arguments
File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/runtime.py", line 303, in _primitive_step_fit
produce_result = model.produce(**produce_params)
File "/nfs1/dsbox-repo/stan/sklearn-wrap/sklearn_wrap/SKMaxAbsScaler.py", line 118, in produce
output = clf.fit_transform(sk_inputs)
File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/base.py", line 517, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 810, in fit
return self.partial_fit(X, y)
File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 827, in partial_fit
estimator=self, dtype=FLOAT_DTYPES)
File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
_assert_all_finite(array)
File "/nfs1/dsbox-repo/stan/miniconda/envs/dsbox-devel-710/lib/python3.6/site-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
" or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
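For reference, the underlying sklearn check is easy to trigger outside the pipeline; a minimal repro of the same ValueError, using plain sklearn and no dsbox code:

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[1.0, 2.0],
              [np.nan, 4.0]])
# Raises ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
MaxAbsScaler().fit_transform(X)

So any NaN that survives to the scaler step will kill the pipeline, regardless of which primitive introduced it.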
Can you print out the metadata associated with the dataframe?
You can print the metadata by using dataframe.metadata.pretty_print()
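A sketch of both calls, assuming a d3m container DataFrame bound to the name dataframe (the column index 22 is just an example):

from d3m.metadata import base as metadata_base

# Print the metadata for the whole dataframe.
dataframe.metadata.pretty_print()

# Or query a single column's metadata directly, e.g. column 22.
print(dataframe.metadata.query((metadata_base.ALL_ELEMENTS, 22)))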
I'm attaching a log file with the output from the run. It's too long to post here.
This output can be reproduced by adding the following code at line 182 in runtime.py:
try:
    print("step: ", i)
    print("shape: ", primitive_arguments['inputs'].shape)
    # pretty_print() writes to stdout itself, so it doesn't need to be wrapped in print().
    primitive_arguments['inputs'].metadata.pretty_print()
except Exception:
    # Some steps (e.g. Denormalize) receive a Dataset rather than a DataFrame, so shape is unavailable.
    print("shape: N/A")
It seems the error stems from the following behavior: Profiler adds NaNs, and no primitive coming after it in the pipeline is able to eliminate them.
[INFO] Now are training the pipeline with all dataset and saving the pipeline.
step: 0
shape: N/A
[INFO] Push@cache: ('d3m.primitives.dsbox.Denormalize', 367880002551074194)
step: 1
shape: N/A
[INFO] Push@cache: ('d3m.primitives.datasets.DatasetToDataFrame', 2027443791237794836)
step: 2
shape: (540, 41)
nans in df: False
[INFO] Push@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', 998975731857451977)
step: 3
shape: (540, 41)
nans in df: False
[INFO] Push@cache: ('d3m.primitives.data.ExtractColumnsBySemanticTypes', 8174239061367614036)
step: 4
shape: (540, 39)
nans in df: False
[INFO] Push@cache: ('d3m.primitives.dsbox.Profiler', -4156860004676617922)
step: 5
shape: (540, 39)
nans in df: True
[INFO] Push@cache: ('d3m.primitives.dsbox.CleaningFeaturizer', -286274654852111623)
step: 6
shape: (540, 40)
nans in df: True
[INFO] Push@cache: ('d3m.primitives.dsbox.CorexText', 5354787078332680397)
step: 7
shape: (540, 49)
nans in df: True
[INFO] Push@cache: ('d3m.primitives.dsbox.Encoder', -5935975047104356091)
step: 8
shape: (540, 138)
nans in df: True
[INFO] Push@cache: ('d3m.primitives.dsbox.MeanImputation', 1477686608655536678)
step: 9
shape: (540, 138)
nans in df: True
[INFO] Push@cache: ('d3m.primitives.sklearn_wrap.SKMaxAbsScaler', 8867867557532755693)
Traceback (most recent call last):
File "/nfs1/dsbox-repo/stan/dsbox-ta2/python/dsbox/template/search.py", line 552, in evaluate_pipeline
evaluation_result = self._evaluate(configuration, cache, dump2disk)
Can you point me to your output/log directory?
/nas/home/stan/dsbox/runs2/output-ll0/LL0_6332_cylinder_bands
It's a bug in the profiler. It determines that the caliper column is a float and sets the semantic type http://schema.org/Float, but in the process it removes the column's Attribute semantic type:
(__ALL_ELEMENTS__, 22)
Metadata:
{
  "name": "caliper",
  "structural_type": "float",
  "semantic_types": [
    "http://schema.org/Float"
  ]
}
Without the Attribute semantic type, the imputer will not operate on that column. It's too late to change the profiler, but a workaround is to modify runtime.py: whenever we see a float semantic type, make sure the column also has the Attribute semantic type (a sketch of this fix follows the snippet below):
"semantic_types": [
"http://schema.org/Float",
"https://metadata.datadrivendiscovery.org/types/Attribute"
]
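A minimal sketch of that runtime workaround. The name fix_float_semantic_types is hypothetical, and this assumes the d3m metadata API where DataMetadata.query and DataMetadata.update take an (ALL_ELEMENTS, column) selector; the exact signatures vary across d3m versions:

from d3m import container
from d3m.metadata import base as metadata_base

FLOAT_TYPE = 'http://schema.org/Float'
ATTRIBUTE_TYPE = 'https://metadata.datadrivendiscovery.org/types/Attribute'

def fix_float_semantic_types(df: container.DataFrame) -> container.DataFrame:
    # Ensure every Float column also carries the Attribute semantic type,
    # so downstream primitives like MeanImputation will operate on it.
    for col in range(df.shape[1]):
        selector = (metadata_base.ALL_ELEMENTS, col)
        semantic_types = df.metadata.query(selector).get('semantic_types', ())
        if FLOAT_TYPE in semantic_types and ATTRIBUTE_TYPE not in semantic_types:
            # Metadata objects are immutable; update() returns a new instance.
            df.metadata = df.metadata.update(
                selector, {'semantic_types': semantic_types + (ATTRIBUTE_TYPE,)})
    return df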
That's my bug in the profiler. Fixed in the profiler: https://github.com/usc-isi-i2/dsbox-cleaning/blob/961d92886916dfbc0a0e1bfd2a51e9c4677301f7/dsbox/datapreprocessing/cleaner/data_profile.py#L345
I am working on adding the workaround.
Some columns fall right on the boundary of the rule that detects whether a column is categorical; we need to reconsider the threshold later.
For the bug in the profiler, I added a workaround in the runtime.
We are unable to find successful pipelines on this dataset. However, size doesn't seem to be the problem.