synthesized-io / insight

🧿 Metrics & Monitoring of Datasets

TypeError for columns with dtype=object that could be inferred as numeric dtype #151

Closed by marqueewinq 9 months ago

marqueewinq commented 9 months ago

Error:

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Suppose I have the following column (it's from the NOAA dataset): ['718270', '718090', '718090', '710680', '475840'] (in the original dataset the column also contains some NaNs).
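For reference, a minimal sketch of how such a column ends up with dtype=object (the NaN is added here only to mirror the original dataset):

    import numpy as np
    import pandas as pd

    sr = pd.Series(['718270', '718090', '718090', '710680', '475840', np.nan])
    print(sr.dtype)  # object: the values are strings, even though they look numeric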

The original dtype of this column is 'object' (because each item is a string). In KullbackLeiblerDivergence we have the following code:

    def _compute_metric(self, sr_a: pd.Series, sr_b: pd.Series):
        (p, q) = zipped_hist((sr_a, sr_b), check=self.check)
        ...

In zipped_hist we have:

    joint = pd.concat(data)
    is_continuous = check.continuous(pd.Series(joint))

    if is_continuous:
        np.histogram(...)

The error is caused by the dtype=object column being passed into the np.histogram function. It happens because the check.continuous method is used to decide whether the column is continuous, and inside that method check.infer_dtype converts the dtype (to int in this case). However, the original column is never converted, so np.histogram still receives a series with dtype=object.
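The failure can be reproduced outside insight with just numpy and pandas (a minimal sketch, using the station IDs from above):

    import numpy as np
    import pandas as pd

    # Numeric-looking strings are kept as dtype=object.
    joint = pd.Series(['718270', '718090', '718090', '710680', '475840'], dtype=object)

    # Raises:
    # TypeError: ufunc 'isfinite' not supported for the input types, and the
    # inputs could not be safely coerced to any supported types according to
    # the casting rule ''safe''
    np.histogram(joint)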

Is this a bug? If so, would something like this be a solution?

    joint = pd.concat(data)
    is_continuous = check.continuous(pd.Series(joint))
    # Convert the concatenated series and each input series to their inferred
    # dtype, so np.histogram no longer receives object-dtype data.
    joint = check.infer_dtype(joint)
    data = [check.infer_dtype(series) for series in data]
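
Outside the library, the same idea looks roughly like this (a sketch only; pd.to_numeric is used as a stand-in for check.infer_dtype, whose exact behaviour I am not assuming here):

    import numpy as np
    import pandas as pd

    sr_a = pd.Series(['718270', '718090', '718090'], dtype=object)
    sr_b = pd.Series(['710680', '475840'], dtype=object)
    data = (sr_a, sr_b)

    joint = pd.concat(data)

    # Stand-in for check.infer_dtype: coerce the numeric-looking strings to int64.
    joint = pd.to_numeric(joint)
    data = [pd.to_numeric(sr) for sr in data]

    # With numeric dtypes, binning works without the TypeError.
    bin_edges = np.histogram_bin_edges(joint, bins=10)
    p, q = (np.histogram(sr, bins=bin_edges)[0] for sr in data)
    print(p, q)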