pymc-devs / pymc

Bayesian Modeling and Probabilistic Programming in Python
https://docs.pymc.io/
Other
8.72k stars 2.01k forks source link

BUG: pm.Data converting matrix int to float #6874

Open mattijsdp opened 1 year ago

mattijsdp commented 1 year ago

Describe the issue:

When calling pm.Data on a matrix (pd.DataFrame) where all columns are of type int it still returns a tensor of type float64. This is not true when passing a pd.Series. This correctly returns a vector of type int

Reproduceable code example:

import pytensor.tensor as at

df_ex = pd.DataFrame({"cat_a": [0,1,2,3], "cat_b": [1,2,3,4]})

Error message:

import pytensor.tensor as at
import pandas as pd
import pymc as pm

df_ex = pd.DataFrame({"cat_a": [0,1,2,3], "cat_b": [1,2,3,4]})

with pm.Model() as model:
    cats_pt = pm.Data("cats", df_ex)
    print(cats_pt.type) # wrongly returns Matrix(float64, shape=(4, 2))

# vs
with pm.Model() as model:
    cats_pt = pm.Data("cats", df_ex.cat_a)
    print(cats_pt.type) # correctly returns Vector(int32, shape=(4,))

PyMC version information:

Python version: Python 3.9.7 pymc.version: '5.7.1' pytensor.version: ''2.14.2' Installed using conda Running on linux

Context for the issue:

As a temporary solution one can cast the resulting Matrix but it seems to me that the way it currently works isn't desirable?

welcome[bot] commented 1 year ago

Welcome Banner :tada: Welcome to PyMC! :tada: We're really excited to have your input into the project! :sparkling_heart:
If you haven't done so already, please make sure you check out our Contributing Guidelines and Code of Conduct.

ricardoV94 commented 1 year ago

You are passing a list. I think it will work fine if you pass a numpy array with the right dtype?

mattijsdp commented 1 year ago

@ricardoV94 so yes and no. Using .values to convert the DataFrame to a numpy array does indeed work. But if I pass a list of lists of ints straight into pm.Data it also works (pm.Data("cats", [[0,1],[1,2]])). It seems only passing a pandas DataFrame doesn't work? That's a bit weird, no?

Thanks for the quick response!

ricardoV94 commented 1 year ago

@mattijsdp you're right. It seems to be a problem in the call to convert_observed_data by pm.Data:

https://github.com/pymc-devs/pymc/blob/2303bf97de53f048f602916d69d9da93485ae8a3/pymc/pytensorf.py#L136-L145

Your example falls into the last else branch. The whole logic is a bit over-involved, because pt.as_tensor is just as happy to take a pandas Dataframe. I am not sure pm.Data should be calling this helper at all, since not all uses of it are related to observed data.

mattijsdp commented 1 year ago

@ricardoV94 I'm not sure whether the helper should be called at all but if it should be called shouldn't line 137 there be if hasattr(ret, "dtype") so use ret and not data. This would solve it as the lines above would have converted the DataFrame to a np array.

ricardoV94 commented 1 year ago

I am not sure we should be doing any casting to begin with, so maybe I would remove that whole bottom part of the function and see if any tests break