Open kenneth-lee-ch opened 2 months ago
Currently, the package only handles numerical data in numpy arrays. A quick fix would be to convert the categories to numbers first, so replace
categories = np.random.choice(['Category A', 'Category B', 'Category C'], size=data_size)
with something like
categories = np.random.choice([1, 2, 3], size=data_size)
before calling the methods.
Feel free to open a PR if you're interested in adding direct support for different datatypes, and improving the support for pandas dataframes.
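For anyone who already has string labels rather than generating them, here is a minimal sketch of the conversion step using plain NumPy. It assumes only that the estimator accepts a numeric array; the variable names `labels` and `codes` are illustrative and not part of the package.

```python
import numpy as np

rng = np.random.default_rng(0)
data_size = 1000

# String-valued categorical column, as in the original snippet.
categories = rng.choice(['Category A', 'Category B', 'Category C'], size=data_size)

# np.unique returns the sorted unique labels; with return_inverse=True it also
# returns, for every sample, the index of its label in that sorted array.
labels, codes = np.unique(categories, return_inverse=True)

# Cast to float so the column can sit next to continuous columns in one array.
codes = codes.astype(float)
```

`labels[codes.astype(int)]` recovers the original strings, so the mapping can be undone after the analysis.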
How does the library recognize [1,2,3] as categories rather than as continuous data?
It uses k-nearest-neighbor distances to detect categories. If the k-NN distance of a point with value 1 is 0, meaning at least k other samples have exactly the same value, then the method assumes that there is a discrete component at 1. This extends to vectors too: if the vector [1,2,3] appears many times in your data, then the estimator will assume that there is a discrete component at the point [1,2,3].
Hope that answers the question.
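To make the zero-distance idea concrete, here is a rough illustration in plain NumPy. It is not the package's code: the helper `kth_neighbor_distance`, the max-norm, and the choice `k = 5` are all assumptions made for this sketch, and the brute-force distance matrix is only suitable for small samples.

```python
import numpy as np

def kth_neighbor_distance(x, k):
    """Distance from each sample to its k-th nearest neighbor (max-norm).

    Hypothetical helper for illustration; brute force, O(n^2) memory.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    # Pairwise Chebyshev (max-norm) distances between all samples.
    dists = np.abs(x[:, None, :] - x[None, :, :]).max(axis=-1)
    dists.sort(axis=1)
    # Column 0 is the sample's zero distance to itself, so column k
    # holds the distance to the k-th other sample.
    return dists[:, k]

rng = np.random.default_rng(0)
discrete = rng.choice([1, 2, 3], size=200)   # values repeat many times
continuous = rng.normal(size=200)            # exact repeats are not expected

k = 5
discrete_ties = kth_neighbor_distance(discrete, k) == 0
continuous_ties = kth_neighbor_distance(continuous, k) == 0
```

Every entry of `discrete_ties` is `True` because each value has many exact duplicates, while `continuous_ties` is `False` throughout. That difference is the signal described above for treating a value as a discrete component.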
Can someone show me how to use this package to conduct a conditional independence test with mixed data? Suppose I have the following data.