syanga / pycit

(Conditional) Independence testing & Markov blanket feature selection using k-NN mutual information and conditional mutual information estimators. Supports continuous, discrete, and mixed data, as well as multiprocessing.
MIT License

CI test with mixed data #4

Open kenneth-lee-ch opened 3 weeks ago

kenneth-lee-ch commented 3 weeks ago

Can someone show me how to use this package to conduct a conditional independence test with mixed data? Suppose I have the following data.

import pandas as pd
import numpy as np

# Set seed for reproducibility
np.random.seed(42)

# Generate continuous data
data_size = 100
continuous_data_1 = np.random.normal(loc=50, scale=10, size=data_size)
continuous_data_2 = np.random.normal(loc=30, scale=5, size=data_size)

# Generate categorical data
categories = np.random.choice(['Category A', 'Category B', 'Category C'], size=data_size)

# Create DataFrame
df = pd.DataFrame({
    'Continuous_1': continuous_data_1,
    'Continuous_2': continuous_data_2,
    'Category': categories
})

syanga commented 3 weeks ago

Currently, the package only handles numerical data in NumPy arrays. A quick fix is to convert the categories to numbers first, so replace

categories = np.random.choice(['Category A', 'Category B', 'Category C'], size=data_size)

with something like

categories = np.random.choice([1, 2, 3], size=data_size)

before calling the methods.
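
When the string labels already live in a DataFrame, as in the question, the same conversion can be done after the fact. The following is a minimal sketch using `pd.factorize`; the variable names `x`, `y`, and `z` are illustrative, and the array shape the package expects is not stated in this thread, so plain 1-D arrays are an assumption here.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
data_size = 100
df = pd.DataFrame({
    'Continuous_1': np.random.normal(loc=50, scale=10, size=data_size),
    'Continuous_2': np.random.normal(loc=30, scale=5, size=data_size),
    'Category': np.random.choice(['Category A', 'Category B', 'Category C'], size=data_size),
})

# pd.factorize maps each distinct label to an integer code (0, 1, 2, ...)
# and returns the labels so the mapping can be reversed later.
codes, labels = pd.factorize(df['Category'])

# Plain NumPy arrays (assumption: 1-D float arrays are acceptable input).
x = df['Continuous_1'].to_numpy()
y = df['Continuous_2'].to_numpy()
z = codes.astype(float)

# labels[codes] recovers the original strings.
print(labels[codes][:5].tolist())
print(z[:5])
```

The integer codes are arbitrary identifiers, so their ordering carries no meaning beyond telling the categories apart.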

Feel free to open a PR if you're interested in adding direct support for different datatypes, and improving the support for pandas dataframes.

kenneth-lee-ch commented 3 weeks ago

How does the library recognize [1, 2, 3] as categories rather than as continuous data?

syanga commented 2 weeks ago

It uses k-nearest-neighbor (k-NN) distances to detect categories. If the k-NN distance of a point with value 1 is 0, meaning at least k other samples have exactly the same value, then the method assumes that there is a discrete component at 1. This extends to vectors too: if the vector [1, 2, 3] appears many times in your data, then the estimator will assume that there is a discrete component at the point [1, 2, 3].

Hope that answers the question.
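
The zero-distance idea described above can be illustrated with a few lines of NumPy. This is only a sketch of the principle, not the package's own code: the helper `kth_neighbor_distance` is hypothetical, and the max-norm and `k = 3` are assumptions chosen for the example.

```python
import numpy as np

def kth_neighbor_distance(samples, k):
    """Distance from each sample to its k-th nearest neighbor (max-norm), excluding itself."""
    samples = np.asarray(samples, dtype=float)
    if samples.ndim == 1:
        samples = samples.reshape(-1, 1)
    # Pairwise max-norm distances, shape (n, n).
    dists = np.abs(samples[:, None, :] - samples[None, :, :]).max(axis=2)
    # After sorting each row, column 0 is a zero (the point itself),
    # so column k holds the distance to the k-th other sample.
    return np.sort(dists, axis=1)[:, k]

rng = np.random.default_rng(0)
discrete = rng.choice([1, 2, 3], size=100)   # integer-coded categories
continuous = rng.normal(size=100)            # values that do not repeat
repeated_vectors = np.tile([1, 2, 3], (5, 1))  # the vector [1, 2, 3] five times

k = 3
d_disc = kth_neighbor_distance(discrete, k)
d_cont = kth_neighbor_distance(continuous, k)
d_vec = kth_neighbor_distance(repeated_vectors, k)

# Repeated values give a k-NN distance of exactly zero; continuous samples do not.
print("discrete all zero:  ", bool((d_disc == 0).all()))
print("continuous all > 0: ", bool((d_cont > 0).all()))
print("repeated vector zero:", bool((d_vec == 0).all()))
```

This is why integer codes such as 1, 2, 3 are treated as discrete: each value repeats many times, so its neighbors sit at distance zero, whereas a continuous variable almost never produces exact ties.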