py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License

PerformanceWarning sometimes occurs in do_sample. #952

Closed: yoshiakifukushima closed this issue 1 year ago

yoshiakifukushima commented 1 year ago

Is your feature request related to a problem? Please describe.

When I call the sampler.do_sample method, the following warning sometimes appears.

dowhy/utils/propensity_score.py:123: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling frame.insert many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
data[dummies.columns] = dummies
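
For context, the same kind of warning can usually be provoked outside DoWhy by adding many columns to a DataFrame one at a time. The following is a minimal sketch, not taken from the DoWhy code base: the frame, the column names, and the count of 150 columns are all made up for illustration, and whether the warning actually fires may depend on the installed pandas version.

```python
import warnings

import pandas as pd

base = pd.DataFrame({"a": range(5)})

# Column-by-column assignment: each new column is stored separately,
# which is the pattern the warning message describes as fragmentation.
fragmented = base.copy()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    for i in range(150):  # 150 is an arbitrary illustrative count
        fragmented[f"col{i}"] = i

n_perf = sum(
    issubclass(w.category, pd.errors.PerformanceWarning) for w in caught
)
print("PerformanceWarning count:", n_perf)

# Alternative suggested by the warning text: build the new columns
# first, then join them to the original frame in a single call.
new_cols = pd.DataFrame({f"col{i}": i for i in range(150)}, index=base.index)
joined = pd.concat([base, new_cols], axis=1)
```

Both approaches produce frames with the same columns and values; only the way the columns are attached differs.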

Steps to reproduce the behavior

# do sampling
sampler = WeightingSampler(
    df,
    causal_model=model,
    keep_original_treatment=False,
    variable_types=variable_types,
)
interventional_df = sampler.do_sample(x=do)

Describe alternatives you've considered

In binarize_discrete(), using pd.concat() may be a better solution. (When I tried it in my local environment, the warning disappeared and performance improved slightly.)

def binarize_discrete(data, covariates, variable_types):
    to_remove = []
    if variable_types:
        for variable in covariates:
            variable_type = variable_types[variable]
            if variable_type in ['d', 'o', 'u']:
                dummies = get_dummies(data[variable])
                dummies.columns = [variable + str(col) for col in dummies.columns]
                dummies = dummies[dummies.columns[:-1]]
                covariates += list(dummies.columns)
                for var_name in dummies.columns:
                    variable_types[var_name] = 'b'
                # data[dummies.columns] = dummies
                data = pd.concat((data,dummies), axis=1) # <- use pd.concat()
                to_remove.append(variable)
    for variable in to_remove:
        covariates.remove(variable)
        del data[variable]
    return data, covariates
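
The snippet above still calls pd.concat once per discrete covariate. A possible variation, shown here only as a sketch and not as the code that was merged into DoWhy, collects the dummy frames in a list and concatenates them in a single call. The function name binarize_discrete_once and the toy data are hypothetical; unlike the snippet above, this version leaves the input DataFrame and covariate list untouched and returns new objects, although it still updates variable_types in place.

```python
import pandas as pd


def binarize_discrete_once(data, covariates, variable_types):
    """Hypothetical variant: one-hot encode discrete covariates, one concat."""
    covariates = list(covariates)  # copy so the caller's list is not mutated
    dummy_frames = []
    to_remove = []
    if variable_types:
        for variable in list(covariates):  # iterate over a snapshot
            if variable_types[variable] in ("d", "o", "u"):
                dummies = pd.get_dummies(data[variable])
                dummies.columns = [variable + str(col) for col in dummies.columns]
                # Drop the last level so it acts as the reference category.
                dummies = dummies[dummies.columns[:-1]]
                covariates += list(dummies.columns)
                for var_name in dummies.columns:
                    variable_types[var_name] = "b"  # mutates the caller's dict
                dummy_frames.append(dummies)
                to_remove.append(variable)
    if dummy_frames:
        # Single join of the original frame and every dummy frame.
        data = pd.concat([data, *dummy_frames], axis=1)
    data = data.drop(columns=to_remove)
    covariates = [c for c in covariates if c not in to_remove]
    return data, covariates


# Toy usage with made-up data: "color" is discrete, "size" is continuous.
df = pd.DataFrame({"color": ["r", "g", "b", "r"], "size": [1, 2, 3, 4]})
types = {"color": "d", "size": "c"}
out, covs = binarize_discrete_once(df, ["color", "size"], types)
print(list(out.columns), covs)
```

Deferring the concatenation keeps the number of DataFrame copies independent of how many discrete covariates there are, which matters mostly when many variables are binarized.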

Version information:

DoWhy version 0.8

amit-sharma commented 1 year ago

Good point, @yoshiakifukushima. I agree that using pd.concat would be a more efficient solution. Would you like to raise a PR for this? We can update the dowhy repo to include your updated code.

yoshiakifukushima commented 1 year ago

@amit-sharma Thanks for confirming. I have raised the following PR: https://github.com/py-why/dowhy/pull/955

amit-sharma commented 1 year ago

The PR is merged, thank you.