Splitting dataframe into pre-post is inconsistent with covid example

pymc-labs / CausalPy

A Python package for causal inference in quasi-experimental settings

https://causalpy.readthedocs.io

Apache License 2.0

Splitting dataframe into pre-post is inconsistent with covid example #273

Open rlaker opened 7 months ago

rlaker commented 7 months ago

In the COVID excess deaths example, the data is split into pre- and post-treatment sets with:

self.datapre = data[data.index <= self.treatment_time]
self.datapost = data[data.index > self.treatment_time]

However, inspecting result.datapre shows that 2020-01-01 is included, even though that row is labelled pre=False.

If 2020-01-01 should be in the post set, as the df says, then the splitting should be changed to:

self.datapre = data[data.index < self.treatment_time]
self.datapost = data[data.index >= self.treatment_time]

If it should be in the pre set, then deaths_and_temps_england_wales.csv needs to be updated so that pre=True for this date.
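The boundary behaviour described above can be reproduced on a toy frame. This is only a sketch: the monthly index, the `deaths` values and the `pre` labels are made up for illustration and are not taken from deaths_and_temps_england_wales.csv.

```python
import pandas as pd

# Toy monthly data (illustrative values, not the real CSV).
index = pd.date_range("2019-11-01", periods=4, freq="MS")
data = pd.DataFrame({"deaths": [10, 11, 12, 13]}, index=index)
treatment_time = pd.Timestamp("2020-01-01")

# Assumed labelling: rows strictly before the treatment date are "pre".
data["pre"] = data.index < treatment_time

# Current split: the treatment date itself ends up in the pre set.
datapre_current = data[data.index <= treatment_time]
datapost_current = data[data.index > treatment_time]

# Proposed split: the treatment date moves to the post set.
datapre_proposed = data[data.index < treatment_time]
datapost_proposed = data[data.index >= treatment_time]

print(datapre_current["pre"].all())   # False: one row is labelled pre=False
print(datapre_proposed["pre"].all())  # True: labels and split agree
```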

drbenvincent commented 7 months ago

Thanks for this @rlaker.

I agree with the second code snippet - date < treatment_time should be classed as pre, and date >= treatment_time should be classed as post.

It could even be worth adding some input validation and including that under test coverage. Did you want to have a go at this and submit a PR?
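One possible shape for that validation, as a rough sketch only: `split_pre_post` is a hypothetical helper, not an existing CausalPy function, and the specific checks (both sets must be non-empty) are an assumption about what is worth validating.

```python
import pandas as pd


def split_pre_post(data, treatment_time):
    """Hypothetical helper: rows before treatment_time are pre, the rest are post."""
    datapre = data[data.index < treatment_time]
    datapost = data[data.index >= treatment_time]
    # Assumed validation: a split with an empty side is almost certainly a mistake.
    if datapre.empty:
        raise ValueError("No observations before treatment_time.")
    if datapost.empty:
        raise ValueError("No observations at or after treatment_time.")
    return datapre, datapost


# Usage on toy data (illustrative values only).
index = pd.date_range("2019-11-01", periods=4, freq="MS")
toy = pd.DataFrame({"deaths": [10, 11, 12, 13]}, index=index)
datapre, datapost = split_pre_post(toy, pd.Timestamp("2020-01-01"))
```

A test could then assert that the treatment date is the first row of the post set and that an out-of-range treatment_time raises ValueError.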

drbenvincent commented 7 months ago

Though it could be worth thinking about whether we want treatment to always be defined with >=, or whether there will be occasions where a user might want > instead.
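If both conventions need to be supported, one option is a flag with >= as the default. The sketch below is only illustrative: `split_at` and its `treatment_in_post` parameter are hypothetical names, not part of CausalPy.

```python
import pandas as pd


def split_at(data, treatment_time, treatment_in_post=True):
    """Hypothetical helper: choose which side the treatment time itself falls on."""
    if treatment_in_post:
        is_pre = data.index < treatment_time   # post is >= treatment_time
    else:
        is_pre = data.index <= treatment_time  # post is > treatment_time
    return data[is_pre], data[~is_pre]


# Toy data (illustrative values only).
index = pd.date_range("2019-11-01", periods=4, freq="MS")
toy = pd.DataFrame({"deaths": [10, 11, 12, 13]}, index=index)
t = pd.Timestamp("2020-01-01")

pre_default, post_default = split_at(toy, t)
pre_alt, post_alt = split_at(toy, t, treatment_in_post=False)
```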

rlaker commented 7 months ago

Happy to have a go!

rlaker commented 7 months ago

Perhaps the confusion is that the example dataframe defines its own pre column, which is then ignored by the PrePostFit class, which uses treatment_time instead. Maybe the user should create the pre column and the class should do:

self.datapre = data[data['pre'] == True]
self.datapost = data[data['pre'] == False]
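A sketch of how the column-based split could look, assuming the `pre` column is already of boolean dtype (a column of strings such as "True"/"False" would need converting first). The toy values are illustrative, and the consistency check against treatment_time is an optional extra rather than something proposed in this thread.

```python
import pandas as pd

# Toy data with a user-supplied boolean `pre` column (illustrative values only).
index = pd.date_range("2019-11-01", periods=4, freq="MS")
data = pd.DataFrame(
    {"deaths": [10, 11, 12, 13], "pre": [True, True, False, False]},
    index=index,
)
treatment_time = pd.Timestamp("2020-01-01")

# Boolean-mask indexing; equivalent to comparing with True/False for a bool column.
datapre = data[data["pre"]]
datapost = data[~data["pre"]]

# Optional check: do the user's labels agree with a split on treatment_time?
labels_match = bool((data["pre"] == (data.index < treatment_time)).all())
```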