xxwywzy opened 3 years ago
I am experiencing the exact same issue in the notebook "The Causal Story Behind Hotel Booking Cancellations", and I have no clue what is going wrong here. In another notebook, such as https://github.com/microsoft/dowhy/blob/master/docs/source/example_notebooks/dowhy_example_effect_of_memberrewards_program.ipynb, no such problem arises!
@xxwywzy @jbdatascience thanks for raising this issue. It turns out that the dataframe still contained some NA values that led to this error.
To fix this error, simply add `dataset.dropna(inplace=True)` before you load the dataset into DoWhy's `CausalModel`. I've also updated the notebook on GitHub: https://github.com/microsoft/dowhy/blob/master/docs/source/example_notebooks/DoWhy-The%20Causal%20Story%20Behind%20Hotel%20Booking%20Cancellations.ipynb
Why does this error occur in the new version of DoWhy? We have updated how the backdoor variables are selected. In the previous version, DoWhy would select only a minimal set of variables that blocks all the backdoor paths. While this is correct from an identification point of view, it can be statistically more efficient to additionally condition on other eligible variables (e.g., causes of the outcome, effect modifiers, etc.). That is the default behavior in the new version, which meant that the code was conditioning on additional columns of the dataset (one of which had the NA values).
Additionally, `identify_effect()` now outputs all valid backdoor adjustment variable sets. That is why you are seeing more than 200 sets. This is expected behavior; this particular graph just has many effect modifiers, which leads to an explosion in the number of valid backdoor sets.
Got it! Thank you very much for the explanation! I think I have to learn more about the backdoor criterion :)
Thanks for the solution and the explanation of it! I am going to try it out today!
I have a new question about this notebook. The estimation result is a negative number (nearly -0.33). The notebook says this means that if the probability of cancellation was 0 < x < 1, then changing the room causes the probability to go to x + 0.33, so the effect of our treatment is 33 percentage points. However, according to the definition of `estimate_effect`, it represents the amount of change in the outcome value when you intervene and change the treatment, i.e., if we increase the value of the treatment by 1.0, then the outcome will change by the value of the estimate.
Therefore, since the treatment in this example is `different_room_assigned`, if we change this treatment from 0 (False) to 1 (True), shouldn't we get a positive estimate value showing that the cancellation rate rises by 0.33? I am quite confused about the result; could you briefly explain the meaning of the estimation result? Thank you so much!
That's a great question @xxwywzy. Yeah, this result is confusing. I'm paging @Sid-darthvader here, who contributed this notebook.
@Sid-darthvader Can you help us interpret the estimate based on your knowledge of the hotel dataset? It seems that `different_room_assigned` actually decreases the fraction of cancellations. Perhaps it is because most customers who do get assigned a different room actually need to show up at the hotel, and they are less likely to cancel their booking then? If so, then this is a case of selection bias and the graph needs to be updated.
Any reply from Sid? I have the same question as xxwywzy. I followed the page from https://microsoft.github.io/dowhy/readme.html to the hotel booking cancellations post https://towardsdatascience.com/beyond-predictive-models-the-causal-story-behind-hotel-booking-cancellations-d29e8558cbaf, which said: "This tells us that on an average the Probability of a hotel booking being cancelled decreases by ~36% when the Person is assigned the same room compared to the case when he is assigned a different room than what he had chosen during booking." That explanation is really different from the hotel page in the case study notebooks. I also tried an analysis of the effect of breast cancer metastases on death and got a mean value of -0.40; how should I explain that?
@dearfad I did not hear back from @Sid-darthvader. My sense is that the text on the towardsdatascience blog is incorrect and needs to be updated. I updated the hotel notebook in the DoWhy case studies, so I would say you can trust the notebook more.
For your other example, a reasonable prior is that breast cancer metastasis increases the probability of death. If you are obtaining a negative value for its treatment effect, that means you have left out a confounding variable. Here is a good read on this topic, where the authors found that asthma patients had a lower chance of death due to pneumonia (again, a negative treatment effect on death), but then figured out that asthma patients were given better care, and that care was the confounding factor (paper, see section 1, motivation).
Hi, thank you for creating such a great library for causal inference! I am currently starting to learn the library through the provided notebook "The Causal Story Behind Hotel Booking Cancellations". Everything goes well until the identification step. However, when performing identification, the model returns a total of 258 estimands instead of only one estimand as in the given example. Then the estimation throws an error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). Since I am not quite familiar with the library, could you explain what happens and how to fix this error? Thank you very much!
Here is the error: