klarsen1 / MarketMatching

no valid data in post-period using dynamic warping #28

Closed sssiv93 closed 10 months ago

sssiv93 commented 1 year ago

I am getting the following error when I pass the output of best_matches to inference:

Error in stopif(length(post_period) == 0, TRUE, "ERROR: no valid data in the post period") : 
  ERROR: no valid data in the post period
In addition: Warning message:
In max(date) : no non-missing arguments to max; returning -Inf

This is the code I am using to call best_matches.

mm_uk <- best_matches(data=joins_data_mm_uk,
                      id_variable="regions",
                      date_variable="dates",
                      markets_to_be_matched=c("UK"),
                      matching_variable="kpi",
                      parallel=FALSE,
                      warping_limit=1, # warping limit=1
                      dtw_emphasis=1, # rely only on dtw for pre-screening
                      matches=10, # request 10 matches
                      start_match_period="2022-06-01",
                      end_match_period="2022-08-07")

This error only seems to crop up when I set the dtw_emphasis to 1. With any other value of dtw_emphasis, the inference function runs without any problem.

What are the steps you would recommend to try and debug this? From what I can see, there is valid data in the post period for both of the selected control markets.

Thanks!

klarsen1 commented 1 year ago

I’ve never seen this one. How many data points do you have in the post period? You have to consider the fact that DTW will require extra data points for the window.

Also, I generally don’t recommend setting that parameter to 1.

If you have a dataset I’m happy to try it out and see if it works for me.

K
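
One way to answer the question about post-period data points is to count the rows each market has after the matching window. The following is a minimal base-R sketch on made-up data, not output from the reporter's dataset. The `toy` data frame and its values are hypothetical, and the sketch assumes the post period runs from the day after end_match_period through end_post_period.

```r
# Hypothetical toy data in the same long format the thread uses
# (columns "regions", "dates", "kpi"); the values are made up.
set.seed(1)
dates <- seq(as.Date("2022-06-01"), as.Date("2022-08-31"), by = "day")
toy <- data.frame(
  regions = rep(c("UK", "FR", "DE"), each = length(dates)),
  dates   = rep(dates, times = 3),
  kpi     = rnorm(3 * length(dates), mean = 100, sd = 10)
)

end_match_period <- as.Date("2022-08-07")
end_post_period  <- as.Date("2022-08-14")

# Keep rows strictly after the matching window, up to the post-period end,
# and drop rows where the KPI is missing.
post <- toy[toy$dates > end_match_period &
              toy$dates <= end_post_period &
              !is.na(toy$kpi), ]

# Number of usable post-period observations per market
post_counts <- table(post$regions)
print(post_counts)
```

If any market shows only a handful of rows here, a short post period combined with the extra points DTW needs for its window is a plausible cause of the error.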

sssiv93 commented 1 year ago

Unfortunately, I would not be able to share the dataset with you as it is confidential, but thank you for offering!

The post period over which I want to measure whether there was an impact is from "2022-08-07" to "2022-08-14", and therefore I set the end_match_period to "2022-08-14". In my dataset joins_data_mm_uk, I have data going up to the end of 2022, so I assume that would be enough data points?

Thanks for the advice! Can I ask why you would not recommend setting the dtw_emphasis to 1?

sssiv93 commented 1 year ago

I think the issue seems to be something to do with the length of the post period. Is there a minimum size for the post period?

When the post period is only 1 week, I get the above error.

resultsuk <- MarketMatching::inference(matched_markets = mmca,
                                       test_market = c("UK"),
                                       end_post_period = as.Date("2022-08-14"),
                                       nseasons = 7)

When the post period is over a month, I do not get an error anymore:

resultsuk <- MarketMatching::inference(matched_markets = mmca,
                                       test_market = c("UK"),
                                       end_post_period = as.Date("2022-09-14"),
                                       nseasons = 7)

klarsen1 commented 1 year ago

There is. I think it’s 3 (have to check), after dealing with the DTW windows. So you actually need more than 3 if you’re using DTW.

The reason I’m not a fan of DTW for general business use cases is that in most cases you don’t need to match adjacent data points. You’re matching along weeks, days, or months and don’t need to check if there are better adjacent matches.
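
To make "matching adjacent data points" concrete, here is a small textbook DTW distance in base R. It is an illustrative sketch only, not the code MarketMatching runs. The function name `dtw_distance` and its `window` argument are hypothetical, with `window` playing a role that is presumably similar to what warping_limit controls.

```r
# Classic DTW distance with an absolute-difference cost and a band of
# half-width `window` around the diagonal (a Sakoe-Chiba style constraint).
# window = 0 forces a strict point-by-point comparison.
dtw_distance <- function(x, y, window = 1) {
  n <- length(x)
  m <- length(y)
  w <- max(window, abs(n - m))      # band must be wide enough to reach the corner
  D <- matrix(Inf, n + 1, m + 1)    # cumulative cost matrix with a padding row/column
  D[1, 1] <- 0
  for (i in 1:n) {
    for (j in max(1, i - w):min(m, i + w)) {
      cost <- abs(x[i] - y[j])
      # Extend the cheapest of the three allowed predecessor cells
      D[i + 1, j + 1] <- cost + min(D[i, j + 1], D[i + 1, j], D[i, j])
    }
  }
  D[n + 1, m + 1]
}

x <- c(0, 0, 1, 2, 1, 0, 0)
y <- c(0, 1, 2, 1, 0, 0, 0)   # same shape, shifted one step earlier

strict <- dtw_distance(x, y, window = 0)  # equals sum(abs(x - y))
warped <- dtw_distance(x, y, window = 1)  # one step of warping allowed
print(c(strict = strict, warped = warped))
```

With no warping the shifted peak counts as a mismatch. With one step of warping the two series align exactly. For KPIs recorded on the same calendar days or weeks, that extra flexibility is usually unnecessary, which is the point made above.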

sssiv93 commented 1 year ago

Thank you! That makes sense.

Unfortunately, I think my theory above was not correct. I was trying this out on a different dataset and found that it still gives the same error even if I extend the post period to two months.

I also spotted that, for the following set of parameters, if I request 15 matches, it will give me 5 matches but if I request 5 or 10 matches, it will not return any matches.

mm <- best_matches(data=dataset,
                   id_variable="regions",
                   date_variable="dates",
                   markets_to_be_matched=test_posa,
                   matching_variable="kpi",
                   parallel=FALSE,
                   warping_limit=1,
                   dtw_emphasis=1,
                   matches=15,
                   start_match_period="2022-06-05",
                   end_match_period="2022-08-07")

If it is still an option, I will look at renaming some of the variables so that the data can be shared with you. Thanks for your help around this.

klarsen1 commented 1 year ago

OK. First, I’d just drop the DTW stuff. I’m 99% sure you don’t need it. Just use correlations: dtw_emphasis = 0.

Good to have a solid post period.

For me to dig in, I’d need a dataset. This package has lots of downloads and this has never come up before, so the only way I can find the problem is to run it.

K
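
As a rough picture of what relying on correlations means, the sketch below ranks candidate markets by their correlation with a test market. It uses base R on simulated series. The market names and values are hypothetical, and this is not the package's internal ranking code.

```r
set.seed(42)
n <- 60
shape <- sin(seq(0, 4 * pi, length.out = n))  # shared underlying pattern

# Hypothetical wide-format KPI series, one column per market
wide <- data.frame(
  UK = 100 + 10 * shape + rnorm(n),   # test market
  FR = 50 + 5 * shape + rnorm(n),     # half the size, same shape
  DE = 100 + rnorm(n, sd = 10)        # similar size, unrelated shape
)

# Correlation of each candidate with the test market, best first
candidates <- c("FR", "DE")
cors <- sapply(wide[candidates], function(x) cor(wide$UK, x))
ranked <- sort(cors, decreasing = TRUE)
print(round(ranked, 2))
```

FR ranks first even though it is half the size of UK, because correlation ignores level and scale. A distance-based measure would favour markets of similar magnitude instead.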

sssiv93 commented 1 year ago

Thanks Kim. I will take your advice to avoid a full weighting of dynamic warping.

If you are still able to look into the dataset when you have some time, that would be greatly appreciated. The dataset is attached below, with the code following.

mm_data.xlsx

install.packages("readxl")
library("readxl")
library(MarketMatching)
data <- read_excel("mm_data.xlsx")

data$dates <- as.Date(data$dates)

# Problem 1 - no valid data in the post period
mm <- best_matches(data=data,
                   id_variable="regions",
                   date_variable="dates",
                   markets_to_be_matched="BRANDA_UK",
                   matching_variable="kpi",
                   parallel=FALSE,
                   warping_limit=1,
                   dtw_emphasis=1,
                   matches=15,
                   start_match_period="2022-06-05",
                   end_match_period="2022-08-07")

mm$BestMatches

results <- MarketMatching::inference(matched_markets = mm,
                                     test_market = "BRANDA_UK",
                                     end_post_period = "2022-10-23",
                                     alpha = 0.05,
                                     prior_level_sd = 0.01,
                                     nseasons=7)

# Problem 2 - requesting 5 matches gives no matches, and requesting 15 matches gives 5 matches
mm <- best_matches(data=data,
                   id_variable="regions",
                   date_variable="dates",
                   markets_to_be_matched="BRANDA_UK",
                   matching_variable="kpi",
                   parallel=FALSE,
                   warping_limit=1,
                   dtw_emphasis=1,
                   matches=5,
                   start_match_period="2022-06-05",
                   end_match_period="2022-08-07")

mm$BestMatches # No matches returned

mm <- best_matches(data=data,
                   id_variable="regions",
                   date_variable="dates",
                   markets_to_be_matched="BRANDA_UK",
                   matching_variable="kpi",
                   parallel=FALSE,
                   warping_limit=1,
                   dtw_emphasis=1,
                   matches=15,
                   start_match_period="2022-06-05",
                   end_match_period="2022-08-07")

mm$BestMatches # 5 matches returned

klarsen1 commented 10 months ago

By the way, I happened to look into this much later.

This helps:

data <- read_excel("mm_data.xlsx") %>% mutate(date=as.Date(dates))

Looking into why dtw won't return more than 5 matches -- but I really don't recommend dtw to be honest.
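
The date conversion matters because spreadsheet imports typically return date cells as timestamps rather than plain dates. The base-R sketch below shows the difference. The `stamps` vector is a hypothetical stand-in for an imported column, not the contents of mm_data.xlsx.

```r
# Hypothetical timestamps standing in for an imported date column
stamps <- as.POSIXct(c("2022-08-07", "2022-08-14"), tz = "UTC")
print(class(stamps))                 # a date-time class, not "Date"
print(format(stamps, usetz = TRUE))  # values carry a time zone suffix

# Convert to plain calendar dates before passing them on
plain_dates <- as.Date(stamps)
print(class(plain_dates))

# Comparisons against a period boundary now behave as calendar comparisons
in_post <- plain_dates > as.Date("2022-08-07")
print(in_post)
```

Checking `class()` on the date column before calling best_matches is a cheap way to rule this out.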

SeanRichterWalsh commented 10 months ago

So the preferred matching approach is to fully use correlation rather than distance? I had wondered about that as even though size might differ, two markets could have a strong correlation and be well matched. However, matching on distance is good if you want markets to have similar sizes. Is my understanding correct here?

Also, out of curiosity is the correlation done on the time series once trend and/or seasonality are removed? By differencing first, for example.

Thanks.

klarsen1 commented 10 months ago

Good question.

My statement came across a bit strongly.

In most cases, I think correlation is just fine -- especially as you're aggregating markets. I'm less worried about size. And the BSTS model will apply weights anyway to each market. So DTW seems like overkill to some degree.

K

SeanRichterWalsh commented 10 months ago

Yes, that makes sense. Thanks.

klarsen1 commented 10 months ago

Also, try devtools::install_github("klarsen1/MarketMatching"). The new version will deal with date variables with a suffix (like UTC) and tell you how many records you need.