seunglee98 / fedmatch

Unexpected results #25

Open systemnova opened 2 years ago

systemnova commented 2 years ago

I'm getting a few unexpected behaviours from tier_match using the code below. The results are fine for the first two tiers, but the later tiers seem to ignore both the 'top = 1' and the takeout = "both" settings: the tier called 'd_multi' returns many more matches than there are rows. I also have a feeling I'm not using the takeout argument correctly. The documentation indicates that it takes a character vector, but I can't get it to work with a vector of length > 1. Is there a way to control how the takeouts work for each tier? Thanks for making such a powerful & useful set of functions!

# Settings
settings <- list(
    source_dfx = df1,  # NB: all columns must be character; for numeric columns, switch off exponential notation first.
    target_dfy = df2,  # NB: all columns must be character; for numeric columns, switch off exponential notation first.
    uniquex = "index1", uniquey = "index2",
    matchx = c("Var1", "Var2", "Var3", "Var4", "Var5"),
    matchy = c("Var1", "Var2", "Var3", "Var4", "Var5"),
    weights = c(1, 1, 1, 1, 1),
    compare = c("stringdist", "stringdist", "difference", "stringdist", "stringdist"),
    takeout = as.character(c("both", "both", "both", "both", "both")) 
  )

# Tiers
tier_list_v1 <- list(
  a_exact = build_tier(match_type = "exact", allow.cartesian = F, by.x = settings$matchx[1], by.y = settings$matchy[1]), #ON NAMES

  b_fuzzy = build_tier(match_type = "fuzzy", allow.cartesian = F, by.x = settings$matchx[1], by.y = settings$matchy[1], #ON NAMES
                     fuzzy_settings = build_fuzzy_settings( method = "wgt_jaccard", nthread = 2, maxDist = .3, matchNA = F)),

  c_fuzzy = build_tier(match_type = "fuzzy", allow.cartesian = F, by.x = settings$matchx[1], by.y = settings$matchy[1], #ON NAMES
                       fuzzy_settings = build_fuzzy_settings(method = "wgt_jaccard", nthread = 2, maxDist = .5, matchNA = F)),

  d_multi = build_tier(match_type = "multivar", allow.cartesian = F, by.x = settings$matchx, by.y = settings$matchy,
                 multivar_settings = build_multivar_settings(missing = F, wgts = settings$weights, compare_type = settings$compare,
                                                             threshold = .9, top = 1)),

  e_multi = build_tier(match_type = "multivar", allow.cartesian = F,  by.x = settings$matchx, by.y = settings$matchy,
                 multivar_settings = build_multivar_settings(missing = FALSE, wgts = settings$weights, compare_type = settings$compare,
                                                             threshold = NULL, top = 1))
)

# Apply Match
result <- tier_match(settings$source_dfx, settings$target_dfy,
                     unique_key_1 = settings$uniquex, unique_key_2 = settings$uniquey,
                     tiers = tier_list_v1, verbose = T, clean = T, #Manually clean above
                     takeout = settings$takeout[1], allow.cartesian = F
                     #, score_settings = build_score_settings(score_var_x = settings$matchx, score_var_y = settings$matchy, wgts = settings$weights, score_type = settings$compare)
)

c0webster commented 2 years ago

Thanks for this! The "takeout" setting is not vectorized; it only takes a vector of length 1 (I will make a note to add this to the documentation.) You can only have it take out matches (or not) for the entire tier_match, not each tier.

To do what you want, you'll simply need to run a few different tier_match calls, each with different settings for takeout.

systemnova commented 2 years ago

Thanks Chris, I will split up the matching into a few tier_match calls. However, separately from the non-vectorised takeout, takeout = "both" does not seem to be applied consistently across all tiers, even when passed as a single value. For example, the following match evaluation result was produced with takeout = "both". It seems to work for the first three tiers, but it doesn't remove results in tier 4, which has 150 matches. Potentially this is because tier 4 switches to a multivar match after two fuzzy match tiers?

[image: match evaluation result]

Thanks again for creating such an awesome package.

c0webster commented 2 years ago

Ah, I see! Thanks for pointing this out. I am unfortunately headed out for vacation for a few weeks, so I won't be able to dive into this for a while. I am guessing that you're simply correct, and that it doesn't correctly pull out matches from multivar. For now, you'll just need to continue breaking up the steps. So you would run your first three tiers in one tier_match call, then run a line to remove those ids from the data, then do a merge_plus for each multivar match that you want.

I apologize for this bug, and appreciate your help figuring it out.
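
For anyone following the workaround above, the "run a line to remove those ids from the data" step can be sketched in base R. This is a minimal illustration, not the package's own method: the toy data frames and the `early_matches` table are hypothetical stand-ins, and it assumes the matches from the first three tiers are available as a table carrying both key columns (`index1` and `index2`, as in the original settings).

```r
# Toy stand-ins for the real inputs (hypothetical data, not from the issue).
df1 <- data.frame(index1 = c("x1", "x2", "x3"),
                  Var1   = c("acme corp", "beta llc", "gamma inc"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(index2 = c("y1", "y2", "y3"),
                  Var1   = c("acme corp", "beta l.l.c.", "delta plc"),
                  stringsAsFactors = FALSE)

# Stand-in for the pairs matched by the first three tiers.
# Assumption: the real matches table holds both key columns.
early_matches <- data.frame(index1 = c("x1", "x2"),
                            index2 = c("y1", "y2"),
                            stringsAsFactors = FALSE)

# Keep only rows whose key has not been matched yet (a manual takeout = "both").
df1_left <- df1[!(df1$index1 %in% early_matches$index1), , drop = FALSE]
df2_left <- df2[!(df2$index2 %in% early_matches$index2), , drop = FALSE]
```

The remaining `df1_left` and `df2_left` would then be the inputs to the separate multivar step. Filtering only `df1` or only `df2` would mimic taking out matches from one side rather than both.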