Closed by benthestatistician 4 years ago
Hi, I have a related question. Suppose I have a larger matching problem (say 5000 x 5000), and I'm trying to force a 1-1 matching. For reference, I start with a data frame that has 20 predictor variables (columns) in addition to 1 column indicating the observation grouping (1/0 for test/control units to be paired).
The key thing here is that I want to convert this to a sparse matching problem, since many combinations/pairs are already known to be invalid from prior distance-neighbor estimations (k-nearest-neighbor graphs on the whole observation space, or similar). This is why I was drawn to trying your package, since it seemed to allow exactly what I needed.
I tried clustering all observations from both test and control groups, and then using this clustering variable as a predictor as shown in your vignette example:
tmp <- exactMatch(Group ~ Cluster, data = class.df)
Here, class.df just has the Group indicator, and the cluster number/membership (hoping that this will only do matching within each cluster).
I then use this to compute distances, hoping it will no longer take as long as it would without the added constraint, using the `within` parameter specification:
distances$exactMahal <- match_on(Group ~ ., data = my.df, within = tmp)
But for some reason, even when doing this, the match_on call shown above takes very long, I then get a warning about the limit on the size of the matching problem, and I don't think it is using a sparse matrix at all (maybe I'm not using this correctly?).
Is there a way to ensure it only considers the matches I tell it to consider, so that distances for invalid pairs (which are computationally and time intensive) aren't calculated? The goal is to do 1-1 matching for the entire dataset, but also to do it quickly, so that distances between invalid pairs are never computed, saving a lot of time.
Also, what does it mean / what do you do when you have NA pairs after calling `pairmatch`?
Sorry if this wasn't clear enough; I appreciate any input you have.
Hi @vkartha , I'll respond very quickly to the easy question at the end. In the factor variable returned by `pairmatch()`, an `NA` indicates a unit that was not matched. Units that were placed into matched sets are assigned a factor level specific to their matched set.
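As a sketch of what that looks like (the factor values and unit names below are made up for illustration, not taken from an actual `pairmatch()` run):

```r
# Made-up illustration of a pairmatch()-style result: pm is a factor
# whose names are unit labels; matched units share a set-specific
# level, and unmatched units are NA.
pm <- factor(c("1.1", "1.1", "1.2", NA, "1.2", NA))
names(pm) <- paste0("unit", 1:6)

table(pm, useNA = "ifany")   # size of each matched set, plus the NA count
names(pm)[is.na(pm)]         # the unmatched units: "unit4" "unit6"
split(names(pm), pm)         # members of each matched set
```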
This doesn't yet address the 2 bigger questions:

1. Is the `within=` argument to `match_on()` in fact being ignored?
2. […] `Cluster`?

My guesses are that the answers are Yes and Yes, but I'm not 100% sure. I think you can disambiguate question 1 by seeing how many potential matches ("arcs") are allowed by your `tmp`. Pretty sure you'll get this from `length(tmp)`. Then compare this number to the number in your warning about the size of the matching problem.
For (2), a partial workaround would be to replace your line defining `distances$exactMahal` with:
distances$exactMahal <- match_on(Group ~ ., data = my.df) + tmp
In this solution, optmatch still calculates all of the Mahalanobis distances, irrespective of `Cluster`, and stores them temporarily; but then it immediately cuts things down to just those distances respecting the boundaries of `Cluster`, freeing up memory for you. Importantly, you only present the sparse distance to the min-cost flow solver. A still better workaround would avoid the intermediate calculation of unneeded distances; hopefully, if one of my collaborators has one of those, they'll pipe up.
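Put together, the workaround might look like the following sketch. The data here are toy stand-ins (`Group`, `Cluster`, `x1`, `x2` are placeholder names), and it assumes the documented optmatch behavior that adding an `exactMatch()` result to a distance restricts it to within-stratum pairs:

```r
library(optmatch)

# Toy stand-in for my.df: a treated/control indicator, two
# predictors, and a cluster-membership column.
set.seed(1)
my.df <- data.frame(Group   = rep(c(1, 0), each = 20),
                    x1      = rnorm(40),
                    x2      = rnorm(40),
                    Cluster = factor(sample(1:4, 40, replace = TRUE)))

# Dense Mahalanobis distances, computed irrespective of Cluster...
dense <- match_on(Group ~ x1 + x2, data = my.df)

# ...then immediately restricted to within-Cluster pairs: adding the
# exactMatch() result drops every cross-cluster distance, leaving a
# sparse object for the min-cost flow solver.
tmp    <- exactMatch(Group ~ Cluster, data = my.df)
sparse <- dense + tmp

length(sparse)                       # number of arcs actually retained
pm <- pairmatch(sparse, data = my.df)
```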
Thanks for your interest in the package!
Thank you for such a prompt and clear response! I will try your suggestion.
Sorry for the follow-up question, but in working towards an alternative solution: if I pre-compute a distance matrix myself (a test x control distance matrix, specifically made "sparse" by forcing certain invalid pairs to Inf), can I provide this matrix directly as input to pairmatch somehow? In previous attempts, I got an error saying pairmatch expects any valid input to match_on; does that mean I cannot bypass the match_on step if I try it this way? I'm assuming there is a way to do this, given that the output of match_on itself contains such a dense/sparse matrix (just not of the same object class). To be clear, I wish to use the more commonly used sparseMatrix-like format (from the Matrix package) directly as input to pairmatch.
Also, sorry, I forgot to mention this in my first message when bringing up the issue: thank you so much for developing this package in the first place (regardless of whether I can use it in the end), and for documenting it to the extent you have!
Re conversion from Matrix package formats, please see #188.
In the way of workarounds, if you're willing to store your sparse matrix densely, then you can do the following.

1. Store the distances as a dense matrix, with `Inf`s for forbidden pairings.
2. Convert it with `optmatch::as.InfinitySparseMatrix()`.

At step 2 the `Inf`s get taken out again, and you once again have a sparse representation. (Indeed, code you use to implement this could be handy for testing the functionality proposed in #188: the combination of these 2 steps might give a nice test for a function you write to perform this conversion without the intermediate dense-matrix step.)
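A minimal sketch of those two steps (the unit names and distances below are made up, and this assumes optmatch's exported `as.InfinitySparseMatrix()` converter):

```r
library(optmatch)

# Step 1: a dense treated x control distance matrix, with Inf marking
# the forbidden pairings.
distMat <- matrix(c(1.0, Inf, 0.5,
                    Inf, 2.0, 1.5),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("t1", "t2"),         # treated units
                                  c("c1", "c2", "c3")))  # controls

# Step 2: convert; the Inf entries are dropped, so only the four
# finite distances are carried forward in the sparse representation.
ism <- as.InfinitySparseMatrix(distMat)
length(ism)   # count of finite arcs remaining
```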
You'll want to make sure that you've got row and column names associated with the matrix you start with, and that they correspond with the rownames of the data frame that you pass to `pairmatch()` in its `data=` argument. (Or, if you're not using a data frame with rownames, to a character vector of unit names that you pass as that `data=` argument.)
Thanks again, your suggestion seemed to help me go in the right direction. I basically made my own Euclidean distance matrix `distMat`, and assigned `Inf` values at specific indices in the matrix based on my own pair constraints, as in the example I gave you above. This matrix does indeed have row and column names identifying the observation labels.
I then converted this to an `InfinitySparseMatrix` using the function, as you suggested, and directly ran `pairmatch` as follows:
pairmatch(as.InfinitySparseMatrix(as.matrix(distMat)))
As for the `data` input for the above line: I re-sort the factors returned by `pairmatch`, ensuring they are in correct "tuples" (i.e. every two entries corresponding to a given pair), and then fetch the corresponding name variables to make a data frame of pairs (it's OK if the order is not the same as in the input matrix).
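As a sketch of that re-sorting step (with made-up unit names; `pm` stands for the factor returned by `pairmatch`):

```r
# pm stands in for the factor returned by pairmatch(); for 1-1
# matching, each non-NA level appears exactly twice.
pm <- factor(c("1.1", "1.2", NA, "1.1", "1.2"))
names(pm) <- c("t1", "t2", "t3", "c1", "c2")

matched <- pm[!is.na(pm)]
pairs   <- split(names(matched), matched)   # one 2-element tuple per set
pair.df <- do.call(rbind,
                   lapply(pairs, function(p) data.frame(a = p[1], b = p[2])))
pair.df   # one row per pair: t1/c1 and t2/c2
```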
Do you approve of this implementation? It definitely seems to run faster now, even for large input dimensions. Again, thank you for your prompt replies and help with implementing this package!
Good to hear you have a working solution! It makes sense to me, and if the intermediate conversion to a dense matrix isn't hogging all of your available memory, then I understand if you won't at this point be contributing code for a direct conversion without that intermediary. However, if you'd be willing, please record over on the #188 thread the class of your `distMat` (available via `is(distMat)`).
3bac75b has an example. I expect a solution is just to apply `caliper()` to the `within=` argument, then add it together with the matrix or ISM before returning.