Open bengoehring opened 1 year ago
Thanks so much for sharing this with us! We will try your function but feel free to make a pull request.
In blockData, we use a function that targets the unique pairs of values rather than the unique values, which is why we end up with more windows. However, blocking operations are linear, not quadratic, so your approach is better.
We are revising many of our functions and I will make sure this issue is addressed in our next release.
As always, if anything, do not hesitate to let us know.
Ted
Hello,
Thank you for making and maintaining such a helpful package.
I am reaching out with a conceptual question about the window blocking option in blockData --- and a possible performance improvement suggestion. This all stems from trying and failing to window block a dataset with a few million rows against a dataset with about 10 million rows using a cluster with 10 cores and 180GB of RAM. The job timed out after 24 hours.
Based on the documentation of window blocking (i.e., "a given observation in dataset A will be compared to all observations in dataset B where the value of the blocking variable is within ±K of the value of the same variable in dataset A"), I would expect the window blocking option to return a list of N lists --- where N refers to the number of unique values of the window blocking variable in dataset A. Each of the N lists will then contain two vectors of indices. The first vector will include the indices of dataset A where the window blocking variable equals the nth unique value of the window blocking variable in dataset A. The second vector will include the indices of dataset B where the window blocking variable is within ±K of the nth unique value of the window blocking variable in dataset A.
I hope that makes sense.
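The expected behavior described above can be sketched as follows. This is only an illustration of the logic, written in Python rather than R, and it is not the package's implementation: the function name `window_blocks` and its arguments are hypothetical, and indices are 0-based here.

```python
from bisect import bisect_left, bisect_right


def window_blocks(values_a, values_b, k):
    """Return one block per unique value of the blocking variable in A.

    Each block is a pair (indices_a, indices_b):
      indices_a: rows of A whose value equals the unique value
      indices_b: rows of B whose value is within +/- k of it
    Hypothetical helper for illustration; indices are 0-based.
    """
    # Group the row indices of A by their blocking value.
    rows_by_value = {}
    for i, v in enumerate(values_a):
        rows_by_value.setdefault(v, []).append(i)

    # Sort B once so each window can be located by binary search.
    order_b = sorted(range(len(values_b)), key=lambda j: values_b[j])
    sorted_b = [values_b[j] for j in order_b]

    blocks = []
    for v in sorted(rows_by_value):
        lo = bisect_left(sorted_b, v - k)   # first B value >= v - k
        hi = bisect_right(sorted_b, v + k)  # one past last B value <= v + k
        blocks.append((rows_by_value[v], sorted(order_b[lo:hi])))
    return blocks


# Small made-up example: A has two unique values, so two blocks come back.
blocks = window_blocks([1950, 1950, 1960], [1949, 1955, 1961, 1970], k=1)
print(len(blocks))
```

Under this logic the number of blocks always equals the number of unique values in dataset A, and a block is kept even when no row of B falls inside its window.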
It appears, however, that the window blocking option is doing something different. For instance, if I run:
The number of separate blocks (3390) is much higher than I would expect (184). Would you be able to expand upon where I am misunderstanding?
I am guessing I am just misunderstanding something, but if the logic above is (miraculously) correct, I went ahead and implemented it in a separate function. It appears that it outperforms the default window blocking option in terms of speed (~50 times faster in this example). Please just let me know if I am onto something and you would like me to submit a pull request. My apologies if I am totally off base with this!!
Thank you for your time and all of your hard work maintaining this great package. It is much appreciated.
Best, Ben