xhkong opened 5 years ago
For reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html
I can only see this working with a sort-based join, FYI.
@vyasr, might the recent conditional join / AST work perhaps be relevant for this issue? Copying the summary from the pandas docs linked above:
Perform an asof merge.
This is similar to a left-join except that we match on nearest key rather than equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
- A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
- A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.
- A "nearest" search selects the row in the right DataFrame whose 'on' key is closest in absolute distance to the left's key.
The default is “backward” and is compatible in versions below 0.20.0. The direction parameter was added in version 0.20.0 and introduces “forward” and “nearest”.
Optionally match on equivalent keys with ‘by’ before searching with ‘on’.
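For concreteness, here's a minimal pandas example of the semantics being requested (toy data, the default `direction="backward"` plus a `by` key):

```python
import pandas as pd

# Toy data: both frames must already be sorted by the "on" key ("time").
trades = pd.DataFrame({"time": [1, 5, 10],
                       "ticker": ["A", "A", "B"],
                       "price": [100.0, 101.0, 50.0]})
quotes = pd.DataFrame({"time": [2, 3, 7, 8],
                       "ticker": ["A", "A", "B", "B"],
                       "bid": [99.0, 100.5, 49.0, 49.5]})

# Backward (default) search: for each trade, take the last quote whose time is
# <= the trade's time, after first matching "ticker" exactly via `by`.
result = pd.merge_asof(trades, quotes, on="time", by="ticker", direction="backward")
print(result)
# The first trade (time=1, ticker=A) has no earlier quote, so its "bid" is NaN.
```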
This could not be implemented with the AST as it requires storing state (i.e., the "closest so far" value). I'm not sure how you would implement this in a way that is not terrible.
The kernel computing conditional joins currently uses a 1D grid to parallelize only over the rows of the left table. We probably want to retain the flexibility to change the grid layout in the future if we find a more performant approach for conditional joins, but a slightly modified version of the current kernel that stores the "closest so far" value in a kernel-local variable should work for this use case, right? Note that the pandas API specifically requires that both DataFrames are sorted to begin with. I'm imagining something like the following (in very rough pseudocode):
```
join_index = SENTINEL
finished = false
for row in right:
    if not finished and condition(row):
        join_index = index(row)
    else:
        finished = true
if join_index == SENTINEL:
    handle_no_join()
else:
    add_pair_to_cache(left_row, join_index)
```
with `condition = ast_operator::GREATER` for backwards and `condition = ast_operator::LESS` for forwards. `nearest` would require a little extra logic, using `ast_operator::GREATER` but then doing a comparison of two values the first time the condition is False.
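For what it's worth, here is a host-side Python sketch of that per-left-row scan (the function name and structure are just illustrative; the real kernel would operate on device columns within the existing grid layout):

```python
SENTINEL = -1  # stands in for the sentinel used by the join kernels

def asof_index(left_key, right_keys, direction="backward"):
    """Return the matching right-row index for one left key.
    `right_keys` must be sorted ascending, mirroring the pseudocode above."""
    join_index = SENTINEL
    for i, rk in enumerate(right_keys):
        if direction == "forward":
            if rk >= left_key:          # condition ~ ast_operator::LESS(_EQUAL)
                return i                # first satisfying row wins going forward
        elif rk <= left_key:            # condition ~ ast_operator::GREATER(_EQUAL)
            join_index = i              # keep the "closest so far" row
        else:
            # The condition is False for the first time; for "nearest" compare
            # the first row past left_key against the closest-so-far row.
            if direction == "nearest" and (
                join_index == SENTINEL
                or rk - left_key < left_key - right_keys[join_index]
            ):
                join_index = i
            break                       # keys are sorted, so we can stop early
    return join_index
```

An exhaustive scan like this is O(len(right)) per left row; the sorted inputs are what make an early exit (or a binary search) possible.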
Quick update: my interest in this issue has grown since I started researching sort-based join algorithms such as the "inequality_join" in DuckDB, "asof join" in polars, and "merge_asof" in pandas.
DuckDB has officially added "AS OF" joins as of the v0.8.0 release (pun intended).
Both Spark and DuckDB implement "ASOF" join using slightly different translations to operators that I think both CUDF and Dask already support. These translations allow the processing to be mostly distributed, which is really nice.
Spark's translation will do a join followed by an aggregation. In this, `MIN_BY` is essentially an `ARGMIN` aggregation followed by a gather using the index returned on the first column passed to `MIN_BY`. The problem here is that the join will likely explode. They use the tolerance from pandas to reduce the window (Spark only supports this for their pandas compatibility layer currently).
I have not tried it but want to. We have not been looking at it in depth because it is only for pandas compatibility right now.
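To make that translation concrete, here is a rough pandas sketch of the join-then-argmin shape it describes (this is not Spark's actual plan; the function, the `_left_row` / `_right_on` helper columns, and the assumptions that `by` is a list of column names and that payload column names don't collide are all made up for the example):

```python
import pandas as pd

def asof_backward_via_join(left, right, on, by, tolerance=None):
    """Backward asof match as an (exploding) equality join on `by` followed by
    an argmin-style aggregation, roughly the Spark-style translation above."""
    l = left.reset_index().rename(columns={"index": "_left_row"})  # assumes a RangeIndex
    r = right.rename(columns={on: "_right_on"})
    joined = l.merge(r, on=by)                         # this join can explode badly
    joined = joined[joined["_right_on"] <= joined[on]]
    if tolerance is not None:                          # tolerance shrinks the window
        joined = joined[joined[on] - joined["_right_on"] <= tolerance]
    # MIN_BY / ARGMIN step: per left row, keep the right row whose key is the
    # largest one still <= the left key, then gather it back onto the left rows.
    best = joined.loc[joined.groupby("_left_row")["_right_on"].idxmax()]
    right_cols = ["_left_row"] + [c for c in r.columns if c not in by]
    return (l.merge(best[right_cols], on="_left_row", how="left")
             .drop(columns="_left_row"))
```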
Whereas DuckDB appears to be doing a lead of 1 in a window operation to get a min/max value, but the default for the last value in the lead is not null, it is infinity, so that they can get the proper range.
And then DuckDB does a conditional join bounding the left-hand-side key by the values in that range.
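And a similarly rough pandas sketch of that lead-plus-range-join shape (a `shift(-1)` stands in for the lead window function, and a filtered cross join stands in for the conditional join; column names like `_next_on` are invented for the example, and unmatched left rows simply drop out here):

```python
import numpy as np
import pandas as pd

def asof_backward_via_range_join(left, right, on):
    """Backward asof match as a lead-of-1 window op over the sorted right keys
    followed by a conditional (range) join, roughly the DuckDB-style shape above."""
    r = right.sort_values(on).reset_index(drop=True)
    # Lead of 1 over the right key; the last row gets +inf rather than null so
    # that its range stays open-ended.
    r["_next_on"] = r[on].shift(-1).fillna(np.inf)
    # Conditional join: keep pairs where the left key falls in [key, next key).
    pairs = left.merge(r, how="cross", suffixes=("", "_right"))
    pairs = pairs[(pairs[f"{on}_right"] <= pairs[on]) & (pairs[on] < pairs["_next_on"])]
    return pairs.drop(columns="_next_on")
```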
The problem here is that if the asof join does not include any equality operations, the window operation is likely to require all of the data to go to a single task (at least when doing this the way Spark does it; I'm not sure about DuckDB or Dask).
Both of these implementations are likely to require a cross join if there are no equality operations in the join condition. I don't think that is very likely (I think the DuckDB example is bad), but I do think that there are ways that we can make it much better if we need to.
This feature is still of interest for libcudf, and we may choose a segmented sort-based join that uses binary search to locate correct matches.
Looking at the pandas API for `merge_asof`, there are a few key arguments that our algorithm should support:

- `on`: references the numeric column that is used to find closest matches
- `by`: references one or more columns that must be equal in `right` and `left` before searching the `on` values
- `direction`: "forward", "backward" or "nearest" defines how to match the `on` values
- `tolerance`: don't match if the closest `on` values are too far apart
- `allow_exact_matches`: whether to match when `on` values are `==` or only find the closest non-equal

Possible primitives needed:
- Device-callable binary search that can work with custom functors. This will allow us to interface implementations of `backward`/`forward`/`nearest` with template dispatch to support `tolerance`/`nearest` for primitive types.
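As a sanity check of the binary-search direction, here is a NumPy sketch of the per-segment search covering `direction`, `tolerance`, and `allow_exact_matches` (argument names follow the pandas ones above; everything else is hypothetical, and a `by` grouping would just run this once per matching group):

```python
import numpy as np

SENTINEL = -1

def asof_indices(left_on, right_on, direction="backward",
                 tolerance=None, allow_exact_matches=True):
    """For each value in `left_on`, find the index of the matching value in the
    sorted `right_on` array via binary search; SENTINEL means no match."""
    left_on = np.asarray(left_on)
    right_on = np.asarray(right_on)

    # Index of the last right value <= (or <) each left value.
    back = np.searchsorted(right_on, left_on,
                           side="right" if allow_exact_matches else "left") - 1
    # Index of the first right value >= (or >) each left value.
    fwd = np.searchsorted(right_on, left_on,
                          side="left" if allow_exact_matches else "right")

    if direction == "backward":
        out = back                                    # -1 already equals SENTINEL
    elif direction == "forward":
        out = np.where(fwd < len(right_on), fwd, SENTINEL)
    else:  # "nearest": take whichever candidate is closer in absolute distance
        back_ok = back >= 0
        fwd_ok = fwd < len(right_on)
        back_dist = np.where(back_ok,
                             np.abs(left_on - right_on[np.maximum(back, 0)]), np.inf)
        fwd_dist = np.where(fwd_ok,
                            np.abs(right_on[np.minimum(fwd, len(right_on) - 1)] - left_on),
                            np.inf)
        out = np.where(back_dist <= fwd_dist, back, fwd)
        out = np.where(back_ok | fwd_ok, out, SENTINEL)

    if tolerance is not None:
        safe = np.clip(out, 0, len(right_on) - 1)     # avoid indexing with SENTINEL
        dist = np.abs(left_on - right_on[safe])
        out = np.where((out != SENTINEL) & (dist <= tolerance), out, SENTINEL)
    return out
```

On the device side, the same lower/upper-bound searches could map to per-segment `thrust::lower_bound` / `thrust::upper_bound` calls, with the segments defined by the `by` keys.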
Please add `merge_asof` to cuDF to match pandas `merge_asof` capabilities. Thanks!