geoHeil closed 5 years ago
leftJoin takes a "key" argument that allows you to specify the secondary join key (equality only)
Thanks.
One further question: when using a rather large interval (i.e. multiple values from the right fall into the interval from the left), the join will only join the first record, which means the distinct I previously used is not required.
leftTs.leftJoin(rightTS, tolerance = "1s", key = Seq("group")).toDF.show
+-------+-----+------+------+
| time|group|valueA|valueB|
+-------+-----+------+------+
|1000000| 1| 0.1| 11|
|1000000| 3| 0.3| 13|
|2000000| 1| 0.2| 12|
|2000000| 3| 0.4| 14|
+-------+-----+------+------+
This is looking better.
leftTs.leftJoin(rightTS, tolerance = "1hour", key = Seq("group")).toDF.show
+-------+-----+------+------+
| time|group|valueA|valueB|
+-------+-----+------+------+
|1000000| 1| 0.1| 11|
|1000000| 3| 0.3| 13|
|2000000| 1| 0.2| 12|
|2000000| 3| 0.4| 14|
+-------+-----+------+------+
I believe this is also stated in the documentation:
leftJoin A function performs the temporal left-join to the right TimeSeriesRDD, i.e. left-join using inexact timestamp matches. For each row in the left, append the most recent row from the right at or before the same time. An example to join two TimeSeriesRDDs is as follows.
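To make the quoted semantics concrete, here is a minimal plain-Scala sketch of a keyed as-of left join. It is not Flint's implementation; the case classes, the `asOfLeftJoin` helper, the abstract `Long` time units, and the inclusive tolerance bound are all assumptions made for illustration. It shows why a large tolerance still yields one output row per left row: only the most recent matching right row is kept.

```scala
// Illustrative only: hypothetical types and helper, not part of Flint.
final case class LeftRow(time: Long, group: Int, valueA: Double)
final case class RightRow(time: Long, group: Int, valueB: Int)
final case class Joined(time: Long, group: Int, valueA: Double, valueB: Option[Int])

object AsOfJoinSketch {
  // For each left row, pick the most recent right row with the same group
  // whose time is at or before the left time and within `tolerance`
  // (the inclusive bound is an assumption of this sketch).
  def asOfLeftJoin(left: Seq[LeftRow], right: Seq[RightRow], tolerance: Long): Seq[Joined] = {
    // Group the right side by the secondary key so matching happens per group.
    val rightByGroup: Map[Int, Seq[RightRow]] = right.groupBy(_.group)
    left.map { l =>
      val candidates = rightByGroup
        .getOrElse(l.group, Seq.empty)
        .filter(r => r.time <= l.time && l.time - r.time <= tolerance)
      // Keep a single candidate, so every left row produces exactly one output row.
      val best = if (candidates.isEmpty) None else Some(candidates.maxBy(_.time))
      Joined(l.time, l.group, l.valueA, best.map(_.valueB))
    }
  }
}
```

With this sketch, widening the tolerance can change which right row is picked for a left row that had no close match, but it never multiplies rows, so no distinct step is needed afterwards.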
How can I join not only by time but also by a column?
Currently, I get:
Found duplicate columns
but I would like to perform the time series join per group. The join fails due to the duplicate columns.
When renaming the columns:
A cross join is performed between each group and time series, which then needs to be reduced manually.
Is there any functionality to perform this type of join more efficiently / built in?