twosigma / flint

A Time Series Library for Apache Spark
Apache License 2.0

"joinLeft" returns dataframe that smaller than left_df #78

Open GBJim opened 4 years ago

GBJim commented 4 years ago

Hi all:

I found that leftJoin produces a dataframe that is smaller than the left dataframe:

[In] [1]:  joined_flint = left_flint.leftJoin(right_flint, tolerance=tolerance, key=by)  
[In] [2]:  print (joined_flint.count() < left_flint.count())
True

I consider this an incorrect result, since a left join should not drop any rows from the left table. Any explanation or suggestion?

placeybordeaux commented 3 years ago

leftJoin doesn't have the same semantics as a SQL left join.

I thought this was really confusing as well.

From the docs for futureLeftJoin: "A function performs the temporal future left-join to the right TimeSeriesRDD, i.e. left-join using inexact timestamp matches. For each row in the left, appends the closest future row from the right at or after the same time."

This means that if no row with a matching key falls within the tolerance, it won't return any row for that left row.
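
To make the difference concrete, here is a minimal pure-Python sketch of a temporal join with a tolerance. It is not Flint's implementation and does not use Flint or Spark; the function `asof_left_join` and its `keep_unmatched` flag are hypothetical names made up for illustration. It assumes a backward-looking match (latest right row at or before the left row's time), and the only point is to show how dropping unmatched left rows, as described above, differs from SQL-style left-join padding.

```python
from bisect import bisect_right


def asof_left_join(left, right, tolerance, keep_unmatched):
    """Join each left row to the latest right row with the same key whose
    time is at or before the left row's time and within `tolerance`.

    left, right: lists of (time, key, value) tuples, sorted by time.
    keep_unmatched: True  -> SQL-style left join (unmatched rows padded with None)
                    False -> unmatched left rows are dropped
    """
    # Group right rows by key, preserving time order.
    by_key = {}
    for time, key, value in right:
        times, values = by_key.setdefault(key, ([], []))
        times.append(time)
        values.append(value)

    joined = []
    for time, key, value in left:
        times, values = by_key.get(key, ([], []))
        # Index of the latest right row at or before `time`, or -1 if none.
        i = bisect_right(times, time) - 1
        if i >= 0 and time - times[i] <= tolerance:
            joined.append((time, key, value, values[i]))
        elif keep_unmatched:
            joined.append((time, key, value, None))
    return joined


left = [(10, "a", 1.0), (20, "a", 2.0), (30, "b", 3.0)]
right = [(9, "a", 100.0), (12, "a", 200.0)]

sql_style = asof_left_join(left, right, tolerance=5, keep_unmatched=True)
dropping = asof_left_join(left, right, tolerance=5, keep_unmatched=False)

print(len(left), len(sql_style), len(dropping))
```

With this data only the first left row has a right row with the same key within the tolerance, so the dropping variant returns fewer rows than `left`, which is the symptom reported in this issue. Comparing the keys and timestamps of the joined result against the left table is one way to see which rows went missing.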