scicloj / tablecloth

Dataset manipulation library built on top of tech.ml.dataset
https://scicloj.github.io/tablecloth
MIT License

anti-join does not work as expected in presence of missing values #77

Closed. real-iceman closed this issue 2 years ago

real-iceman commented 2 years ago

If a value of the join column is present in both datasets of an anti-join, no rows with that value should appear in the result set. However, such rows are included if another column (of the left-join) contains a missing value.

Example from the documentation:

ds1:

| :a | :b | :c |
|---:|---:|----|
| 1 | 101 | a |
| 2 | 102 | b |
| 1 | 103 | s |
| 2 | 104 | |
| 3 | 105 | t |
| 4 | 106 | r |
| | 107 | a |
| | 108 | c |
| 4 | 109 | t |

ds2:

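| :a | :b | :c | :d | :e |
|---:|---:|----|----|---:|
| | 110 | d | X | 3 |
| 1 | 109 | a | X | 4 |
| 2 | 108 | t | X | 5 |
| 5 | 107 | a | X | 6 |
| 4 | 106 | t | X | 7 |
| 3 | 105 | a | X | |
| 2 | 104 | b | X | 8 |
| 1 | 103 | l | X | 1 |
| | 102 | e | X | 1 |

(tc/anti-join ds1 ds2 :b)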

anti-join [5 3]:

| :b | :a | :c |
|---:|---:|----|
| 108 | | c |
| 107 | | a |
| 105 | 3 | t |
| 102 | 2 | b |
| 101 | 1 | a |

I would expect

| :b | :a | :c |
|---:|---:|----|
| 101 | 1 | a |

because values 102, ..., 108 are all found in column :b of dataset ds2.

Or am I misunderstanding the nature of an anti-join?
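
For reference, the anti-join semantics expected above can be sketched in plain Clojure, without tablecloth. This is only an illustration: rows are modelled as plain maps, blank cells from the tables are written as `nil`, and `anti-join-rows` is a hypothetical helper, not part of the tablecloth API.

```clojure
;; Sketch only: plain maps stand in for dataset rows.
;; `anti-join-rows` is a hypothetical helper, not tablecloth's implementation.
;; Cells shown blank in the issue's tables are written as nil here.
(def ds1-rows
  [{:a 1   :b 101 :c "a"}
   {:a 2   :b 102 :c "b"}
   {:a 1   :b 103 :c "s"}
   {:a 2   :b 104 :c nil}
   {:a 3   :b 105 :c "t"}
   {:a 4   :b 106 :c "r"}
   {:a nil :b 107 :c "a"}
   {:a nil :b 108 :c "c"}
   {:a 4   :b 109 :c "t"}])

(def ds2-rows
  [{:a nil :b 110 :c "d" :d 'X :e 3}
   {:a 1   :b 109 :c "a" :d 'X :e 4}
   {:a 2   :b 108 :c "t" :d 'X :e 5}
   {:a 5   :b 107 :c "a" :d 'X :e 6}
   {:a 4   :b 106 :c "t" :d 'X :e 7}
   {:a 3   :b 105 :c "a" :d 'X :e nil}
   {:a 2   :b 104 :c "b" :d 'X :e 8}
   {:a 1   :b 103 :c "l" :d 'X :e 1}
   {:a nil :b 102 :c "e" :d 'X :e 1}])

(defn anti-join-rows
  "Rows of `left` whose value under key `k` does not occur under `k` in `right`.
  Only the join key is compared; other columns are ignored."
  [left right k]
  (let [right-keys (set (map k right))]
    (vec (remove #(contains? right-keys (k %)) left))))

(def anti-result (anti-join-rows ds1-rows ds2-rows :b))

(println anti-result)
```

Under this definition only the row with `:b` equal to 101 survives, because missing values in `:a`, `:c`, or `:e` play no part in the comparison.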

genmeblog commented 2 years ago

Oh, thanks, I will verify this today or tomorrow.

genmeblog commented 2 years ago

Yes, I can confirm, it's wrong (also semi-join). I don't know why I haven't verified it from the very beginning :/

Anyway, I'll prepare a patch soon.
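
Since semi-join is affected as well, here is a minimal sketch of the semantics one would expect from it, again in plain Clojure rather than tablecloth. It assumes the usual definition (keep left rows whose join key occurs in the right dataset); `semi-join-rows` is a hypothetical helper, and only the `:b` values from the issue's `ds1` and `ds2` are reproduced.

```clojure
;; Sketch only: `semi-join-rows` is a hypothetical helper, not tablecloth's API.
;; Only the :b values from the issue's ds1 and ds2 are reproduced here.
(def left-rows  (mapv (fn [b] {:b b}) (range 101 110)))    ; ds1 :b, 101..109
(def right-rows (mapv (fn [b] {:b b}) (range 110 101 -1))) ; ds2 :b, 110..102

(defn semi-join-rows
  "Rows of `left` whose value under key `k` also occurs under `k` in `right`."
  [left right k]
  (let [right-keys (set (map k right))]
    (filterv #(contains? right-keys (k %)) left)))

(def semi-result (semi-join-rows left-rows right-rows :b))

(println (mapv :b semi-result))
```

Semi-join and anti-join on the same key should partition the left dataset: here the semi-join keeps the eight rows with `:b` from 102 to 109, which leaves only 101 for the anti-join.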

genmeblog commented 2 years ago

Fixed in 6.094.1, verify against dplyr (+ created tests)