When tolerance is specified, dedup() returns larger dataset than original, with many exact duplicates

rctatman commented 6 years ago

I've run into a strange bug where, when specifying tolerance for dedup(), the number of rows returned is greater than the number of rows in the original dataset:

library(scrubr)
library(dplyr) # provides %>% and distinct()

dim(iris) # 150 rows
dim(iris %>% dedup()) # 149 rows
dim(iris %>% dedup(tolerance = 0)) # 11067 rows
dim(iris %>% dedup(tolerance = 0.2)) # 9156 rows
dim(iris %>% dedup(tolerance = 0.4)) # 4627 rows
dim(iris %>% dedup(tolerance = 0.6)) # 2640 rows
dim(iris %>% dedup(tolerance = 0.8)) # 431 rows
dim(iris %>% dedup(tolerance = 1)) # 150 rows

These additional rows are exact duplicates and can be removed with distinct(), but this looks like unintended behavior.
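For anyone stuck on an affected version, here is a minimal sketch of the distinct() workaround. It does not call dedup() at all: the `inflated` data frame is a hypothetical stand-in for the oversized result, built by repeating every row of iris three times, so the row counts below are those of the stand-in and not of the real buggy output.

```r
library(dplyr)

# Hypothetical stand-in for the inflated dedup() result:
# every row of iris repeated three times.
inflated <- iris[rep(seq_len(nrow(iris)), times = 3), ]
nrow(inflated) # 450

# distinct() with no column arguments drops rows that match on every column.
cleaned <- distinct(inflated)
nrow(cleaned) # 149, because iris itself contains one exact duplicate row

# Sanity check: a deduplicated result should never have more rows than the input.
stopifnot(nrow(cleaned) <= nrow(iris))
```

The final stopifnot() is the kind of guard that would have caught the behavior reported here, since deduplication can only keep or remove rows.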

sckott commented 6 years ago

👋 @rctatman - sorry about the delay, I was on vacation, and then the email notification got buried.

sckott commented 6 years ago

can you reinstall and try again?

rctatman commented 6 years ago

Looks like it's fixed in version 0.1.3.9321! :+1:
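A rough sketch of how one might check whether an installed copy is at least that version. The helper name `at_least_known_good` is made up for illustration, and 0.1.3.9321 is only the version confirmed working in this thread; earlier development builds may or may not contain the fix.

```r
# Version confirmed as fixed in this thread (not necessarily the first fixed build).
known_good <- package_version("0.1.3.9321")

# Hypothetical helper: TRUE if version string `v` is at least the known-good version.
at_least_known_good <- function(v) package_version(v) >= known_good

at_least_known_good("0.1.3.9321") # TRUE
at_least_known_good("0.1.3")      # FALSE

# With scrubr installed, the equivalent check would be:
# packageVersion("scrubr") >= known_good
#
# Reinstalling the development version, assuming the remotes package is
# available and the repository is at its archived location:
# remotes::install_github("ropensci-archive/scrubr")
```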

ropensci-archive / scrubr, issue #27