gowerc / diffdf

DataFrame Comparison Tool
https://gowerc.github.io/diffdf/

Convert backend to use data.table #117

Open gowerc opened 1 month ago

gowerc commented 1 month ago

One of the main changes we added in the now-abandoned v2 branch was the use of data.table. If memory serves me right, this resulted in a ~30% speed up, mostly due to the improved merge performance over base R. My gut feeling is this is still worth integrating.
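For context, the merge gap is easy to reproduce outside of diffdf. A rough sketch (sizes are arbitrary; this is an illustration, not the diffdf benchmark itself) comparing a base R merge against a keyed data.table merge:

library(data.table)

n <- 1e6
lhs <- data.frame(id = sample(n), x = rnorm(n))
rhs <- data.frame(id = sample(n), y = rnorm(n))

dt_lhs <- as.data.table(lhs)
dt_rhs <- as.data.table(rhs)
setkey(dt_lhs, id)
setkey(dt_rhs, id)

system.time(merge(lhs, rhs, by = "id"))       # base R merge
system.time(merge(dt_lhs, dt_rhs, by = "id")) # keyed data.table merge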

gowerc commented 1 month ago

@kieranjmartin - Assuming you still agree with this idea, do you think it's worth including in 1.1? In theory it shouldn't result in any user-facing changes.

gowerc commented 1 month ago

@kieranjmartin

I have an initial implementation on the 117-data-table branch.

Some benchmarks of performance comparing two completely different datasets with 1,000,000 rows by 61 columns (10 columns of each core datatype); all times in seconds.

Current master branch implementation:

14.522
15.112
14.077
13.786
13.670

mean = 14.2334

data.table implementation:

 8.952
 8.920
10.029
 9.139
 8.794

mean = 9.1668

So roughly a ~36% performance improvement, which I think is consistent with what we saw last time we tried to make this change.
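For reference, a sketch of how the timings above could be reproduced. The exact signature of generate_test_data() is an assumption here (it is the same test helper used further down this thread); only diffdf() itself is the real package API:

library(diffdf)

# Assumed helper: builds a 1,000,000 row x 61 column dataset, with the
# second argument treated as a seed so the two datasets differ completely
x <- generate_test_data(1000000, 1)
y <- generate_test_data(1000000, 2)

times <- vapply(1:5, function(i) {
    system.time(diffdf(x, y))[["elapsed"]]
}, numeric(1))
mean(times)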

I will say the implementation is awkward... data.table has many weird quirks to it. I managed to get all the tests passing, but I wouldn't be surprised if there are edge cases I haven't handled properly yet...

kieranjmartin commented 1 month ago

So assuming we always change it back to a tibble before returning the comparison object, the biggest downside of these changes is for users who want to extend diffdf. They will probably have to interact with data.table, and as you say, if you are not familiar with it, it's basically a completely different language, both in terms of syntax and the behaviour of data.table objects when copied or modified.

The other worry is whether as.data.table might destroy certain properties of the data: column attributes, list columns, or other esoteric behaviours a user might have set up.
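One quick way to probe that worry (a sketch, not exhaustive, and none of it diffdf code): attach a column attribute and a list column, round-trip through as.data.table(), and see what survives:

library(data.table)
library(tibble)

x <- tibble(id = 1:3, meta = list(1, "a", TRUE))
attr(x$id, "label") <- "Subject ID"

y <- as_tibble(as.data.table(x))
attr(y$id, "label")        # is the column attribute preserved?
identical(x$meta, y$meta)  # are list columns preserved?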

gowerc commented 1 month ago

@kieranjmartin sorry, your intent with your comment wasn't clear to me. Are you advising against making this change then?

kieranjmartin commented 1 month ago

Well, I'm just discussing the trade-offs for now. I am not sure, to be honest... I am trying to think whether a 30% increase is worth the risk we take on with regards to internal complexity and possible data fidelity issues. The latter may be a non-issue of course; as.data.table may be extremely faithful.

gowerc commented 1 month ago

I don't really know data.table well enough to comment, though I think some level of information loss is likely acceptable. As an example, tibble strips off row names, which is a loss of information, but no one ever really seems to care about that. I would be utterly amazed if any of the major stuff was lost, e.g. attributes / name / type / mode / class / values, which is all we really care about. For sure we can do some informal testing to check all of those components are unaffected; see the sketch below.
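As a sketch of that informal testing (check_fidelity() is a hypothetical helper, not part of diffdf), the round trip could be checked component by component:

check_fidelity <- function(df) {
    rt <- tibble::as_tibble(data.table::as.data.table(df))
    list(
        names  = identical(names(df), names(rt)),
        types  = identical(vapply(df, typeof, character(1)),
                           vapply(rt, typeof, character(1))),
        modes  = identical(vapply(df, mode, character(1)),
                           vapply(rt, mode, character(1))),
        class  = identical(lapply(df, class), lapply(rt, class)),
        values = isTRUE(all.equal(df, rt, check.attributes = FALSE))
    )
}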

The internal complexity does worry me a bit; data.table has a lot of weird edge cases and non-intuitive syntax (if you aren't already used to it). That being said, a 30% performance gain does seem pretty respectable to me. For a 3 million row labs dataset that's a run time of 27 seconds instead of 42 seconds, which does make a difference to the UX of the package.

gowerc commented 3 weeks ago

Playing around with data.table and going through the docs, I think the following covers the most common cases to be aware of:

library(data.table)

df1 <- data.table(
  id = c(1, 2, 3, 4, 5),
  name = c("A", "B", "C", "D", "E"),
  age = c(21, 22, 23, 24, 25)
)

# Basic assignment is pass by reference
df2 <- df1

# Updates both df1 and df2 (assignment in place)
df1[1, id := 99]
set(df1, 2L, 1L, 88)

# Only updates df1: the `<-` operator used for
# "subassignment" causes a copy to be created,
# so df1 and df2 are now independent objects
df1$id[1] <- -99

# Pass by value (copy)
df2 <- copy(df1)

# Functions are pass by reference
runme <- function(dt) {
    dt[, runmeCol := 1234]
}
runme(df1)
df1

# Filtering always returns a copy, e.g. the following
# doesn't change the value of df1
df1[id > 10, ]

# Pretty much all `set*` functions are done in-place
df1 <- data.table(
    id = c(1, 2, 3, 4, 5),
    age = c(21, NA, 23, NA, 25)
)
setnafill(df1, fill = 0)
setindex(df1)
setkey(df1, id)
setcolorder(df1, "age")   # Reorder columns (`age` moved first)
setorder(df1, age)        # Sort rows by `age`
setnames(df1, "id", "ID") # Rename column `id` to `ID`

I think the general takeaway is: don't use set* or := unless you've called copy() first, and you should be good.
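In practice that rule reduces to one defensive pattern: copy at the function boundary, then mutate freely. A sketch (safe_transform() is a hypothetical example, not diffdf code):

library(data.table)

safe_transform <- function(dt) {
    dt <- copy(dt)            # detach from the caller's object
    dt[, doubled := age * 2]  # in-place, but only on our private copy
    dt[]                      # `[]` so the result prints when returned
}

df1 <- data.table(id = 1:3, age = c(21, 22, 23))
df2 <- safe_transform(df1)
"doubled" %in% names(df1)  # FALSE: the caller's object is untouched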

In terms of data/attributes, I created a data.frame containing one of every type and then cast it in multiple directions:

library(data.table)
library(tibble)

x <- generate_test_data(5, 1)  # Returns a "tibble"

y <- as_tibble(as.data.table(x))
all.equal(x, y)

z1 <- as.data.frame(x)
z2 <- as.data.frame(as.data.table(y))
all.equal(z1, z2)

# The round trip leaves behind the pointer attribute that
# data.table attaches:
attr(y, ".internal.selfref")
#> <pointer: 0x7fb14c80bae0>

My gut feeling is that as long as our functions call copy() before using any by-reference operations, we should be fine to use data.table.
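If the leftover pointer attribute above ever becomes a problem when handing objects back to users, it is easy to strip before returning (a sketch; the attribute name is taken from the output above):

strip_dt_attrs <- function(df) {
    attr(df, ".internal.selfref") <- NULL
    df
}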