Performance issue adding new columns to a DataFrame

rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra

https://rformassspectrometry.github.io/Spectra/


Performance issue adding new columns to a DataFrame #174

Open jorainer opened 3 years ago

jorainer commented 3 years ago

Using a DataFrame to hold the data in most backends also comes at a performance cost. Subsetting is not ideal, but adding or replacing columns is even worse:

library(S4Vectors)
library(microbenchmark)
df <- data.frame(a = 1:1000, b = "b")
DF <- as(df, "DataFrame")
microbenchmark(df$d <- 5, DF$d <- 5, cbind(DF, d = 5))
Unit: microseconds
             expr    min      lq     mean  median      uq    max neval cld
        df$d <- 5   15.3   17.45   24.960   23.65   26.60   67.6   100 a  
        DF$d <- 5 2122.8 6596.00 7152.066 6994.55 7413.55 9968.1   100   c
 cbind(DF, d = 5) 1223.6 1298.40 1548.414 1381.40 1566.15 3930.9   100  b

$ on a DataFrame is very slow (median of about 7 ms in the benchmark above); cbind is already better (about 1.4 ms), but nothing beats the data.frame (about 24 µs). This becomes a real bottleneck if we are, e.g., adding columns in a loop, so we should check whether there is a better way to add or replace data in a DataFrame.

pinging @lgatto @sgibb - maybe you already have a solution for this?

lgatto commented 3 years ago

Indeed, I suppose using the more complex data structure comes at a cost when the simpler data.frame would work too. Adding columns in a loop might simply not be the way to go here. If possible, cbinding two DataFrames would be the sensible option:

> DF10 <- DataFrame(x1 = 1:100, x2 = 1, x3 = 1, x4 = 1, x5 = 1, x6 = 1, x7 = 1, x8 = 1, x9 = 1, x10 = 1)
> microbenchmark(cbind(DF, x = 5), cbind(DF, DF), cbind(DF, DF10))
Unit: microseconds
             expr      min        lq      mean   median       uq      max neval
 cbind(DF, x = 5) 1106.832 1119.0675 1157.4790 1131.755 1160.222 2144.252   100
    cbind(DF, DF)  841.286  847.5270  876.2291  853.081  871.633 1913.417   100
  cbind(DF, DF10)  867.546  873.4325  900.7079  879.722  906.423 1891.465   100
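
A minimal sketch of that idea, assuming the new columns can be computed up front: collect them in a plain list and bind them to the DataFrame in one cbind() call instead of assigning them one by one. The helpers add_in_loop() and add_at_once() and the columns in new_cols are hypothetical names made up for this illustration; they are not part of S4Vectors or Spectra. Whether this pays off for a given workload would still need to be benchmarked.

```r
library(S4Vectors)

DF <- DataFrame(a = 1:1000, b = "b")

# Hypothetical new columns (full length); in real code these would be
# the values computed inside the loop.
new_cols <- list(
    d = rep(5, nrow(DF)),
    e = seq_len(nrow(DF)) * 2,
    f = rep("f", nrow(DF))
)

# Pattern discussed in this thread: one replacement call per column.
add_in_loop <- function(x, cols) {
    for (nm in names(cols))
        x[[nm]] <- cols[[nm]]
    x
}

# Alternative: build a single DataFrame from the list, then cbind once.
add_at_once <- function(x, cols) {
    extra <- do.call(DataFrame, c(cols, list(check.names = FALSE)))
    cbind(x, extra)
}

res_loop <- add_in_loop(DF, new_cols)
res_once <- add_at_once(DF, new_cols)
```

Both helpers are meant to return the same columns, so they can be passed to microbenchmark() in the same way as the calls above to compare them.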

Not sure if that helps though.

bioc-devel has been very helpful in such situations.

ococrook commented 3 years ago

I also have performance issues here, so following this.

lgatto commented 3 years ago

Depending on your application, there's also joinSpectraData() that I use regularly, albeit not yet with very large data sets.

ococrook commented 3 years ago

Yep, I'm using that to join the quant MS data with the ID MS data. It's very good but still slow.

lgatto commented 3 years ago

How slow?
Could you do some profiling?
Have you tried to compare it against a cbind() where you have taken care of matching/subsetting the data?
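
For reference, a cbind() with manual matching could look roughly like the sketch below. The objects x and y, the key columns spectrumId and id, and the score column are invented for illustration only. The matched values are pulled out as a plain vector before binding, so rows of x without a match simply get NA.

```r
library(S4Vectors)

# Hypothetical data: `x` holds the spectra variables, `y` the extra
# annotations keyed by an identifier column.
x <- DataFrame(spectrumId = c("sp1", "sp2", "sp3", "sp4"),
               rtime = c(1.1, 2.2, 3.3, 4.4))
y <- DataFrame(id = c("sp3", "sp1"), score = c(0.9, 0.5))

# Position of each row of `x` in `y`; NA where there is no match.
idx <- match(x$spectrumId, y$id)

# Index the plain column vector (NA indices yield NA values), then
# bind the aligned column in a single cbind() call.
aligned <- DataFrame(score = y$score[idx])
res <- cbind(x, aligned)
```

This only covers the one-to-one (or one-to-none) case; duplicated keys in y would need extra handling.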

ococrook commented 3 years ago

Definitely faster than cbind. I'll do some profiling.
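
One possible way to do such profiling is base R's sampling profiler, Rprof(). The sketch below profiles the column-by-column pattern from the top of this thread; the loop, the column names, and the number of iterations are placeholders and would be replaced by the actual joinSpectraData() or cbind() call of interest.

```r
library(S4Vectors)

DF <- DataFrame(a = 1:1000, b = "b")

# Write profiling samples to a temporary file.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.005)

# Placeholder workload: add 200 columns one at a time.
for (i in seq_len(200))
    DF[[paste0("x", i)]] <- rep(i, nrow(DF))

# Stop profiling and summarise time spent per function.
Rprof(NULL)
prof <- summaryRprof(prof_file)
head(prof$by.total)
```

The by.total table lists functions by the total time spent in them, including their callees, which should help to see where the overhead accumulates.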