deduplicate_labels() is painfully slow on large datasets

hans-ekbrand commented 3 years ago

Thanks for closing the issue with duplicate labels. It works, but unfortunately it is very slow on large data sets. The time spent on importing data is about 100 times longer with deduplicate_labels() than without. My guess is that the implementation could be improved.

Beware that the test file is big: 1.7 GB, and that it will take almost 4 hours to run deduplicate_labels() on it.

library(memisc)
myurl <- "http://hansekbrand.se/temp/test_deduplicate.sav"
z <- tempfile()
download.file(myurl, z, mode = "wb")
my.meta.data <- spss.system.file(z)
## File character set is 'UTF-8'.
## Converting character set to the local 'utf-8'.
## Warning message:
## 1 variables have duplicated labels:
##   SHDISTRI 

####  The next step takes almost 4 hours on my machine
fixed.meta.data <- deduplicate_labels(my.meta.data)

Importing a subset of the file without running deduplicate_labels() takes only a few minutes.

my.subset <- c("HHID", "HVIDX", "HV000", "HV001", "HV002", "HV005", "HV006", 
"HV007", "HV009", "HV013", "HV014", "HV016", "HV024", "HV025", 
"HV028", "HV201", "HV204", "HV205", "HV207", "HV208", "HV209", 
"HV210", "HV211", "HV212", "HV213", "HV214", "HV215", "HV216", 
"HV221", "HV225", "HV226", "HV227", "HV228", "HV230A", "HV236", 
"HV237", "HV239", "HV241", "HV242", "HV243B", "HV243C", "HV243D", 
"HV244", "HV245", "HV246", "HV247", "HV271", "SH36", "HV101", 
"HV104", "HV105", "HV106", "HV108", "HV111", "HV112", "HV113", 
"HV114", "HV140", "HC60")
names(my.subset) <- my.subset
my.ds <- subset(my.meta.data, select = my.subset)
my.df <- as.data.frame(within(my.ds, {
    missing.values(HV112) <- c("Mother not in household")
    }))

Is there a way to speed up deduplicate_labels()?

melff commented 3 years ago

deduplicate_labels() is not intended to be applied to importer objects. I suggest you use the function only after loading the data using subset() or as.data.set(). Anyway, your website seems to be down at the moment, which prevents me from reproducing and debugging the problem. Could you make the data available to me?
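
A minimal sketch of that order of operations, using a small synthetic data set rather than the 1.7 GB file. The variables x1 and x2 and their labels are made up for illustration; with the real file, the data.set would instead come from subset() or as.data.set() applied to the importer object, and deduplicate_labels() would be called on that result.

```r
library(memisc)

# Hypothetical item with duplicated labels ("A" and "B" each appear twice),
# standing in for a variable such as SHDISTRI.
x1 <- as.item(rep(1:5, 4),
              labels = c(A = 1, A = 2, B = 3, B = 4, C = 5))

# Hypothetical item without duplicated labels.
x2 <- as.item(rep(1:4, 5),
              labels = c(i = 1, ii = 2, iii = 3, iv = 4))

# Build an in-memory data.set first (with a real file: subset() or
# as.data.set() on the importer), and only then deduplicate.
ds <- data.set(x1, x2)
ds_dedup <- deduplicate_labels(ds)

codebook(ds_dedup)
```

The point of the ordering is that deduplicate_labels() then only has to process data that is already loaded, and only the variables that were actually selected.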

hans-ekbrand commented 3 years ago

Thanks for your rapid response! Your advice was spot on.

melff / memisc

deduplicate_labels() is painfully slow on large datasets #53