michaelhallquist / MplusAutomation

The MplusAutomation package leverages the flexibility of the R language to automate latent variable model estimation and interpretation using Mplus, a powerful latent variable modeling program developed by Muthen and Muthen (www.statmodel.com). Specifically, MplusAutomation provides routines for creating related groups of models, running batches of models, and extracting and tabulating model parameters and fit statistics.

Performance of write.table #54

Closed. sda030 closed this issue 6 years ago.

sda030 commented 7 years ago

I noticed that you use as.data.frame before write.table (see here) for both normal and imputed datasets. Any reason behind this? As Mplus only takes numerical data (or so I believe), you could consider as.matrix instead, which apparently would be about a 6x speed improvement, without having to depend on other packages. (Matrices can still have colnames, should you need them.)
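For what it's worth, a rough timing sketch of that comparison, assuming purely numeric columns (illustrative only; the exact factor will depend on the data and the disk):

    ## Illustrative benchmark only; the speedup will vary by data size and system.
    df <- as.data.frame(matrix(rnorm(1e6), ncol = 40))

    # writing the data frame directly
    system.time(write.table(df, tempfile(), row.names = FALSE, col.names = FALSE))

    # converting to a numeric matrix first, as suggested above
    system.time(write.table(as.matrix(df), tempfile(), row.names = FALSE, col.names = FALSE))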

JWiley commented 7 years ago

I think the reason is that a lot of the code is written for data frames and lists. We loop through variables with lapply(), for example. That doesn't work with matrices. There are alternatives, of course, so it's not that it cannot be written with matrices, but that it is not currently. If the hangup is solely at the write.table() phase, we could do a pre-conversion in R to a matrix immediately before writing to disk, along the lines of write.table(as.matrix(data)).

If that makes a meaningful difference I can make that change pretty quickly.

I'd be interested to see any data on how much time is spent on data writing in any practical application.
I often work with datasets of thousands of variables, but any given Mplus model normally uses, say, 4 to 40 variables and a few thousand cases, so even inefficient reading and writing ends up being pretty trivial. I'd guess about 95% of my compute time is spent on model estimation right now, but as I said, I'd like to see other examples. If you or others can provide examples where the time spent in R is non-trivial compared to the time Mplus spends estimating the model, I would be open to considering a rewrite of some of the internals.

I'd be tempted to move to the data.table (preferred) or dplyr (less preferred) universe and create a separate function that takes the original data and stores it specially. That step could do type checking, etc., enforcing only numeric or integer types, with factors and logicals coerced and characters triggering a warning. Several other steps we do could be faster in data.tables as well, and fwrite() can give roughly a 30x speed-up versus write.table(). An added benefit for anyone already using data.tables is that we could reduce the memory footprint, since they can be passed by reference and we would only need to write out the relevant columns. That said, this would be a fairly large undertaking for no new features, so it really needs a good case that it would result in performance gains.
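A rough sketch of what that separate function might look like under the data.table approach; the function name, the tab separator, and the missing-value code are assumptions for illustration, not anything currently in MplusAutomation:

    library(data.table)

    ## Hypothetical helper illustrating the proposed data.table path; the name,
    ## the missing-value code, and the tab separator are assumptions, not
    ## current MplusAutomation behaviour.
    prepare_and_write_dt <- function(dat, file) {
      dt <- as.data.table(dat)
      for (v in names(dt)) {
        x <- dt[[v]]
        if (is.factor(x) || is.logical(x)) {
          # coerce factors/logicals to integer codes so only numeric columns remain
          set(dt, j = v, value = as.integer(x))
        } else if (is.character(x)) {
          warning("character column will not be written correctly for Mplus: ", v)
        }
      }
      # fwrite() is the fast writer; "." stands in for the MISSING code
      fwrite(dt, file, sep = "\t", col.names = FALSE, na = ".")
      invisible(dt)
    }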

JWiley commented 6 years ago

@sda030 I've pushed changes that should make this considerably faster. I've also begun some backend changes that over time may allow matrices to be directly passed and utilized as is for greater speed and efficiency, if that is how your data are stored.

cjvanlissa commented 6 years ago

Dear Joshua, I've brought up the issue of efficiency with Michael before. My models are often pretty large, and MplusAutomation routinely takes between 10 seconds (for random intercept cross-lagged panel models) and 10 minutes (for Bayesian DSEM models) to read the output. I'd be interested to contribute to a project to improve the efficiency of the code.

JWiley commented 6 years ago

@cjvanlissa that sounds pretty painful for the Bayesian DSEM. I know there are places we could speed up.
I'd suggest the next steps are:

  1. A dataset of how many seconds it takes to read many different models on the same machine (perhaps relying on the Mplus User Guide examples?)

  2. Based on the data, profile a couple of the worst offenders to see where the most time is spent in MplusAutomation code.

  3. Plan out and edit / re-write code as appropriate.

Given time constraints, I'm not in a position to make promises about how much we can improve right now, but if you could work on a timings dataset and give me a couple of output files, I could at least profile the code. That should be enough to give us some sense of what we are dealing with. My strong hope is that there are a few bottlenecks and we could focus on improving those. If we find that no small set of functions is to blame, but it's just an accumulation of time across almost all of the readModels process, then I'm not sure how best to proceed. It's a pretty bearish code base to try to overhaul completely.
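For step 2, a minimal profiling sketch with base R's Rprof (the output file name is a placeholder for one of the slow files discussed above):

    library(MplusAutomation)

    ## "slow_model.out" is a placeholder for one of the slow outputs discussed above.
    Rprof("readModels.prof")
    res <- readModels("slow_model.out")
    Rprof(NULL)

    ## Show the functions where the most total time was spent.
    head(summaryRprof("readModels.prof")$by.total, 20)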

Another place to potentially help would be building a database of output files AND some "true" information about them that could be used to build a more robust test suite. One of my big worries about messing around with the code base is that we do not have a robust set of tests, which makes it difficult to know whether a change breaks anything. To build these tests we need: (1) output files and (2) maybe a data frame or something similar of verified "true" results, so that we could have code that compares what we read in with the truth and gives an error if they deviate. Enough tests like that and we could feel more confident making bigger changes.
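As a sketch of what such a test could look like with testthat (the paths and the stored "truth" object are hypothetical placeholders):

    library(testthat)
    library(MplusAutomation)

    ## Hypothetical regression test: compare a fresh parse against a previously
    ## verified snapshot saved as an .rds file. Paths are placeholders.
    test_that("readModels reproduces verified unstandardized estimates", {
      parsed <- readModels("tests/outputs/ex5_1.out")
      truth  <- readRDS("tests/truth/ex5_1_unstandardized.rds")
      expect_equal(parsed$parameters$unstandardized, truth, tolerance = 1e-6)
    })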

michaelhallquist commented 6 years ago

Hi all,

Just to add onto Josh's comment, I have done some profiling of the codebase, particularly for readModels (and its subsidiary functions). Long story short, I tried to optimize some of the biggest offenders in v0.7-1 of the package, though this only improved speed by about 10% in most cases. I mention this mostly to say that the 'easy stuff' in terms of optimization has likely already been covered.

My hunch is that your super-long read time on the DSEM may have to do with importing large datasets associated with the output file, particularly the output of BPARAMETERS. It's possible we may be able to speed these up substantially by switching away from base R to readr or data.table import functions that are optimized for speed.
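For example, timing the import of the saved BPARAMETERS file on its own would show whether that step dominates (the file name here is a placeholder):

    ## Quick check of base R vs. data.table for a large whitespace-delimited
    ## Mplus save file; "bparams.dat" is a placeholder.
    system.time(b1 <- read.table("bparams.dat", header = FALSE))
    system.time(b2 <- data.table::fread("bparams.dat", header = FALSE))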

Caspar, would you be able to provide the model that takes 10 minutes? If the files are too large to upload here, feel free to email a link to me directly.

Thanks, Michael

cjvanlissa commented 6 years ago

I'm happy to provide one of these output files, but I don't think I have one on hand now... the estimation also takes an extremely long time.

Regarding structural solutions, I thought about the following:

  1. Split the entire output file by multiple spaces in C++
  2. Return an object with the split lines
  3. Regex in R to identify line numbers for section headers
  4. Split the object from step 2 at the section headers

Sections could also be split based on the number of entries per line in the object from step 2; consecutive lines with an equal number of entries are usually (I believe always) a sub-section of the output.
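A rough R-only sketch of that idea, with the header pattern being only a first approximation and "model.out" a placeholder file name:

    ## Illustrative sketch of the proposed splitting strategy.
    lines  <- readLines("model.out")

    ## Steps 1-2: split each trimmed line on runs of two or more spaces.
    tokens <- strsplit(trimws(lines), "\\s{2,}")

    ## Step 3: flag candidate section headers, e.g. all-caps lines like "MODEL RESULTS".
    is_header <- grepl("^[A-Z][A-Z ]+$", trimws(lines))

    ## Step 4: split the token list into sections at the header positions.
    sections <- split(tokens, cumsum(is_header))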