Open quaquel opened 1 year ago
A further benefit of the longer-term solution is that an error in one of the experiments (due to an edge case, division by zero, etc.) doesn't mean you have to redo all experiments.
I ran a quick test using memray. In my test case, memory usage went from 2.6 GB to 1.9 GB, a reduction of roughly 27%. It thus seems that creating a directory, writing all results to this directory, and turning this directory into a tarball is an easy way of getting a sizeable reduction in memory use.
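For illustration, the directory-then-tarball approach could look roughly like the sketch below. It is not the workbench's actual `save_results`: the function name `save_results_via_directory` and the layout of `outcomes` (a dict mapping an outcome name to an iterable of rows) are assumptions made only for this example, and it uses nothing beyond the standard library.

```python
import csv
import tarfile
import tempfile
from pathlib import Path


def save_results_via_directory(outcomes, file_name):
    """Write each outcome to its own CSV on disk, then tar the directory.

    `outcomes` is assumed to be a dict mapping an outcome name to an
    iterable of rows (hypothetical layout, not the workbench's own).
    """
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)

        for name, rows in outcomes.items():
            # Only one CSV file is open and being written at any time.
            with open(tmp_path / f"{name}.csv", "w", newline="") as fh:
                csv.writer(fh).writerows(rows)

        # Add the files on disk to the archive one by one, instead of
        # first assembling the whole tarball in an in-memory buffer.
        with tarfile.open(file_name, "w:gz") as archive:
            for csv_file in sorted(tmp_path.glob("*.csv")):
                archive.add(csv_file, arcname=csv_file.name)

    return file_name
```

The temporary directory is removed when the `with` block exits, so only the tarball remains on disk.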
With #299 we get even better support for running on HPC. However, the existing way in which results are stored does not scale well to a very large number of experiments or to high-dimensional data. Presently, the results are stored as a collection of CSVs wrapped in a tarball. The main advantage of this is that the results are easy to unpack and open with any text editor or even Excel. It is also a very convenient way of storing results in a cross-platform, cross-language way. However, it breaks down with large outputs because you will run into memory errors.
A short-term solution is to change `save_results`. It currently builds up the entire tarball in memory before flushing it to disk. A slightly more memory-efficient solution is to create a directory on disk, write each CSV file to it, and then turn the entire directory into a tarball. Some memory profiling is likely needed to see how much of a difference this will make.

A longer-term solution is to add other storage solutions where results are flushed to disk as they come in. This avoids having to build up the very large results dataset in memory. The basic machinery for this is in place because of the callback keyword argument that is passed to `perform_experiments`. It probably requires, however, a minor rethink of how to capture the serialization of all classes of outcomes (i.e., `to_disk` and `from_disk`). Depending on the chosen storage solution, a slightly different serialization will be required.
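To make the longer-term idea a bit more concrete, here is a minimal sketch of a callback that appends each result to disk as soon as it arrives. Everything in it is an assumption for illustration: the class name `StreamingCSVCallback`, the call signature (an experiment identifier plus a dict of outcome name to values), and the one-CSV-per-outcome layout are hypothetical and do not reflect the callback interface that `perform_experiments` actually expects.

```python
import csv
from pathlib import Path


class StreamingCSVCallback:
    """Hypothetical callback that appends each result to disk on arrival."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def __call__(self, experiment_id, outcomes):
        # Assumed call signature: an identifier plus a dict mapping an
        # outcome name to a sequence of values for this one experiment.
        for name, values in outcomes.items():
            # Append mode: nothing from earlier experiments is kept in memory.
            with open(self.directory / f"{name}.csv", "a", newline="") as fh:
                csv.writer(fh).writerow([experiment_id, *values])
```

With a design along these lines, rows written before a failing experiment stay on disk, so a late error would not force all experiments to be rerun.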