FeargusOG closed this pull request 3 years ago
Thanks Feargus for this PR, it looks like you spent considerable time on it which I really appreciate! It seems like an interesting change that can indeed save some memory.
Before I comment on the specific code lines, there are some issues I can see at a high level, some from first principles and some of a logistical nature.
These are just some initial things for us to think about. I think I need to do research for the second point, but I'd be curious to hear your thoughts on the first point.
Incidentally, I've been thinking about other avenues for reducing memory consumption: the memory usage that is most problematic is when doing a large batch of files - I think the memory/data structures are not released after each file is processed.
This change may ameliorate this, but it could also be worthwhile to address the root cause.
Hey Michael!
Thanks for taking a look! Agreed that there is a bunch of stuff to be improved here - I wanted to get it in front of you and see if you had any thoughts or objections to it at a conceptual level before really investing time in getting it polished. And also use this PR as a venue to bat out any ideas on it.
Also agreed that the duplication is not ideal and is something that I would be looking to remove. I initially went down the route of updating the code in place but it was both trickier to get it right for a POC and much harder to review. So I implemented it as a distinct branch in the code so that it would be easier to see what is going on. I actually could have explained that thinking better in my opening PR. If you were happy to go ahead with this change at a conceptual level, I'd love to dig in and rework it to remove this duplication.
Those two points you raised from a Google perspective are interesting and something I hadn't considered. Particularly with Boost, I had assumed it was approved in its totality since we were using it already; I didn't realise each component library was considered in isolation. The theory listed in the style guide, that some Boost libraries encourage coding practices which can hamper readability, does make sense. However, in one sense, the `mmd-vector` library abstracts the underlying `interprocess` library away in a way that removes any impact on readability. But that, and the importing of a new repo, is absolutely your call!
Your point about the memory consumption for batch files is interesting. The batch functionality is very likely something I could have given more TLC while working on the project! If there is a resource leak I'd love to take a look at fixing that; unless you already have a PR in the works for it?
In terms of fixing the batch leak and this memory mapping change here, the two would complement each other. Once we fix the batch leak, this change would still deliver the additional 66% reduction in peak memory consumption demonstrated here.
Looking forward to hearing your thoughts! Feargus
The thing that I am more interested in after looking over the underlying mapping is the actual gain here. Most systems support virtual memory, which is also disk-backed memory. If the system runs out of ram and uses disk, is the boost-mapping simply hiding the memory allocation from the system's memory trackers that aren't aware of the boost mapped swap space? It seems the main difference is the MmdVector will prioritize disk swap over RAM. I'm unclear on how useful that is. Maybe you have already thought about this and have some opinions.
I think there are a couple of reasons to prefer this approach over a reliance on swap space. The key reason is that this is a targeted approach. ViSQOL instantiates many matrices, but the vast majority of them are relatively small. It is during the initial processing phase that a few very large matrices are created, and it is just these few that are memory mapped. This is why execution time doesn't suffer. There is also the general, system-wide performance degradation seen when RAM is exhausted and swap space is being heavily relied upon, which this solution avoids.
To me this approach is akin to processing a large file on disk by streaming chunks, rather than reading the full thing into memory and relying on swap space to allow for this.
I'm still unsure about importing the new repo but my guess is that as long as there are no appropriate existing or more popular alternatives, we would be able to import it, although it takes a bit of time. So that's probably not a blocker that we need to sort out before these more conceptual issues.
With regard to the existence of alternative libraries, before implementing my library I looked for alternatives but could not find anything that offered the needed functionality in a manner that allowed for integration with the `armadillo` auxiliary memory constructor. There were indeed a number of libraries that offered functionality similar to the Boost memory-mapping library, but they would of course require the same wrapping to be utilised.
I appreciate the thoughts on batch files. I think we currently don't have a memory leak; it's just that we hold on to the memory for far too long (we should write the results to disk or screen after each batch element instead of waiting till all elements are processed). I haven't touched that yet.
Ah ok. That makes sense. I will start looking into this.
Thanks, Feargus
Hi Feargus,
Thank you for the explanations. After reviewing it, it doesn't seem obvious to me that it will benefit users in practice, since most users have enough RAM and swap space for normal usage, and there may be negative performance implications for some systems (e.g. with slow disk). Given the complexity of the changes and extra deps and the rework you would need to do, I'm leaning towards recommending not pursuing this change unless there are some real use case benefits or requests for it. I did consider if it would be useful in batch mode. It's possible there is some performance increase there due to the prioritization, but I think it would be more like a band-aid; the cost of virtual memory is acceptable there until we address the root cause of holding on to too much memory for longer than we need to.
I'm very grateful and aware that you took care to make this PR, and that you might have additional thoughts or results that I hadn't considered. Please let me know what you think.
Hi @mchinen ,
Following your points re. the "normal usage" of `visqol` being short files (in which memory consumption is not that large either way), I agree that the added complexity of integrating this memory mapping is not worth it, so I will close this off.
I appreciate your detailed notes on this PR!
All the best, Feargus
Hi @mchinen !
I've been working on a solution to the high peak memory consumption that we previously identified.
Description
An approach that I felt would work would be to utilize Boost's memory mapping functionality to offload some of the large matrices to the filesystem, rather than keeping them in RAM.
I implemented a generic, `vector`-backed approach to this, called `mmd-vector`. It's a simple RAII implementation that manages the lifecycle of the memory mapping while exposing the mapped memory for use. I created a separate library for this as it is a generic approach that has uses outside of ViSQOL.

The Armadillo `matrix` class exposes a constructor that takes a pointer to auxiliary memory. The mapped memory exposed by the `mmd-vector` library can therefore be used to construct a memory-mapped matrix.

I was selective in which matrices to map, choosing a number of particularly large matrices created during the peak of memory consumption, thereby reducing the impact on execution speed and filesystem I/O.
As a caveat, this was tested on an Ubuntu machine with an SSD. I have not yet tested this on Mac or Windows, or with an HDD. For this reason, I have turned memory mapping off by default, but added a feature flag `--use_memory_mapping` which can be used to enable it.

Results
I ran some `massif` memory profiling and `time` tests (using `guitar48_stereo.wav` and `guitar48_stereo_64kbps_aac.wav` as inputs) and the results were really good!

Memory Consumption
Memory consumption fell dramatically, from 193.5MB at peak to 66.21MB.

Execution Time
Additionally, execution time increased negligibly, from 10.594s without memory mapping to 10.673s with memory mapping.

Looking forward to hearing your thoughts on this changeset! I know there is a lot of code here, so reviewing it will take some time. But I would be curious to hear your initial, high-level thoughts on the approach.
Thanks, Feargus