FeargusOG closed this pull request 3 years ago
Thanks Feargus for this PR, it looks like you spent considerable time on it which I really appreciate! It seems like an interesting change that can indeed save some memory.
Before I comment on the specific code lines, there are some issues I can see at a high level, some from first principles and some of a logistical nature.
These are just some initial things for us to think about. I think I need to do research for the second point, but I'd be curious to hear your thoughts on the first point.
Incidentally, I've been thinking about other avenues for reducing memory consumption: the memory usage that is most problematic is when doing a large batch of files - I think the memory/data structures are not released after each file is processed.
This change may ameliorate this, but it could also be worthwhile to address the root cause.
Hey Michael!
Thanks for taking a look! Agreed that there is a bunch of stuff to be improved here - I wanted to get it in front of you and see if you had any thoughts or objections to it at a conceptual level before really investing time in getting it polished. And also use this PR as a venue to bat out any ideas on it.
Also agreed that the duplication is not ideal and is something that I would be looking to remove. I initially went down the route of updating the code in place but it was both trickier to get it right for a POC and much harder to review. So I implemented it as a distinct branch in the code so that it would be easier to see what is going on. I actually could have explained that thinking better in my opening PR. If you were happy to go ahead with this change at a conceptual level, I'd love to dig in and rework it to remove this duplication.
Those two points you raised from a Google perspective are interesting and something I hadn't considered. Particularly with Boost, I had assumed it was approved in its totality since we were using it already; I didn't realise each component library was considered in isolation. The theory listed in the style guide, that some Boost libraries encourage coding practices which can hamper readability, does make sense. However, in one sense, the `mmd-vector` library abstracts the underlying `interprocess` library away in a way that removes any impact on readability. But that, and the importing of a new repo, is absolutely your call!
Your point about the memory consumption for batch files is interesting. The batch functionality is very likely something I could have given more TLC while working on the project! If there is a resource leak I'd love to take a look at fixing that; unless you already have a PR in the works for it?
In terms of fixing the batch leak and this memory mapping change here, the two would complement each other. Once we fix the batch leak, this change would still deliver the additional 66% reduction in peak memory consumption demonstrated here.
Looking forward to hearing your thoughts! Feargus
The thing that I am more interested in after looking over the underlying mapping is the actual gain here. Most systems support virtual memory, which is also disk-backed memory. If the system runs out of ram and uses disk, is the boost-mapping simply hiding the memory allocation from the system's memory trackers that aren't aware of the boost mapped swap space? It seems the main difference is the MmdVector will prioritize disk swap over RAM. I'm unclear on how useful that is. Maybe you have already thought about this and have some opinions.
I think there are a couple of reasons to prefer this approach over a reliance on swap space. The key reason is that this is a targeted approach. ViSQOL instantiates many matrices, but the vast majority of them are relatively small. It is during the initial processing phase that a few very large matrices are created, and it is just these few that are memory mapped. This is why execution time doesn't suffer. There is also the general, system-wide performance degradation seen when RAM is exhausted and swap space is being heavily relied upon, which this solution avoids.
To me this approach is akin to processing a large file on disk by streaming chunks, rather than reading the full thing into memory and relying on swap space to allow for this.
I'm still unsure about importing the new repo but my guess is that as long as there are no appropriate existing or more popular alternatives, we would be able to import it, although it takes a bit of time. So that's probably not a blocker that we need to sort out before these more conceptual issues.
With regard to the existence of alternative libraries, before implementing my library I looked for alternatives but could not find anything that offered the needed functionality in a manner that allowed for integration with the `armadillo` auxiliary memory constructor. There were indeed a number of libraries that offered functionality similar to the Boost memory-mapping library, but they would of course require the same wrapping to be utilised.
I appreciate the thoughts on batch files. I think we currently don't have a memory leak; it's just that we hold on to the memory for far too long (we should write the results to disk or screen after each batch element instead of waiting till all elements are processed). I haven't touched that yet.
Ah ok. That makes sense. I will start looking into this.
Thanks, Feargus
Hi Feargus,
Thank you for the explanations. After reviewing it, it doesn't seem obvious to me that it will benefit users in practice, since most users have enough RAM and swap space for normal usage, and there may be negative performance implications for some systems (e.g. with slow disk). Given the complexity of the changes and extra deps and the rework you would need to do, I'm leaning towards recommending not pursuing this change unless there are some real use case benefits or requests for it. I did consider if it would be useful in batch mode. It's possible there is some performance increase there due to the prioritization, but I think it would be more like a band-aid; the cost of virtual memory is acceptable there until we address the root cause of holding on to too much memory for longer than we need to.
I'm very grateful and aware that you took care to make this PR, and that you might have additional thoughts or results that I hadn't considered. Please let me know what you think.
Hi @mchinen ,
Following your points re. the "normal usage" of `visqol` being short files (in which memory consumption is not that large either way), I agree that the added complexity of integrating this memory mapping is not worth it, so I will close this off.
I appreciate your detailed notes on this PR!
All the best, Feargus
Hi @mchinen !
I've been working on a solution to the high peak memory consumption that we previously identified.
Description
An approach that I felt would work would be to utilize Boost's memory mapping functionality to offload some of the large matrices to the filesystem, rather than keeping them in RAM.
I implemented a generic, `vector`-backed approach to this, called `mmd-vector`. It's a simple RAII implementation that manages the lifecycle of the memory mapping while exposing the mapped memory for use. I created a separate library for this as it is a generic approach that has uses outside of ViSQOL.

The Armadillo `matrix` class exposes a constructor that takes a pointer to auxiliary memory. The mapped memory exposed by the `mmd-vector` library can therefore be used to construct a memory-mapped matrix.

I was selective in which matrices to map, choosing a number of particularly large matrices created during the peak of memory consumption, thereby reducing the impact on execution speed and filesystem I/O.
As a caveat, this was tested on an Ubuntu machine with an SSD. I have not yet tested this on Mac or Windows, or with an HDD. For this reason, I have turned memory mapping off by default, but added a feature flag `--use_memory_mapping` which can be used to enable it.

Results
I ran some `massif` memory profiling and `time` tests (using `guitar48_stereo.wav` and `guitar48_stereo_64kbps_aac.wav` as inputs) and the results were really good!

Memory Consumption
Memory consumption fell dramatically, from 193.5MB at peak to 66.21MB.

Execution Time
Additionally, execution time increased negligibly, from 10.594s without memory mapping to 10.673s with memory mapping.

Looking forward to hearing your thoughts on this changeset! I know there is a lot of code here, so reviewing it will take some time. But I would be curious to hear your initial, high-level thoughts on the approach.
Thanks, Feargus