tbeu / matio

MATLAB MAT File I/O Library
https://matio.sourceforge.io
BSD 2-Clause "Simplified" License

Loading a 100Mb .mat file produces peak RSS of 20Gb #55

Closed vadimkantorov closed 6 years ago

vadimkantorov commented 7 years ago

I'm using matio 1.5.10 and matio-ffi.torch. I have a 100 MB file that makes matio allocate a suspiciously large amount of memory:

du -h SelectiveSearchVOC2007trainval.mat.edgeboxes.mat
# 93M     SelectiveSearchVOC2007trainval.mat.edgeboxes.mat

/usr/bin/time -f %M th -e '(require "matio").load("SelectiveSearchVOC2007trainval.mat.edgeboxes.mat")' 
# 20407944 KiB

Probably I'm missing something obvious, but such memory consumption seems a little fishy to me. Doing the same with matdump gives:

/usr/bin/time -f %M matdump SelectiveSearchVOC2007trainval.mat.edgeboxes.mat > log.txt
# 5102308

Does this discrepancy of 5 GB vs. 20 GB mean that matio-ffi.torch is using matio sub-optimally?

log.txt contains the substring Empty many times:

grep Empty log.txt | wc -l
# 50210218

wc -l log.txt
# 50230273 log.txt

File uploaded to my OneDrive: https://1drv.ms/u/s!Apx8USiTtrYmprRlRQmgSbPJNcWzEw

tbeu commented 7 years ago

There are 2 variables with 5011x5011 cells. Each cell allocates 80 bytes for the matvar_t struct, whether or not the cell is empty. Thus 2*(5011*5011 + 1)*80 bytes = roughly 3.7 GiB get allocated just for the data structures (not yet including the cell data). It is this per-cell overhead that causes the high memory consumption.

As a hack, the empty cells can be freed by

diff --git a/src/mat5.c b/src/mat5.c
index 075d3d2..cd281f6 100644
--- a/src/mat5.c
+++ b/src/mat5.c
@@ -1792,6 +1792,8 @@ ReadNextCell( mat_t *mat, matvar_t *matvar )
             nbytes = uncomp_buf[1];
             if ( !nbytes ) {
                 /* empty cell */
+                Mat_VarFree(cells[i]);
+                cells[i] = NULL;
                 continue;
             } else if ( uncomp_buf[0] != MAT_T_MATRIX ) {
                 Mat_VarFree(cells[i]);

which helps in your case. But technically the behavior is no longer the same, and I do not feel comfortable committing this change. I'd rather recommend that you get rid of the high-dimensional cell array.

tbeu commented 7 years ago

Just noticed that if MATLAB reads such a cell array with empty cells, it does not allocate the usual array header (overhead). Hence, I will think about the hack and its consequences for, e.g., Mat_VarSize or Mat_VarPrint.

vadimkantorov commented 7 years ago

Thanks for the very fast response!

Now I see, I would strip the extra empty dimensions. Somehow boxes{1} still returns the cell value, while boxes{1, 1} returns an empty cell (boxes is one of the two variables).

The issue might still be troublesome in an adversarial / DoS setting.

tbeu commented 7 years ago

Can you please test and confirm that 464de5c1a2dcc611849b3b39701e3df682234be6 also solves the issue for matio-ffi.torch? Thanks.

vadimkantorov commented 7 years ago

matio-ffi.torch still uses 4x more memory (matdump peaks at 400 MB, matio-ffi.torch at 1600 MB), but it's no longer a problem in my case. I guess fixing it would require digging deeper into matio-ffi.torch. Thanks for the quick patch!

tbeu commented 7 years ago

Do you know whether the load function of matio-ffi.torch reads only the variable info, or the complete variable including its data? In the latter case you would need to compare with matdump -d, which also retrieves the data (and doubles the memory consumption for your file).

/usr/bin/time -f %M ./tools/matdump SelectiveSearchVOC2007trainval.mat.edgeboxes.mat > log.txt
# 391452
/usr/bin/time -f %M ./tools/matdump -d SelectiveSearchVOC2007trainval.mat.edgeboxes.mat > log.txt
# 779612

tbeu commented 7 years ago

I also noticed that the test suite currently lacks struct/cell arrays with empty fields or cells. I'll add these cases.

vadimkantorov commented 7 years ago

It definitely loads the data. What's more, from what I can see, matio-ffi.torch copies the data, hence potentially one more doubling.

tbeu commented 7 years ago

Finally, this explains the doubled memory consumption of matio-ffi.torch w.r.t. matdump -d. Thanks.

tbeu commented 6 years ago

Performance comparison

1. Matio v1.5.10

77517481d3df2007541974c97740368d9b19c5c5 (last commit before dd1d2cd710c762791a5e5b6d36d5098588a88db0), so basically matio v1.5.10 (and earlier)

/usr/bin/time -f "%es %MK" ./tools/matdump SelectiveSearchVOC2007trainval.mat.edgeboxes.mat > log.txt
# 474.62s 3639144K
/usr/bin/time -f "%es %MK" ./tools/matdump -d SelectiveSearchVOC2007trainval.mat.edgeboxes.mat > log.txt
# 752.46s 3629448K

2. Upcoming matio v1.5.11

dd1d2cd710c762791a5e5b6d36d5098588a88db0 as part of upcoming matio v1.5.11

/usr/bin/time -f "%es %MK" ./tools/matdump SelectiveSearchVOC2007trainval.mat.edgeboxes.mat > log.txt
# 34.50s 2744772K
/usr/bin/time -f "%es %MK" ./tools/matdump -d SelectiveSearchVOC2007trainval.mat.edgeboxes.mat > log.txt
# 56.12s 3132860K

For memory usage, this is simply a compromise between performance and backward compatibility. But I was surprised by the observed speed improvement of more than one order of magnitude.

3. Even better

The best values are obtained if empty cells are freed again, as in the hack mentioned in https://github.com/tbeu/matio/issues/55#issuecomment-284178564. However, such a patch would not be backward-compatible: API functions such as Mat_VarPrint or Mat_VarWrite would then produce different output.

/usr/bin/time -f "%es %MK" ./tools/matdump SelectiveSearchVOC2007trainval.mat.edgeboxes.mat > log.txt
# 18.17s 391072K
/usr/bin/time -f "%es %MK" ./tools/matdump -d SelectiveSearchVOC2007trainval.mat.edgeboxes.mat > log.txt
# 40.91s 779360K