Marian 1.9.0 requires roughly 30-100% more CPU memory than 1.7.6 in GPU decoding

frzme commented 4 years ago

The RSS of Marian 1.9.0 seems to be roughly twice as high as that of 1.7.6 (02f4af4eeefa79a24cd52d279a5d4d374423d631)

We are running multiple instances of marian-server on a machine with 16GB of RAM and an Nvidia T4 GPU. With 1.9.0 it is no longer possible to run the same number of instances. All instances are configured with an RNN translation model.

Output of ps aux for 1.9.0 looks like this

mt          17  0.1  7.0 8739420 2281416 ?     Sl   14:47   0:06  /marian/marian-server -c model/config.yml --port 8080 -w 256
mt          29  0.1  3.7 7609444 1221032 ?     Sl   14:47   0:03  /marian/marian-server -c model/config.yml --port 8081 -w 256
mt          41  0.1  4.0 7877104 1317996 ?     Sl   14:47   0:04  /marian/marian-server -c model/config.yml --port 8082 -w 256
mt          53  0.1  3.9 7814612 1284828 ?     Sl   14:47   0:04  /marian/marian-server -c model/config.yml --port 8083 -w 256
mt          65  0.1  3.7 7612860 1226752 ?     Sl   14:47   0:04  /marian/marian-server -c model/config.yml --port 8084 -w 256
mt          77  0.1  6.1 7697944 2010200 ?     Sl   14:47   0:05  /marian/marian-server -c model/config.yml --port 8085 -w 256
mt          89  0.1  4.8 7811908 1566724 ?     Sl   14:47   0:05  /marian/marian-server -c model/config.yml --port 8086 -w 256
mt         101  0.1  4.2 7603044 1381960 ?     Sl   14:47   0:04  /marian/marian-server -c model/config.yml --port 8087 -w 256
mt         113  0.1  6.2 7724384 2031064 ?     Sl   14:47   0:05  /marian/marian-server -c model/config.yml --port 8088 -w 256
mt         125  0.1  4.9 7628728 1622356 ?     Sl   14:47   0:05  /marian/marian-server -c model/config.yml --port 8089 -w 256
mt         139  0.1  4.6 7609540 1498732 ?     Sl   14:47   0:05  /marian/marian-server -c model/config.yml --port 8090 -w 256
mt         151  0.1  3.7 7650484 1233120 ?     Sl   14:47   0:04  /marian/marian-server -c model/config.yml --port 8091 -w 256
mt         163  0.1  5.8 7581884 1896544 ?     Sl   14:47   0:05  /marian/marian-server -c model/config.yml --port 8092 -w 256

While for 1.7.6 it looks like this

mt          29  0.1  5.6 7636852 904680 ?      Sl   03:48   0:19 /marian/marian-server -c model/config.yml --port 8081  -w 256
mt          41  0.1  7.7 7998440 1243596 ?     Sl   03:48   0:22 /marian/marian-server -c model/config.yml --port 8082  -w 256
mt          53  0.1  5.9 7902308 965664 ?      Sl   03:48   0:19 /marian/marian-server -c model/config.yml --port 8083  -w 256
mt          65  0.1  5.4 7636728 872188 ?      Sl   03:48   0:20 /marian/marian-server -c model/config.yml --port 8084  -w 256
mt          77  0.1  4.8 7759832 775840 ?      Sl   03:48   0:20 /marian/marian-server -c model/config.yml --port 8085  -w 256
mt          89  0.1  9.5 7906960 1538212 ?     Sl   03:48   0:19 /marian/marian-server -c model/config.yml --port 8086  -w 256
mt         101  0.1  5.1 7629504 838252 ?      Sl   03:48   0:20 /marian/marian-server -c model/config.yml --port 8087  -w 256
mt         113  0.1  5.7 7772152 932028 ?      Sl   03:48   0:19 /marian/marian-server -c model/config.yml --port 8088  -w 256
mt         127  0.1  5.5 7651396 899736 ?      Sl   03:48   0:21 /marian/marian-server -c model/config.yml --port 8089  -w 256
mt         139  0.1  5.5 7632304 898152 ?      Sl   03:48   0:22 /marian/marian-server -c model/config.yml --port 8090  -w 256
mt         153  0.1  5.6 7709644 916260 ?      Sl   03:48   0:19 /marian/marian-server -c model/config.yml --port 8091  -w 256
mt         165  0.1  5.5 7607724 892860 ?      Sl   03:48   0:20 /marian/marian-server -c model/config.yml --port 8092  -w 256

Notice that for 1.9.0 RSS ranges between 1.2GB and 2GB while for 1.7.6 it ranges between 0.9GB and 1.2GB
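
The ranges quoted above can be checked directly from the listings. The following is a minimal sketch, assuming standard Linux `ps aux` column order (RSS is the sixth field, reported in KiB); the helper names `marian_rss_kib` and `summarize_gib` are made up for illustration and are not part of Marian.

```python
def marian_rss_kib(ps_lines):
    """Collect RSS values (KiB) for marian-server rows of `ps aux` output."""
    values = []
    for line in ps_lines:
        fields = line.split()
        # ps aux columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND...
        if len(fields) < 11 or not fields[5].isdigit():
            continue  # header, blank, or malformed line
        if "marian-server" not in fields[10]:
            continue  # e.g. the `grep marian-server` process itself
        values.append(int(fields[5]))
    return values


def summarize_gib(values_kib):
    """Return count, min, max and total RSS in GiB."""
    gib = [v / (1024 * 1024) for v in values_kib]
    return {"count": len(gib), "min": min(gib), "max": max(gib), "total": sum(gib)}


# Two rows copied from the 1.9.0 listing; paste the full output to cover all instances.
sample = [
    "mt          17  0.1  7.0 8739420 2281416 ?     Sl   14:47   0:06  /marian/marian-server -c model/config.yml --port 8080 -w 256",
    "mt          29  0.1  3.7 7609444 1221032 ?     Sl   14:47   0:03  /marian/marian-server -c model/config.yml --port 8081 -w 256",
]
print(summarize_gib(marian_rss_kib(sample)))
```

Running the same summary over both full listings gives a like-for-like comparison of the per-instance minimum, maximum and total RSS.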

Both versions are compiled on identical systems against CUDA 10.1 with MKL and CPU decoding enabled. The instances in the ps output, however, have cpu-threads set to 0. Is there a reason for the increased memory usage? Could it be decreased again?

emjotde commented 4 years ago

Will take a look. In our production code we see no increase (we are actually monitoring that), but the initialization is a bit different there. If that is indeed the case this might be easy to fix and I have a hunch.

emjotde commented 4 years ago

What's the size and type of your model?

frzme commented 4 years ago

It's a "nematus" type RNN model with ~90k vocabulary size. The model file is ~600MB. I compiled 1.9.0 with a more recent version of Intel MKL, could that make a difference?

emjotde commented 4 years ago

You are using that model on the GPU, right?

frzme commented 4 years ago

Yes. The GPU is also used and the process shows up in nvidia-smi

emjotde commented 4 years ago

Confirmed. I see it, too. Investigating.

emjotde commented 4 years ago

All hail to git bisect :)

@frankseide This is caused by initialization of the cuSparse handle (which is ridiculous) here: https://github.com/marian-nmt/marian/blob/master/src/tensors/gpu/backend.h#L33 What do you think about doing a lazy init for all the handles? So it gets initialized on first usage when using all things factored.

emjotde commented 4 years ago

@frzme Can you just comment out the two lines that mention cusparseCreate/Destroy in that file and check?

frankseide commented 4 years ago

I agree with lazy initialization.
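
For readers unfamiliar with the pattern, lazy initialization of a handle can be sketched as follows. This is an illustration of the general idea only, not the code that went into Marian (which is C++); `LazyHandle`, `create_handle` and `destroy_handle` are hypothetical names, and `create_handle` stands in for an expensive call such as cusparseCreate.

```python
import threading


class LazyHandle:
    """Create an expensive handle on first use instead of at startup."""

    def __init__(self, create, destroy):
        self._create = create
        self._destroy = destroy
        self._handle = None
        self._lock = threading.Lock()

    def get(self):
        # Double-checked locking: only the first caller pays the creation cost.
        if self._handle is None:
            with self._lock:
                if self._handle is None:
                    self._handle = self._create()
        return self._handle

    def close(self):
        # Destroy the handle only if it was ever created.
        with self._lock:
            if self._handle is not None:
                self._destroy(self._handle)
                self._handle = None


events = []


def create_handle():
    # Hypothetical stand-in for an expensive library initialization.
    events.append("create")
    return object()


def destroy_handle(handle):
    events.append("destroy")


sparse = LazyHandle(create_handle, destroy_handle)
print(events)   # nothing has been created yet
sparse.get()
sparse.get()    # second call reuses the existing handle
sparse.close()
print(events)
```

With this shape, a process that never touches the sparse code path never creates the handle, so it does not pay the associated memory cost.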

emjotde commented 4 years ago

@frzme you can now try the master branch from https://github.com/marian-nmt/marian-dev

frzme commented 4 years ago

Looks very good!

Marian 1.9.1 (https://github.com/marian-nmt/marian-dev/commit/adba021a5e6fee65870d16eae9d88319b07fa9bb)

ps aux | grep marian-server && free -h
mt          16  0.8  5.2 7572348 853528 ?      Sl   10:43   0:03 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8080
mt          28  1.0  5.5 7801620 893924 ?      Sl   10:43   0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8081
mt          40  0.9  5.9 7633008 955420 ?      Sl   10:43   0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8082
mt          52  0.8  5.4 7663456 885668 ?      Sl   10:43   0:03 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8083
mt          64  1.0  8.4 7608100 1363004 ?     Sl   10:43   0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8084
mt          76  1.1  5.8 7537940 935136 ?      Sl   10:43   0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8085
mt          88  1.2  7.2 7676028 1170376 ?     Sl   10:43   0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8086
mt         100  1.0  5.6 7446772 910324 ?      Sl   10:43   0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8087
mt         114  1.0  6.2 7280376 1003104 ?     Sl   10:43   0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8088
mt         126  1.0  6.7 7329728 1095392 ?     Sl   10:43   0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8089
mt         140  1.0  7.5 7432484 1213360 ?     Sl   10:43   0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8090
mt         152  1.0  6.4 7345968 1046788 ?     Sl   10:43   0:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --port 8091
mt         471  0.0  0.0  13216  1048 pts/0    S+   10:50   0:00 grep marian-server
              total        used        free      shared  buff/cache   available
Mem:            15G         12G        238M        296M        2.3G         10G
Swap:            0B          0B          0B

Marian 1.7.6 (https://github.com/marian-nmt/marian/commit/02f4af4eeefa79a24cd52d279a5d4d374423d631)

ps aux | grep marian-server && free -h
mt          17  1.1  5.8 7765224 945228 ?      Sl   03:46   4:38 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8080
mt          29  1.3  6.2 8002348 1004228 ?     Sl   03:46   5:35 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8081
mt          41  1.2  5.9 7856064 960036 ?      Sl   03:46   5:10 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8082
mt          53  1.2  6.2 7887100 1012904 ?     Sl   03:46   5:15 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8083
mt          65  1.2 10.8 7804492 1751732 ?     Sl   03:46   5:05 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8084
mt          77  0.4  6.9 7735904 1120588 ?     Sl   03:46   1:41 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8085
mt          89  0.5  5.2 7871276 841908 ?      Sl   03:46   2:04 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8086
mt         101  0.4  5.7 7644236 924792 ?      Sl   03:46   1:40 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8087
mt         115  0.5  5.4 7562716 876304 ?      Sl   03:46   2:06 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8088
mt         127  0.4  5.2 7530616 844868 ?      Sl   03:46   1:58 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8089
mt         139  0.5  5.4 7626556 877988 ?      Sl   03:46   2:09 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8090
mt         151  0.3  5.1 7578144 823292 ?      Sl   03:46   1:20 /marian/marian-server -c model/config.yml -w 256 --maxi-batch 2 --mini-batch 8 --maxi-batch-sort src --log-level off --port 8091
              total        used        free      shared  buff/cache   available
Mem:            15G         12G        1.0G        296M        1.5G        7.1G
Swap:            0B          0B          0B

Note: I don't think we can conclude that Marian 1.9.1 uses significantly less memory than 1.7.6, but it likely does not use more! Can this change be brought to the stable repo?

Since you were already looking into this: Why is "available" memory so much higher than Free+Cache? Is it caused by memory mapped files?

emjotde commented 4 years ago

Great. Are you OK with using marian-dev for a while? I want to wait a bit to see if other people report more problems before I do an official release with this fix and potential others. If no one complains for, say, a week, I can do a release for 1.9.1.

As for cache, CUDA is doing something weird here, it's not caused by Marian. I never had any actual consequences from that, so I treat it as fake. When for instance you use the CPU-only version that effect is gone.

emjotde commented 4 years ago

BTW, @frzme If you use Marian in production, you might want to add your company logo to https://marian-nmt.github.io (bottom)

Corresponding issue: https://github.com/marian-nmt/marian/issues/230

frzme commented 4 years ago

I think I can make using marian-dev work (for a while). I've requested an approval but am rather confident that it will be possible.

We have discussed the logo issue internally, but unfortunately there seem to be reasons above my level of influence that are preventing it from happening (for now?), sorry :(

emjotde commented 4 years ago

I will keep the issue open until I update master here.

Logo, sure thing. I know about big companies :)

patrickhuy commented 3 years ago

Hi, I just tried the 1.10 release and unfortunately it seems like memory requirements have again gone up (compared to 1.9.1). Is that expected? If not, do you have a suggestion on what I could do to pinpoint the issue?

emjotde commented 3 years ago

Ah, I was messing around with that code recently. Will take a look, might very well be the same problem. V1.11 should drop this or next week. Will try to include a fix.

emjotde commented 3 years ago

@frzme which commit exactly were you using until now?

patrickhuy commented 3 years ago

@emjotde I changed my github handle in the meantime (I'm the issue creator). I've been using marian-dev 1.9.1 (adba021a5e6fee65870d16eae9d88319b07fa9bb) https://github.com/marian-nmt/marian-dev/commit/adba021a5e6fee65870d16eae9d88319b07fa9bb

When upgrading to 1.10 I also upgraded a lot of other components, so I'm not sure if that could have also made a difference (if it makes sense to try something let me know!):

Ubuntu 18.04 -> Ubuntu 20.04
CUDA 10.2 -> 11.2
boost 1.65 -> 1.71
intel mkl 2019.1-053 -> 2020.0-088

I also noticed that the 1.10 binary is almost twice the size of the 1.9.1 one (because of more GPU support?) but I don't think that should cause much higher memory utilisation (?)

emjotde commented 3 years ago

It might actually, the binary still has to go into RAM. You can switch off specific GPU types like -DCOMPILE_CUDA_SM80=off, this will soon be renamed to -DCOMPILE_AMPERE=off (v1.11.0). You can also use -DCMAKE_BUILD_TYPE=Slim to get rid of debug symbols etc.

I will check against that revision in the meantime.

patrickhuy commented 3 years ago

I think the actual binary should be shared between multiple running instances, and experiments didn't show benefits in memory usage from switching to build type "Slim". I tried switching off sentencepiece (which became on by default) and all unused CUDA targets. -DCMAKE_BUILD_TYPE=Slim decreases the binary size significantly (down to ~163Mi from ~600Mi at Release configuration). However, neither of these changes brought a major improvement in memory usage. We are running ~13 instances of marian-server on a single GPU-enabled node with 16GB of RAM.

Comparing memory usage between 1.9.1 and 1.10 shows that each marian-server instance requires 200~400Mi more "RSS"

1.9.1 (CUDA 10.2)

ps aux | grep marian-server && free -h
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mt            15  0.3  3.2 6182316 516056 ?      Sl   12:59   0:01 /marian/marian-server
mt            27  0.3  3.2 6251644 520308 ?      Sl   12:59   0:01 /marian/marian-server
mt            39  0.3  3.1 6138676 501324 ?      Sl   12:59   0:01 /marian/marian-server
mt            51  0.3  3.0 6110784 485732 ?      Sl   12:59   0:01 /marian/marian-server
mt            65  0.5  4.6 6654728 740688 ?      Sl   12:59   0:02 /marian/marian-server
mt            79  0.3  2.8 6066880 459428 ?      Sl   12:59   0:01 /marian/marian-server
mt            93  0.5  5.4 6666000 877796 ?      Sl   12:59   0:02 /marian/marian-server
mt           108  0.5  4.7 6240916 766528 ?      Sl   12:59   0:02 /marian/marian-server
mt           120  0.5  5.3 6571724 856388 ?      Sl   12:59   0:02 /marian/marian-server
mt           134  0.4  4.3 6151448 704788 ?      Sl   12:59   0:01 /marian/marian-server
mt           146  0.4  5.3 6224868 863296 ?      Sl   12:59   0:02 /marian/marian-server
mt           161  0.4  4.5 6088536 736992 ?      Sl   12:59   0:01 /marian/marian-server
mt           173  0.8  8.7 7423636 1410408 ?     Sl   12:59   0:03 /marian/marian-server
mt           352  0.0  0.0  13216  1112 pts/0    S+   13:06   0:00 grep marian-server
              total        used        free      shared  buff/cache   available
Mem:            15G         11G        181M        131M        3.8G        8.8G
Swap:            0B          0B          0B

1.10 (Release) (CUDA 11.2)

ps aux | grep marian-server && free -h
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mt            16  0.3  5.2 6410792 840620 ?      Sl   12:58   0:01 /marian/marian-server
mt            28  0.5  6.2 6987576 1009812 ?     Sl   12:58   0:02 /marian/marian-server
mt            40  0.5  5.9 6914916 965648 ?      Sl   12:58   0:02 /marian/marian-server
mt            53  0.5  6.0 6908508 966444 ?      Sl   12:58   0:02 /marian/marian-server
mt            66  0.3  5.6 6473144 902716 ?      Sl   12:58   0:01 /marian/marian-server
mt            82  0.5  5.8 6859112 945136 ?      Sl   12:58   0:02 /marian/marian-server
mt            96  0.4  5.6 6483392 913672 ?      Sl   12:58   0:01 /marian/marian-server
mt           108  0.4  5.5 6467912 896480 ?      Sl   12:58   0:01 /marian/marian-server
mt           120  0.3  5.3 6422920 854368 ?      Sl   12:58   0:01 /marian/marian-server
mt           132  0.3  5.0 6380352 805496 ?      Sl   12:58   0:01 /marian/marian-server
mt           149  0.6  6.0 6960196 979292 ?      Sl   12:58   0:02 /marian/marian-server
mt           162  0.3  4.6 6317456 744600 ?      Sl   12:58   0:01 /marian/marian-server
mt           175  0.6  8.5 6943300 1372828 ?     Sl   12:58   0:02 /marian/marian-server
mt           356  0.0  0.0   5192   724 pts/0    S+   13:05   0:00 grep marian-server
              total        used        free      shared  buff/cache   available
Mem:           15Gi        12Gi       193Mi       131Mi       2.2Gi       2.1Gi
Swap:            0B          0B          0B

1.10 (Slim) (CUDA 11.2)

ps aux | grep marian-server && free -h
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mt            16  1.5  5.6 6481732 912608 ?      Sl   13:12   0:01 /marian/marian-server
mt            28  1.7  5.6 6477692 909312 ?      Sl   13:12   0:01 /marian/marian-server
mt            40  1.8  4.9 6368688 804292 ?      Sl   13:12   0:01 /marian/marian-server
mt            54  1.8  4.7 6337260 769160 ?      Sl   13:12   0:01 /marian/marian-server
mt            68  2.3  5.6 6471172 903512 ?      Sl   13:12   0:01 /marian/marian-server
mt            80  1.7  4.5 6294392 730320 ?      Sl   13:12   0:01 /marian/marian-server
mt            94  2.3  5.6 6481420 911684 ?      Sl   13:12   0:01 /marian/marian-server
mt           107  2.4  5.5 6465936 897852 ?      Sl   13:12   0:01 /marian/marian-server
mt           121  2.1  5.3 6420944 856220 ?      Sl   13:12   0:01 /marian/marian-server
mt           135  2.1  5.0 6377924 813952 ?      Sl   13:12   0:01 /marian/marian-server
mt           150  2.5  5.4 6450768 875932 ?      Sl   13:12   0:02 /marian/marian-server
mt           162  1.9  4.6 6315028 742584 ?      Sl   13:12   0:01 /marian/marian-server
mt           174  3.6  8.5 6940920 1369900 ?     Sl   13:12   0:02 /marian/marian-server
mt           320  0.0  0.0   5192   736 pts/0    S+   13:14   0:00 grep marian-server
              total        used        free      shared  buff/cache   available
Mem:           15Gi        13Gi       173Mi       131Mi       1.7Gi       1.5Gi
Swap:            0B          0B          0B

I will try to see if switching to CUDA 10.2 makes a difference

patrickhuy commented 3 years ago

I tried again with CUDA 10.2 (so only upgrading Marian and not upgrading "everything") and could NOT reproduce the issue anymore. With Marian 1.10 on CUDA 10.2 I also have 8.5G available on that machine in this setup. I'll try downgrading mkl on the cuda 11.2 setup but I suspect that it's actually caused by the different cuda version. Does Marian do anything differently for CUDA 11 or might this just be CUDA 11 requiring a higher amount of memory?

emjotde commented 3 years ago

Good info. The only thing I can think of would be the switch to the newer cuSPARSE code, which was involved last time, but init is still lazy and should not be called if you don't use it. Standard models don't. Until I have a detailed look, the CUDA 11 theory might be the most probable one.

emjotde commented 3 years ago

After fighting CUDA 11 to compile with 1.9.1, it seems it is indeed that. I see about 10 MB difference between 1.9.1 and 1.10.0 with CUDA 11. For both versions I see a drop of about 40 MB when going back to CUDA 10.2.

patrickhuy commented 3 years ago

Thank you for looking into it! How big was the model you tested this with? I wonder why you are seeing a 40MB difference while I am getting a ~400MB difference.

It strongly looks like this new phenomenon is not a Marian issue but instead a CUDA thing/issue (?)

emjotde commented 3 years ago

Are you getting 400 MB per process?

emjotde commented 3 years ago

Ah yeah, I see it above. Hm. Can you share model configs and server settings?

patrickhuy commented 3 years ago

I hope this is what you are looking for:

model config: https://gist.github.com/patrickhuy/a5e86535debced6b390decb9bd405096
inference/server config: https://gist.github.com/patrickhuy/b164ba4cfb2848f50ea82a66803a1376

Marian is built with

cmake .. -DCOMPILE_SERVER=on -DCMAKE_BUILD_TYPE=Slim -DUSE_SENTENCEPIECE=false -DBUILD_ARCH=westmere -DCOMPILE_CUDA_SM35=false -DCOMPILE_CUDA_SM50=false -DCOMPILE_CUDA_SM60=false -DCOMPILE_CUDA_SM80=false -DINTRINSICS="-mtune=cascadelake -msse2 -msse3 -msse4.1 -msse4.2"

model.npz.best-bleu.npz is ~300MB (if this makes a difference).

There is a PyTorch issue about CUDA allocating a lot of memory to load kernels: https://github.com/pytorch/pytorch/issues/12873 I wonder if this is related and if something changed there in CUDA 11. I also wonder if it's actually possible to influence this behavior.

Note: I don't really understand how the "Available" memory number is calculated (as it's higher than free+cache for cuda 10), but it seems that the memory is actually usable.
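
To put a number on that observation, the `Mem:` rows from the `free -h` outputs in this thread can be compared mechanically. The sketch below is only an illustration: the function names are invented, and it assumes that both suffix styles seen here (`G` and `Gi`) denote powers of 1024, which is how procps `free -h` behaves when `--si` is not given.

```python
import re

# Multipliers to MiB, assuming 1024-based units for both "G" and "Gi" styles.
_TO_MIB = {"B": 1 / (1024 * 1024), "K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 * 1024}


def to_mib(text):
    """Convert a `free -h` value such as '238M', '2.2Gi' or '0B' to MiB."""
    match = re.fullmatch(r"([0-9]+(?:\.[0-9]+)?)([BKMGT])i?B?", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * _TO_MIB[match.group(2)]


def parse_mem_row(line):
    """Parse the 'Mem:' row of `free -h` into a dict of MiB values."""
    _label, *cols = line.split()
    names = ["total", "used", "free", "shared", "buff_cache", "available"]
    return dict(zip(names, (to_mib(c) for c in cols)))


def available_minus_free_and_cache(row):
    """How much 'available' exceeds free + buff/cache, in MiB (negative if lower)."""
    return row["available"] - (row["free"] + row["buff_cache"])


# Rows copied from the 1.9.1 (CUDA 10.2) and 1.10 Release (CUDA 11.2) outputs above.
cuda10 = parse_mem_row("Mem:            15G         11G        181M        131M        3.8G        8.8G")
cuda11 = parse_mem_row("Mem:           15Gi        12Gi       193Mi       131Mi       2.2Gi       2.1Gi")
print(round(available_minus_free_and_cache(cuda10)))
print(round(available_minus_free_and_cache(cuda11)))
```

A positive result means "available" is larger than free plus buff/cache, which is the case for the CUDA 10.2 row; for the CUDA 11.2 row the result is negative. Note that the human-readable values are rounded, so the differences are approximate.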
