k2-fsa / snowfall

Moved to https://github.com/k2-fsa/icefall
Apache License 2.0

Debug possible memory leaks #75

Open danpovey opened 3 years ago

danpovey commented 3 years ago

We need to debug possible memory leaks in k2, both in master and in the arc_scores branch from my repo (see a PR on k2 master). (I'm pretty sure the arc_scores branch leaks; I'm not 100% sure about master.)

(1) Monitor memory usage from nvidia-smi and/or by adding print statements to the training script, and verify whether memory usage increases over time (see the sketch after this list).

(2) Look for available diagnostics from torch; I believe it can print out info about allocated blocks.

or

(3) [somewhat an alternative to (2)] Add print statements to the constructor and destructor (or their equivalents) of the Fsa object to check that the same number are constructed and destroyed on each iteration. If those numbers differ, add similar print statements to the Function objects in autograd.py and see which of them are not destroyed. I suspect it's a version of a previous issue where we had reference cycles between Fsa objects and the Function objects used in backprop.
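
A minimal sketch of (1) and (2) combined, logging the allocator's view of GPU memory every few batches; train_loader, model, compute_loss, and optimizer are placeholders here, not snowfall code:

import torch

prev_allocated = 0
for batch_idx, batch in enumerate(train_loader):
    loss = compute_loss(model, batch)  # placeholder training step
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if batch_idx % 10 == 0:
        # Bytes currently held by tensors on GPU 0.
        allocated = torch.cuda.memory_allocated(0)
        # A delta that stays positive across many iterations suggests a leak.
        print(f'batch {batch_idx}: {allocated / 1024:.0f} KB allocated, '
              f'delta {(allocated - prev_allocated) / 1024:+.0f} KB')
        prev_allocated = allocated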

danpovey commented 3 years ago

BTW, these leaks may actually be in the decoding script. We should probably check both train and decode, i.e., that memory usage doesn't systematically increase with iteration.

csukuangfj commented 3 years ago

I am running mmi_bigram_train.py. There is no OOM after processing 1850 batches. nvidia-smi shows that about 19785 MB of GPU memory is in use.

danpovey commented 3 years ago

The key thing is whether it systematically increases over time.

torch.cuda.get_device_properties(0).total_memory may help.

csukuangfj commented 3 years ago

torch.cuda.get_device_properties(0).total_memory returns a property of the device, i.e., its total capacity, which is a constant.


From https://pytorch.org/docs/stable/cuda.html#torch.cuda.memory_allocated, I am using

torch.cuda.memory_allocated(0) / 1024.  # KB

The result is shown below. The allocated memory increases monotonically, by about 500 KB every 10 batches.


The arc_scores branch is used.

[Screenshot: plot of allocated memory (KB) increasing with batch index, 2021-01-17]

csukuangfj commented 3 years ago

The figure matches the "Allocated memory" entry in the output of torch.cuda.memory_summary(0).

danpovey commented 3 years ago

OK. Try adding print statements to the Fsa object's init/destructor (there is a way, though I forget the details) to see whether the objects are properly released.

csukuangfj commented 3 years ago
def __del__(self):
    print('inside Fsa destructor')

I think the above destructor will work.
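
A fuller, hypothetical version of the same idea counts live instances instead of just printing; the monkey-patching approach and the _live_fsa_count name are illustrative assumptions, not k2 API (this assumes k2.Fsa is a plain Python class):

import k2

_live_fsa_count = 0
_original_init = k2.Fsa.__init__

def _counting_init(self, *args, **kwargs):
    # Count every construction.
    global _live_fsa_count
    _live_fsa_count += 1
    _original_init(self, *args, **kwargs)

def _counting_del(self):
    # Count every destruction.
    global _live_fsa_count
    _live_fsa_count -= 1

k2.Fsa.__init__ = _counting_init
k2.Fsa.__del__ = _counting_del

Printing _live_fsa_count once per iteration should then give a flat count; steady growth means Fsa objects survive across iterations.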

csukuangfj commented 3 years ago

I confirm that den and num are freed, since their destructors are called. I printed id(self) in the destructor and the output matches id(den).

danpovey commented 3 years ago

Look at torch.cuda.memory_summary to see how many memory regions are allocated; that might give us a hint.

danpovey commented 3 years ago

... I mean the delta per minibatch.
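
One way to get such per-minibatch deltas programmatically is torch.cuda.memory_stats(), the dictionary behind memory_summary(); a sketch, assuming a reasonably recent PyTorch (the stat keys are the standard ones) and with run_one_minibatch as a placeholder:

import torch

def snapshot(device=0):
    stats = torch.cuda.memory_stats(device)
    return {
        # Bytes currently allocated by tensors.
        'allocated_bytes': stats['allocated_bytes.all.current'],
        # Number of active memory blocks.
        'active_blocks': stats['active.all.current'],
        # Number of segments reserved via cudaMalloc.
        'reserved_segments': stats['segment.all.current'],
    }

before = snapshot()
run_one_minibatch()  # placeholder for one training step
after = snapshot()
for key in before:
    print(f'{key}: delta {after[key] - before[key]:+d}')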

csukuangfj commented 3 years ago

The arc_scores branch is able to train and decode without OOM after 10 epochs:

2021-01-18 14:56:43,771 INFO [mmi_bigram_decode.py:296] %WER 10.45% [5493 / 52576, 801 ins, 487 del, 4205 sub ]

danpovey commented 3 years ago

Great!!

hegc commented 3 years ago

When I train aishell1 with mmi_att_transformer_train.py, the allocated memory increases gradually, and I hit OOM after 2 epochs.

[Screenshot: allocated memory growing over the course of training]

danpovey commented 3 years ago

Which script are you running?

danpovey commented 3 years ago

Also let us know how recent your k2, lhotse, and snowfall versions are; e.g., the date of the last commit or the k2 release number would help. At some point we had some memory leaks in k2.
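
For pip-installed packages, one way to report versions (importlib.metadata needs Python 3.8+; for a source checkout, git log -1 gives the last-commit date instead):

from importlib.metadata import version

for pkg in ('k2', 'lhotse'):
    # Only works if the package was installed via pip/setup.py.
    print(pkg, version(pkg))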

hegc commented 3 years ago

I updated the three projects last week.

hegc commented 3 years ago

And I modified mmi_att_transformer_train.py to fit aishell1.

danpovey commented 3 years ago

Please show your changes via a PR.

hegc commented 3 years ago

The PR is #114.