mattiadg / FBK-Fairseq-ST

An adaptation of Fairseq to (End-to-end) speech translation.

RuntimeError: CUDA out of memory after training 1 epoch #8

Open balag59 opened 4 years ago

balag59 commented 4 years ago

@mattiadg I'm currently training on a very large dataset with 4 GPUs, and I get a CUDA out-of-memory error after the first training epoch completes. Training runs to the end of the epoch, but when validation starts it runs out of memory. Here is the exact message: Tried to allocate 7.93 GiB (GPU 2; 22.38 GiB total capacity; 11.55 GiB already allocated; 3.53 GiB free; 6.75 GiB cached). Is this a memory leak? Is there an issue with emptying the cache, or do I just need to reduce the batch size / max tokens? (I already tried halving the batch size, and the same error occurs.) Thanks!

mattiadg commented 4 years ago

Hi, try using the validation set for both training and validation. Do you get the same error during training that way?

balag59 commented 4 years ago

@mattiadg I haven't tried using the same set yet. Training runs fine through the entire epoch; the issue only begins once the first epoch completes and validation starts, which leads me to believe that memory is not being released correctly (though I'm not sure).
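
To see whether anything actually survives the epoch, I can log allocated vs. cached memory around the train/validation boundary. A minimal sketch (instrumentation I would add around the validation call myself, using PyTorch's standard CUDA memory API; it is not something FBK-Fairseq-ST provides):

```python
import torch

def report_gpu_memory(tag):
    """Print per-GPU allocated vs. cached memory to spot what survives an epoch."""
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 1024 ** 3
        cached = torch.cuda.memory_cached(i) / 1024 ** 3  # memory_reserved() on newer PyTorch
        print(f"[{tag}] GPU {i}: {alloc:.2f} GiB allocated, {cached:.2f} GiB cached")

report_gpu_memory("end of training epoch")
torch.cuda.empty_cache()  # returns cached blocks to the driver; helps fragmentation, not true leaks
report_gpu_memory("before validation")
```

If "allocated" stays high after the epoch ends, something is really holding references; if only "cached" is high, it is just the allocator's cache, and empty_cache() would reclaim it.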

mattiadg commented 4 years ago

I think it is possible that the validation set contains samples that are too large. Let's rule out the easy explanations before suspecting a memory leak, which is much harder to track down. I have trained on datasets with a few million samples and never had this problem.
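
You can check this directly by looking for outliers in the validation audio. A rough sketch, assuming the validation split is a directory of wav files (the path, the threshold, and the use of raw sample counts as a stand-in for feature frames are all placeholders to adapt to your setup):

```python
import glob

import soundfile as sf  # pip install soundfile

VALID_DIR = "data/valid"   # placeholder: wherever your validation audio lives
MAX_FRAMES = 6000          # placeholder: compare against your --max-tokens budget

lengths = {path: sf.info(path).frames for path in glob.glob(f"{VALID_DIR}/*.wav")}
too_long = {p: n for p, n in lengths.items() if n > MAX_FRAMES}

print(f"{len(too_long)}/{len(lengths)} validation segments exceed {MAX_FRAMES} frames")
for path, n_frames in sorted(too_long.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{n_frames:>10d}  {path}")
```

A handful of very long segments would explain an OOM that appears only at validation time.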

balag59 commented 4 years ago

Thanks! I restored the batch size to 512 but reduced max-tokens from 12k to 6k, and it seems to be working fine now. How does the max-tokens parameter affect time to convergence or performance (if it affects them at all)?

mattiadg commented 4 years ago

--max-tokens sets a maximum length for the (source) segments; those longer than that value are removed from the sets. A lower value means fewer and shorter samples, so it speeds up an epoch a bit. I have never noticed significant differences in convergence.

balag59 commented 4 years ago

Thank you so much! This helps!

balag59 commented 4 years ago

I'm sorry, but doesn't max-tokens stand for the maximum number of audio frames that can be loaded onto a single GPU in each iteration? I thought it did.

mattiadg commented 4 years ago

Oops, yes, my mistake. Sorry, I haven't used this code in a while. The problem is that it can load more segments than are actually used in a single iteration: if the frame budget admits more segments than --max-sentences allows, the extra ones just sit in GPU memory, so a --max-tokens that is too high simply wastes memory.
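
To make the interplay concrete, here is a simplified sketch of fairseq-style batching (an illustration, not the actual FBK-Fairseq-ST code): samples accumulate until either the frame budget (--max-tokens) or the sentence cap (--max-sentences) is hit, and since a batch is padded to its longest member, a generous --max-tokens lets one long segment inflate the whole padded tensor.

```python
def make_batches(lengths, max_tokens, max_sentences):
    """Group sample lengths (in frames) into batches under both caps.

    Padding means a batch costs len(batch) * max(batch) frames of memory,
    which is what actually has to fit on the GPU.
    """
    batches, batch = [], []
    for n in lengths:
        # adding a sample raises the padded cost to (len(batch) + 1) * max(longest, n)
        padded = (len(batch) + 1) * max(batch + [n])
        if batch and (padded > max_tokens or len(batch) >= max_sentences):
            batches.append(batch)
            batch = []
        batch.append(n)
    if batch:
        batches.append(batch)
    return batches

# e.g. one 5000-frame utterance among short ones dominates the padded cost
for b in make_batches([800, 900, 5000, 700, 600], max_tokens=6000, max_sentences=512):
    print(b, "-> padded cost:", len(b) * max(b))
```

So with your settings, it is the frame budget, not the sentence cap, that bounds the padded batch tensor.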

balag59 commented 4 years ago

Thanks, that makes sense! Speaking of the code, is there any chance you will release the code from your latest paper, which adds improvements like knowledge distillation?

balag59 commented 4 years ago

@mattiadg Any updates on the possibility of releasing code from the latest paper?