Closed: nickwalton closed this 5 years ago
Thanks for the contribution! Before we can merge this, we need @nickwalton to sign the Salesforce.com Contributor License Agreement.
Hi @nickwalton, I'm testing this on Google Cloud with 2 NVIDIA Tesla V100 GPUs, 16 vCPUs, and 60 GB RAM, but I still get a GPU memory error:
2019-11-28 11:21:32.019196: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 15651900416 memory_limit_: 15651900621 available bytes: 205 curr_region_allocation_bytes_: 31303801344
2019-11-28 11:21:32.019208: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats:
Limit: 15651900621
InUse: 15370949632
MaxInUse: 15373571072
NumAllocs: 5630
MaxAllocSize: 1262254080
(and the actual error is)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Dst tensor is not initialized.
[[{{node tied_embedding_softmax_1/add/_24}}]]
[[training/clip_by_global_norm/mul_2/_36]]
(1) Internal: Dst tensor is not initialized.
[[{{node tied_embedding_softmax_1/add/_24}}]]
May I know the specs of the environment you used that was successful? Thank you.
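For anyone reading the allocator stats in the log above, here is a minimal sketch of the arithmetic. The numbers are copied from the log; the helper name `to_gib` and the reading of the result are my own assumptions, not something TensorFlow reports directly.

```python
# Values copied from the BFC allocator "Stats" lines in the log above (bytes).
stats = {
    "Limit": 15651900621,
    "InUse": 15370949632,
    "MaxInUse": 15373571072,
    "MaxAllocSize": 1262254080,
}

GIB = 1024 ** 3


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (hypothetical helper for this sketch)."""
    return num_bytes / GIB


# Bytes the allocator could still hand out before reaching its limit.
headroom = stats["Limit"] - stats["InUse"]

for name, value in stats.items():
    print(f"{name:>12}: {to_gib(value):6.2f} GiB")
print(f"{'Headroom':>12}: {to_gib(headroom):6.2f} GiB")

# True when the remaining space is smaller than the largest single
# allocation seen so far, so another tensor of that size would not fit.
headroom_too_small = headroom < stats["MaxAllocSize"]
print("Headroom smaller than largest allocation:", headroom_too_small)
```

The gap between `Limit` and `InUse` is 280950989 bytes (about 268 MiB), which is well below `MaxAllocSize`. That is consistent with the GPU being essentially full when the error appeared, though it does not by itself show which tensor failed to allocate.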
Hello,
The fix I thought worked didn't actually work as I expected. I eventually switched to just using a single GPU and it worked. I think I had been having issues with multiple GPUs.
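One common way to restrict a process to a single GPU is the `CUDA_VISIBLE_DEVICES` environment variable, sketched below. This is an assumption about how one might do it, not necessarily what was done here; the function name `restrict_to_single_gpu` is hypothetical, and the variable has to be set before TensorFlow is imported.

```python
import os


def restrict_to_single_gpu(gpu_index: int = 0) -> str:
    """Expose only one GPU to CUDA libraries loaded later in this process.

    Hypothetical helper: call it before importing TensorFlow.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


visible = restrict_to_single_gpu(0)
print("CUDA_VISIBLE_DEVICES =", visible)

# Import TensorFlow and start training only after this point, so the
# framework sees a single device.
```

The same effect can be had from the shell by setting `CUDA_VISIBLE_DEVICES=0` before launching the training script. Whether one GPU has enough memory for a given model is a separate question.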
I see. May I know which (single) GPU you used, and which GPU provider (if any)? I can't find a single 32 GB GPU on GCP.
If AWS is an option, they have p3dn.24xlarge with 8x 32 GB V100s.
They are pretty costly though (use Spot instances).
I previously had some issues with training on GPUs (#32). This fixes those and other issues to make training on GPUs work. Not sure if you want to merge it in, but I figured I'd put it up in case anyone else has fine-tuning issues.