Fixes issues with fine tuning on GPU's

nickwalton commented 5 years ago

I previously had some issues with training on GPUs (#32). This fixes those and other issues so that training on GPUs works. Not sure if you want to merge it in, but I figured I'd put it up in case anyone else has fine-tuning issues.

salesforce-cla[bot] commented 5 years ago

Thanks for the contribution! Before we can merge this, we need @nickwalton to sign the Salesforce.com Contributor License Agreement.

pgrandinetti commented 4 years ago

Hi @nickwalton, I'm testing this on Google Cloud with 2 NVIDIA Tesla V100s, 16 vCPUs, and 60GB RAM, but I still get a GPU memory error:

2019-11-28 11:21:32.019196: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 15651900416 memory_limit_: 15651900621 available bytes: 205 curr_region_allocation_bytes_: 31303801344
2019-11-28 11:21:32.019208: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats: 
Limit:                 15651900621
InUse:                 15370949632
MaxInUse:              15373571072
NumAllocs:                    5630
MaxAllocSize:           1262254080

(and the actual error is)

tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Dst tensor is not initialized.
     [[{{node tied_embedding_softmax_1/add/_24}}]]
     [[training/clip_by_global_norm/mul_2/_36]]
  (1) Internal: Dst tensor is not initialized.
     [[{{node tied_embedding_softmax_1/add/_24}}]]

May I know the specs of the environment you used that was successful? Thank you.
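
For anyone reading the allocator stats above, here is a minimal sketch of how the numbers can be compared. The helper name `parse_bfc_stats` is hypothetical and not part of TensorFlow; the values are copied from the log in this comment. "Dst tensor is not initialized" is often reported when GPU memory is exhausted, and the arithmetic below is consistent with that reading, though it does not prove it.

```python
import re

# Stats block copied from the bfc_allocator log in this comment.
STATS = """\
Limit:                 15651900621
InUse:                 15370949632
MaxInUse:              15373571072
NumAllocs:                    5630
MaxAllocSize:           1262254080
"""

def parse_bfc_stats(text):
    """Hypothetical helper: turn 'Name:   123' lines into a dict of ints."""
    pairs = re.findall(r"^(\w+):[ \t]+(\d+)[ \t]*$", text, re.MULTILINE)
    return {name: int(value) for name, value in pairs}

stats = parse_bfc_stats(STATS)

# Bytes still unallocated under the per-GPU memory limit.
headroom = stats["Limit"] - stats["InUse"]

print(f"limit:     {stats['Limit'] / 2**30:.2f} GiB")
print(f"in use:    {stats['InUse'] / 2**30:.2f} GiB")
print(f"headroom:  {headroom / 2**20:.1f} MiB")
print(f"max alloc: {stats['MaxAllocSize'] / 2**20:.1f} MiB")

# If the largest single allocation seen so far is bigger than the
# remaining headroom, another allocation of that size cannot fit.
print("headroom < max alloc:", headroom < stats["MaxAllocSize"])
```

Here the remaining headroom is much smaller than the largest single allocation the allocator has served, so the 16GB card looks essentially full at this point in training.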

nickwalton commented 4 years ago

Hello,

The fix I thought worked didn't actually work as I expected. I eventually switched to using just a single GPU and it worked. I think the issues I had been having were with multiple GPUs.
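
A minimal sketch of one common way to restrict a run to a single GPU, assuming a CUDA-based setup: set the `CUDA_VISIBLE_DEVICES` environment variable before TensorFlow is imported. The helper name `restrict_to_single_gpu` is hypothetical, and this is not necessarily how the training script in this PR was run.

```python
import os

def restrict_to_single_gpu(gpu_index=0):
    """Hypothetical helper: expose only one GPU to CUDA-based frameworks.

    This must run before TensorFlow is imported, because the variable is
    read when CUDA is initialized.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return os.environ["CUDA_VISIBLE_DEVICES"]

restrict_to_single_gpu(0)

# Import TensorFlow only after the variable is set, for example:
# import tensorflow as tf
```

The same effect can be had from the shell by prefixing the training command with `CUDA_VISIBLE_DEVICES=0`. Note that this only hides the second GPU; it does not add memory, so the model still has to fit on one card.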

pgrandinetti commented 4 years ago

I see. May I know which (single) GPU you used and which GPU provider (if any)? I can't find a single 32GB GPU on GCP.

julien-c commented 4 years ago

If AWS is an option, they have p3dn.24xlarge with 8x 32GB V100s.

They are pretty costly though (use Spot instances).

salesforce / ctrl

Fixes issues with fine tuning on GPU's #51