It looks like you trained the big 1558M model with gpt_2_simple. Do you mind sharing your training machine setup?
I've attempted to recreate this with:
AWS
p3.8xlarge /
Deep Learning Base AMI /
tensorflow-training:1.15.0-gpu-py36-cu100-ubuntu18.04 (ecr) /
and gpt_2_simple finetune with the settings from your generator/simple/finetune.py
I get OOM errors either right before training starts or shortly after. All of the GPUs initialize, so that really shouldn't happen with four NVLink'd Tesla V100s.
I'm stuck.
I know you all are busy but any tips would be appreciated. Thanks in advance!
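In case it helps, here is roughly what I'm running (a sketch only — the hyperparameters below are my guesses at your finetune.py settings, and `FINETUNE_KWARGS`/`run_finetune` are just names for this example, not anything from your repo):

```python
# Sketch of my finetune invocation. The hyperparameter values are
# assumptions, not necessarily the exact generator/simple/finetune.py
# settings.
FINETUNE_KWARGS = dict(
    model_name="1558M",                  # the big model
    batch_size=1,                        # smallest possible batch
    use_memory_saving_gradients=True,    # gradient checkpointing
    only_train_transformer_layers=True,  # skip embedding gradients
    multi_gpu=True,                      # try to use all four V100s
)

def run_finetune(dataset_path):
    """Finetune GPT-2 1558M on a plain-text dataset file."""
    import gpt_2_simple as gpt2  # requires TensorFlow 1.15

    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess, dataset_path, **FINETUNE_KWARGS)
```

Even with batch size 1 and memory-saving gradients turned on, I still hit the OOM.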