It looks like you trained the big 1558M model with gpt_2_simple. Do you mind sharing your training machine setup?
I've attempted to recreate this with:
AWS
p3.8xlarge /
Deep Learning Base AMI /
tensorflow-training:1.15.0-gpu-py36-cu100-ubuntu18.04 (ecr) /
and gpt_2_simple finetune with the settings from your generator/simple/finetune.py
I get OOM errors either right before training starts or shortly after. All of the GPUs initialize, so that really shouldn't happen with four NVLink'd Tesla V100s.
I'm stuck.
I know you all are busy but any tips would be appreciated. Thanks in advance!
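In case it helps, here is roughly what I'm running (a sketch only — the hyperparameters below are my guesses at your finetune.py settings, and `FINETUNE_KWARGS`/`run_finetune` are just names for this example, not anything from your repo):

```python
# Sketch of my finetune invocation. The hyperparameter values are
# assumptions, not necessarily the exact generator/simple/finetune.py
# settings.
FINETUNE_KWARGS = dict(
    model_name="1558M",                  # the big model
    batch_size=1,                        # smallest possible batch
    use_memory_saving_gradients=True,    # gradient checkpointing
    only_train_transformer_layers=True,  # skip embedding gradients
    multi_gpu=True,                      # try to use all four V100s
)

def run_finetune(dataset_path):
    """Finetune GPT-2 1558M on a plain-text dataset file."""
    import gpt_2_simple as gpt2  # requires TensorFlow 1.15

    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess, dataset_path, **FINETUNE_KWARGS)
```

Even with batch size 1 and memory-saving gradients turned on, I still hit the OOM.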