latitudegames / AIDungeon

Infinite adventures await!
http://www.aidungeon.io/
MIT License

[Q/A] Training 1558M with gpt_2_simple #202

Open hexive opened 4 years ago

hexive commented 4 years ago

🤓 Question

It looks like you trained the big 1558M model with gpt_2_simple. Do you mind sharing your training machine setup?

I've attempted to recreate this with:

- AWS p3.8xlarge
- Deep Learning Base AMI
- tensorflow-training:1.15.0-gpu-py36-cu100-ubuntu18.04 (ECR)
- gpt_2_simple finetune with your generator/simple/finetune.py settings

I get OOM errors either right before training starts or shortly after. The GPUs are all initialized, so that really shouldn't happen with four NVLink'd Tesla V100s.
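For what it's worth, a back-of-envelope estimate suggests the OOM may be expected even on this instance, assuming gpt_2_simple keeps the whole graph on a single GPU and trains in fp32 with Adam (both assumptions on my part, not confirmed from the repo). Each p3.8xlarge V100 has 16 GB, and the aggregate 64 GB only helps if the model is actually sharded across cards:

```python
# Rough fp32/Adam memory estimate for fine-tuning the 1558M model.
# Figures are back-of-envelope; activations are not even counted here.
params = 1_558_000_000          # parameter count of GPT-2 1558M
bytes_per_param = 4             # fp32

weights_gb = params * bytes_per_param / 1e9   # model weights
grads_gb = weights_gb                         # one gradient per parameter
adam_state_gb = 2 * weights_gb                # Adam keeps two moments (m and v)
total_gb = weights_gb + grads_gb + adam_state_gb

print(f"weights: {weights_gb:.1f} GB, total with optimizer state: {total_gb:.1f} GB")
# ~6.2 GB of weights, ~24.9 GB total: already more than one 16 GB V100.
```

If that's right, the fix isn't more GPUs on the same box but either gradient checkpointing / memory-saving options, a single GPU with more memory, or whatever setup the authors actually used.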

I'm stuck.

I know you all are busy but any tips would be appreciated. Thanks in advance!

louisgv commented 4 years ago

Yeah, it's wild that the creator was able to fine-tune the giant model. I'm interested to know too.