shawwn / gpt-2

Code for the paper "Language Models are Unsupervised Multitask Learners"
MIT License

Saving to storage bucket #10

Open TheClassyPenguin opened 4 years ago

TheClassyPenguin commented 4 years ago

Hi,

I'm training the 1558M model on GCP. I see that your notebook mentions a parameter named --storage_bucket:

 # Note that there is currently no support for saving the trained model.
 # Theoretically it might work, but you'll have to create your own storage bucket and pass in --storage_bucket gs://your-bucket/gpt-2/

I need this feature and I'm in a position to test it, but I see that the code for it is not in train.py. Is it something you implemented but ultimately decided not to include because you couldn't test it yourself?

Let me know if you have the code!

Norod commented 4 years ago

@shawwn I also would love to know how to get --storage_bucket in and working

JadynHax commented 4 years ago

@Norod @TheClassyPenguin: I've looked into this a bit, and it seems this is not something the code itself imposes; it is due to a limitation of using a TPU on Colaboratory. For some reason, Colaboratory requires you to store models in a Google Cloud Storage bucket (see here for info on creating one) in TPU runtimes.

It appears that someone may have found a way around it, though I haven't had time to confirm this myself. Placing it here for completeness only.

Hope that was helpful in explaining what --storage_bucket is for!

Norod commented 3 years ago

@JaonHax Thank you for the reply, but I'm afraid there has been a misunderstanding. I know what '--storage_bucket' is for, and that is why I'd love to have a version where it is implemented in the code. Currently '--storage_bucket' is only mentioned in a comment. Other than that, I have not encountered any issues with storing the checkpoints in Colab's local storage, so from time to time I just need to remember to stop the execution and manually copy the most recent checkpoint into a bucket (so as not to lose progress when the runtime resets). Having '--storage_bucket' support would make it all more convenient.
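
For anyone who wants to script the manual copy step described above, here is a minimal sketch in Python. It is not the repository's '--storage_bucket' implementation. It assumes checkpoints are written as TensorFlow-style files named model-<step>.<suffix> in a local directory, that gsutil is installed and authenticated, and that the bucket already exists. The function names, the directory checkpoint/run1, and the bucket URL are hypothetical.

```python
import re
import subprocess
from pathlib import Path

# Assumption: checkpoint files are named "model-<step>.<suffix>",
# e.g. model-1000.index, model-1000.meta, model-1000.data-00000-of-00001.
STEP_PATTERN = re.compile(r"^model-(\d+)\.")


def latest_checkpoint_files(checkpoint_dir):
    """Return (step, files) for the highest-numbered checkpoint, or (None, [])."""
    by_step = {}
    for path in Path(checkpoint_dir).iterdir():
        match = STEP_PATTERN.match(path.name)
        if match and path.is_file():
            by_step.setdefault(int(match.group(1)), []).append(path)
    if not by_step:
        return None, []
    step = max(by_step)
    return step, sorted(by_step[step])


def build_copy_command(files, bucket_url):
    """Build a `gsutil cp` command that uploads the given files to the bucket."""
    destination = bucket_url.rstrip("/") + "/"
    return ["gsutil", "-m", "cp", *[str(f) for f in files], destination]


def upload_latest(checkpoint_dir, bucket_url):
    """Copy the newest checkpoint to the bucket; needs gsutil on the PATH."""
    step, files = latest_checkpoint_files(checkpoint_dir)
    if not files:
        raise FileNotFoundError(f"no model-<step>.* files in {checkpoint_dir}")
    subprocess.run(build_copy_command(files, bucket_url), check=True)
    return step
```

Calling upload_latest("checkpoint/run1", "gs://your-bucket/gpt-2/") between training runs would upload only the files of the newest step. The sketch does not copy TensorFlow's small "checkpoint" index file, so restoring from the bucket may require passing the checkpoint path explicitly.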

JadynHax commented 3 years ago

@Norod Ah, yeah. I hadn't realized at the time because I was still using the tpu-multi-snapshot branch, which does have it implemented. Sorry for not understanding what you meant!