timoschick / dino

This repository contains the code for "Generating Datasets with Pretrained Language Models".
https://arxiv.org/abs/2104.07540
Apache License 2.0

What instance did you use? #5

Closed · agombert closed this 3 years ago

agombert commented 3 years ago

Hey,

Thanks for the great work!

I tried to follow the tutorial from your blog post on Colab, but each time I try, it looks like the process gets killed before it even reaches the print("Starting dataset generation with DINO...") statement.

What type of instance (TPU/GPU) did you use to run the implementation?

Thanks,

Arnault

timoschick commented 3 years ago

Hi @agombert, we've been running all of our experiments using two Nvidia GeForce 1080 Ti GPUs. Do you have any more information as to why the process was killed (some kind of error message or anything)? Does it work if you use a smaller model (e.g., --model_name gpt2-medium), a smaller batch size (--batch_size <X>) or fewer entries per input and label (--num_entries_per_input_and_label <X>)?
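For reference, a reduced-footprint run might look something like the sketch below. Only the flag names quoted above come from this thread; the dino.py entry point, the task file, and all paths are assumptions to be adapted to your setup.

```sh
# Hypothetical sketch of a lower-memory DINO run. The flags
# --model_name, --batch_size and --num_entries_per_input_and_label are
# the ones suggested above; the script name and file paths are placeholders.
python3 dino.py \
  --output_dir ./dino_output \
  --task_file task_specs/sts.json \
  --model_name gpt2-medium \
  --batch_size 16 \
  --num_entries_per_input_and_label 25
```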

agombert commented 3 years ago

Hey @timoschick

Thanks for the quick answer. I think it was a RAM problem, since Colab only has 12.6 GB by default. I used a few GPUs from Lambda Labs and it worked fine!
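A quick way to confirm that limit from inside a Colab notebook is to check the VM's memory with a standard Linux command (nothing DINO-specific):

```sh
# Show total and available RAM on the Colab VM
# (prefix with "!" when running in a notebook cell).
free -h
```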

A few more questions on the blog post's experiment: roughly how long did the dataset generation take, and did you apply any filtering to the generated data?

Have a good day !

Arnault

timoschick commented 3 years ago

Hi @agombert, I don't have exact numbers, but if I remember correctly, it was about one or two days. We didn't do any filtering except for removing outputs with fewer than 16 tokens (that's the --min_num_tokens 16 flag in the blog post). You can also take a look at the full training dataset that we've used here.
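So the length filter is just one extra flag on the generation command; a minimal sketch with the same caveats as above (everything except --min_num_tokens is a placeholder):

```sh
# Hypothetical invocation with the blog post's length filter:
# generated outputs shorter than 16 tokens are discarded.
python3 dino.py \
  --output_dir ./dino_output \
  --task_file task_specs/sts.json \
  --min_num_tokens 16
```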

agombert commented 3 years ago

Hey, thanks for the answers! I'll dive deeper into your code in the coming months, I think. But everything is clear to me now :D