training-transformers-together / training-transformers-together.github.io

Contents of the main NeurIPS 2021 demo page
MIT License

[Section] Efficient Training #3

Open justheuristic opened 2 years ago

justheuristic commented 2 years ago

image

justheuristic commented 2 years ago

Talked with @TimDettmers

Planned text layout:

If we have time, add plots based on the calculator

TimDettmers commented 2 years ago

@mryab could you please review the notebook and check whether more explanations are needed and whether the story flows well? Here is the current notebook: https://colab.research.google.com/drive/1jeX4Qcq4O_kWxfta9fkXDeZ6NFYoqoxJ?usp=sharing

TimDettmers commented 2 years ago

@lhoestq Here is a draft of the efficient training tab. I already added a paragraph on dataset streaming. Please feel free to edit and expand the doc directly. https://docs.google.com/document/d/1RGWYcXM3F4rdwkJZNmPjJjWKi9aOYru1FuKtet1xhdw/edit?usp=sharing

justheuristic commented 2 years ago

@TimDettmers I've modified it to be a tiny bit more memory-efficient, take a look (same as in Slack): https://colab.research.google.com/drive/1WhcadcfMPzbiLUljlfIKzUhMUrbxIlyX?usp=sharing

justheuristic commented 2 years ago

Quick review:

Optional

Do you think it would make sense to showcase how it's used in our demo's first experiment with DALLE?

If so, here's DALLE with 1B parameters that fits on a K80 with Adam8Bit, but takes 19 GB+ with regular Adam: https://colab.research.google.com/drive/1b_0KLGOY9Dbbgup-Ln0fGX2TiDc8Y_Ih?usp=sharing
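
For intuition on why the optimizer choice matters at this scale, here is a rough back-of-the-envelope sketch. It is not taken from the notebook. It assumes fp32 weights and gradients (4 bytes each per parameter), two optimizer states per parameter for Adam (4 bytes each in the regular version, 1 byte each in the 8-bit version), and it ignores activations, temporary buffers and quantization metadata. The function and constant names are made up for illustration.

```python
# Rough estimate of training-state memory for a 1B-parameter model.
# Assumptions (not from the notebook): fp32 weights and gradients,
# two Adam states per parameter, no activations or other buffers counted.

def training_state_gib(n_params, optimizer_bytes_per_param,
                       weight_bytes=4, grad_bytes=4):
    """Memory for weights + gradients + optimizer state, in GiB."""
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes_per_param
    return n_params * bytes_per_param / 2**30

N_PARAMS = 1_000_000_000   # "1B parameters" from the comment above
ADAM_32BIT_BYTES = 2 * 4   # two fp32 states per parameter
ADAM_8BIT_BYTES = 2 * 1    # two 8-bit states per parameter

regular = training_state_gib(N_PARAMS, ADAM_32BIT_BYTES)
eight_bit = training_state_gib(N_PARAMS, ADAM_8BIT_BYTES)
print(f"regular Adam: {regular:.1f} GiB, 8-bit Adam: {eight_bit:.1f} GiB")
```

Under these assumptions the optimizer state alone shrinks from about 8 GB to about 2 GB. The remaining gap to the 19 GB+ figure would presumably come from activations and other buffers, which this estimate leaves out.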

TimDettmers commented 2 years ago

Thanks for the memory fix and catching that bug! Here is the most recent notebook: https://colab.research.google.com/drive/1Ii3JRnpI-15qoFhd8lgxXGwUQIiM7u0o?usp=sharing

lhoestq commented 2 years ago

From my message on slack:

You can switch to using the code dataset with:

args.dataset_name = "transformersbook/codeparrot-train"
args.dataset_config_name = None
args.text_column_name = "content"

It's up to you. Personally, I realized that there is already quite a lot of information in the notebook, so if switching to the code dataset could make things confusing for users, I would just stick to using C4.
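
The switch described above could look roughly like the sketch below. The three code-dataset values come from the message. The C4 defaults (`allenai/c4`, config `en`, column `text`), the helper names and the use of `SimpleNamespace` for `args` are assumptions for illustration, since the notebook's actual argument object is not shown here. Streaming relies on `datasets.load_dataset(..., streaming=True)`, which is imported lazily so the configuration part runs without the library installed.

```python
from types import SimpleNamespace


def make_dataset_args(use_code_dataset=False):
    """Build dataset arguments; the C4 defaults are assumed, not from the notebook."""
    args = SimpleNamespace(
        dataset_name="allenai/c4",   # assumed default
        dataset_config_name="en",    # assumed default
        text_column_name="text",     # C4 stores documents in the "text" column
    )
    if use_code_dataset:
        # Values from the message above.
        args.dataset_name = "transformersbook/codeparrot-train"
        args.dataset_config_name = None
        args.text_column_name = "content"
    return args


def stream_texts(args, split="train"):
    """Yield raw text samples one by one without downloading the full dataset."""
    from datasets import load_dataset  # lazy import: requires the `datasets` package

    dataset = load_dataset(
        args.dataset_name,
        args.dataset_config_name,
        split=split,
        streaming=True,
    )
    for record in dataset:
        yield record[args.text_column_name]


code_args = make_dataset_args(use_code_dataset=True)
print(code_args.dataset_name, code_args.text_column_name)
```

Calling `next(stream_texts(code_args))` would then fetch the first sample over the network. Keeping the column name in `args` means the rest of the pipeline does not need to know which dataset is active.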