google / trax

Trax — Deep Learning with Clear Code and Speed
Apache License 2.0

Create a notebook explaining TransformerLM / Transformer #897

Open lukaszkaiser opened 4 years ago

lukaszkaiser commented 4 years ago

It would be great to have a nice notebook explaining TransformerLM and maybe even full Transformer in models/ -- both to explain the code and if possible with illustrations clarifying the concepts.

jalammar commented 4 years ago

Wonderful. I'm on this. I'll post drafts here for team and community feedback.

jalammar commented 4 years ago

I'm checking in with the first draft: https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/Trax_TransformerLM_Intro.ipynb

I still want to add two extra sections. One to go over the components of the Trax TransformerLM at a high level (tl.Serial and tl.Layers). I also want this section to give some intuition into the initialization parameters of TransformerLM (n_layers, d_model, d_ff, n_heads).
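For reference, this is the kind of constructor call I want that section to build up to (the values below are only illustrative, not recommendations):

```python
import trax

# A decoder-only Transformer language model; each argument maps to a part
# of the architecture the section will illustrate.
model = trax.models.TransformerLM(
    vocab_size=33300,   # size of the token vocabulary
    d_model=512,        # width of the embeddings / residual stream
    d_ff=2048,          # width of the feed-forward sublayers
    n_layers=6,         # number of decoder blocks
    n_heads=8,          # attention heads per block
    mode='train',
)
```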

I'd love for the final section (or a follow-up notebook) to be text generation (+ training). I'm considering making it character-level, just to save energy and compute while still introducing the concept of tokenization.
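To make the tokenization idea concrete, a character-level scheme can be as small as the sketch below; the notebook may well end up using Trax's own tokenization utilities instead:

```python
import numpy as np

def char_tokenize(text):
    # Map each character to its integer codepoint; the vocab is then ~256 ids.
    return np.array([ord(c) for c in text], dtype=np.int32)

def char_detokenize(token_ids):
    return ''.join(chr(i) for i in token_ids)

tokens = char_tokenize("hello trax")
assert char_detokenize(tokens) == "hello trax"
```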

All feedback welcome!

pkozakowski commented 4 years ago

Wow, I love this. I'm curious, what are you using to make those pics, especially the animated one?

jalammar commented 4 years ago

Thanks! It's still an early draft.

I use Keynote. I'll upload the final Keynote file so people can later update it when necessary.

I'm currently looking at the Reformer generation notebook, and it's likely the best next step to point the reader to. So I think I'll hold off on the generation section for now and focus on a conceptual illustration of TransformerLM as the final section of the notebook.

jalammar commented 4 years ago

Checking in with the second draft. This version expands on the concepts of Trax layers, Serial, and Branch. It shows the parameters used when creating a TransformerLM model, and ends with more advanced concepts such as the residual layer:

https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/Trax_TransformerLM_Intro.ipynb
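To give a flavor of that last part, this is roughly the shape of block the notebook builds toward; it's a sketch with made-up dimensions, not the exact TransformerLM code:

```python
from trax import layers as tl

# A feed-forward sublayer wrapped in a residual connection -- the pattern the
# final section of the notebook illustrates.
feed_forward = tl.Serial(
    tl.LayerNorm(),
    tl.Dense(2048),
    tl.Relu(),
    tl.Dense(512),
)
block = tl.Residual(feed_forward)   # output = input + feed_forward(input)
```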

The "Transformer vs. TransformerLM" section could be supplemented with reasons of why to choose one vs. the other.

Once again, all feedback/corrections welcome!

jalammar commented 4 years ago

Side note: I really liked the advanced example for Branch in the docs. I couldn't fit it in the notebook, but I've created these illustrations for it:

For example, suppose Branch has three layers: [illustration]

Then it will take three inputs: [illustration]

And it will give four outputs: [illustration]
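In code form, the same counts can be checked directly. F, G, and H below are just toy layers built with tl.Fn to get the right numbers of inputs and outputs; they aren't taken from the docs example itself:

```python
from trax import layers as tl

F = tl.Fn('F', lambda x: x + 1)                       # 1 input,  1 output
G = tl.Fn('G', lambda x, y, z: x + y + z)             # 3 inputs, 1 output
H = tl.Fn('H', lambda x, y: (x + y, x - y), n_out=2)  # 2 inputs, 2 outputs

branch = tl.Branch(F, G, H)
print(branch.n_in, branch.n_out)  # 3 4
```

Branch needs as many inputs as its hungriest sub-layer (3) and returns everything the sub-layers produce (1 + 1 + 2 = 4).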

I also find myself intrigued by the data stack of the Serial combinator. There's room to explain its inner workings visually as well.
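As a teaser of those inner workings, here's a toy sketch of the stack in action. It's not how TransformerLM literally spells its residual connections, but it shows Dup pushing a copy of the input, Relu transforming the top of the stack, and Add popping two items:

```python
import numpy as np
from trax import layers as tl
from trax import shapes

# Serial keeps a data stack: Dup pushes a copy of x, Relu replaces the top
# item with relu(x), Add pops two items and pushes their sum -> relu(x) + x.
residual_relu = tl.Serial(tl.Dup(), tl.Relu(), tl.Add())

x = np.array([-2.0, -1.0, 3.0])
residual_relu.init(shapes.signature(x))  # no weights here, but init is harmless
print(residual_relu(x))                  # [-2. -1.  6.]
```

The result is relu(x) + x, which is essentially the residual pattern again.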

cdezapasquale commented 4 years ago

@jalammar many thanks for this. It is very useful to me and very well explained. (I am a newcomer here.)

I have two small doubts now:

1) Data pipeline: at least for me it remains unclear how to build the data pipeline when a) each instance is a file itself (for example, when using images), and b) all the data is in a single file (for example, a CSV file where each row is a sentence).

2) In prediction, this line of code remains unclear to me: predict_signature = trax.shapes.ShapeDtype((1,1), dtype=np.int32)

Thank you so much again :+1:

jalammar commented 4 years ago

Thanks for your comments, @facundodeza! Agreed that more examples on data pipelines would be a great addition to the docs. Noted on predict_signature. Thanks again!
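In the meantime, here's roughly what I understand that line to be doing: in 'predict' mode the model decodes autoregressively, one token at a time for a batch of one, so the signature just describes that (1, 1) int32 input so the model can be initialized. A sketch (the vocab_size here is only illustrative):

```python
import numpy as np
import trax
from trax import shapes

# Build the model in autoregressive decoding mode.
model = trax.models.TransformerLM(vocab_size=256, mode='predict')

# One token id per decoding step, batch of 1 -> shape (1, 1), dtype int32.
predict_signature = shapes.ShapeDtype((1, 1), dtype=np.int32)
model.init(predict_signature)
```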

j2i2 commented 4 years ago

Hi Jay. I'm starting to play with your notebook and really like it! As I dig in further, I can take notes and share them with you :-). Would you like that to happen in this thread, or does some other way work better for you?

Cheers, Jonni

jalammar commented 4 years ago

That's great, @j2i2. That can happen in this thread, sure!

j2i2 commented 4 years ago

Hi @jalammar,

Here are some first-pass suggestions/comments/context, focusing on the layers material later in the notebook. Use or don't use any of it as you see fit :-).

Typos and orthographic nits:

More concept-motivated suggestions:

Graphics suggestion:

jalammar commented 4 years ago

@j2i2 Thanks for the amazing feedback, Jonni! I have incorporated all of your comments in the text of the latest version of the notebook. I'll update the graphics for Relu shortly as well. I do see your point about moving the inputs and outputs outside the visual box. It's why the input and output sections are separated by a faint line from the body of the layer. This choice was mainly to be able to show layers that expect two tensors, sort of like the top part of Concatenate:
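(A tiny sketch of that two-input behavior, for anyone following along; the arrays are just made up:)

```python
import numpy as np
from trax import layers as tl

concat = tl.Concatenate()          # by default: 2 inputs, 1 output
print(concat.n_in, concat.n_out)   # 2 1

x0 = np.array([[1, 2, 3]])
x1 = np.array([[4, 5, 6]])
print(concat([x0, x1]))            # [[1 2 3 4 5 6]]
```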

This especially factors in explaining Branch (like in the images above in this thread). With those ideas established, I do see the value of having the tensors outside for more advanced future concepts. What do you think?

Thanks so much again for the great feedback! Please let me know if there's anything else.

j2i2 commented 4 years ago

@jalammar Glad you found the feedback useful, Jay! I'll keep reading through and send more suggestions as they arise.

As for layer inputs and outputs, I tend to think of them as pipes/channels through which streams of data enter and exit the layers. Is there something graphical along those lines that appeals to you? You could keep the data graphics visually outside the layer, and the layer graphic itself would have an explicit indication of its input and output requirements -- like small nubs or pipe mouths, but cleaner to fit in the nice clean style of your layer graphics. (So the Concatenate graphic could have two visual indicators on top for its inputs and one on bottom for its single output.)

Cheers, Jonni

j2i2 commented 4 years ago

@jalammar Hi Jay,

Here are some further general comments/suggestions. Will look next at your specific data/training example.

Best, Jonni

Typos and orthographic nits:

More concept-motivated suggestions:

jalammar commented 3 years ago

Thanks @j2i2! Updated.

> Also, good to mention/diagram the residual connection around the feed-forward layer

Good point. I'll add that to the graphic.

> Would this example be better for a full Transformer, which translates an input sequence to an output sequence? Could you put a simpler language model example here instead, e.g., decoding starting from nothing but a symbol?

There's room for a full Transformer graphic, but I'm not sure it belongs in this tutorial. Maybe if we expand the TransformerLM vs. Transformer section we can incorporate it. I'm all for making it simpler with an example of non-conditional generation first, totally.

j2i2 commented 3 years ago

@jalammar Cool; my comment on the TransformerLM example was more about the non-conditional generation ... agree about not complicating this tutorial with a full transformer graphic.

I spoke with Lukasz a little while back about this question, and he mentioned a nice example I think he had seen before, based on learning a Fibonacci sequence, or Fibonacci-like sequence. Would something like that appeal to you?

On a different note, the Trax library will be phasing out trax.supervised.trainer_lib in favor of trax.supervised.training. In particular, you can replace trainer_lib.Trainer with training.Loop. A relevant code sample is here.
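In case it helps before you open the sample, the new API has roughly this shape (model, train_stream, eval_stream, and output_dir below are placeholders for whatever the notebook already defines):

```python
from trax import layers as tl
from trax import optimizers
from trax.supervised import training

train_task = training.TrainTask(
    labeled_data=train_stream,          # generator of (inputs, targets, weights)
    loss_layer=tl.CrossEntropyLoss(),
    optimizer=optimizers.Adam(0.001),
    n_steps_per_checkpoint=500,
)

eval_task = training.EvalTask(
    labeled_data=eval_stream,
    metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
)

loop = training.Loop(model, train_task,
                     eval_tasks=[eval_task],
                     output_dir=output_dir)
loop.run(2000)   # run 2000 training steps
```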

Thanks again for your excellent work on this; I'll stay tuned for any further questions or discussion :-).

jalammar commented 3 years ago

Wonderful. Fibonacci sounds great. Kinda like this?

Noted on Trainer => Loop. I'll update the code accordingly. I noticed the transition so I didn't feature Trainer in the visuals prominently. Thanks for listing the example!
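For the Fibonacci idea, a toy data generator could look something like the sketch below; the starting values, lengths, and batch shapes are purely illustrative, not necessarily what the notebook will use:

```python
import random
import numpy as np

def fib_like_sequence(a, b, length):
    # Each term is the sum of the previous two, starting from (a, b).
    seq = [a, b]
    while len(seq) < length:
        seq.append(seq[-2] + seq[-1])
    return seq

def fib_batches(batch_size=8, length=8, max_start=10):
    # Infinite stream of (inputs, targets, weights) batches for LM training.
    while True:
        seqs = [fib_like_sequence(random.randint(1, max_start),
                                  random.randint(1, max_start),
                                  length)
                for _ in range(batch_size)]
        batch = np.array(seqs, dtype=np.int32)
        yield (batch, batch, np.ones_like(batch, dtype=np.float32))
```

As far as I can tell, feeding inputs equal to targets works here because TransformerLM shifts the inputs right internally, so the model is always predicting the next number from the previous ones.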

j2i2 commented 3 years ago

Yes; with the possible tweak of a different starting value, such as: [example]

which would help distinguish what the network is doing from other interpretations, e.g.: [example]

jalammar commented 3 years ago

@j2i2 Wonderful! Checking in with the updated version here: https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/Trax_TransformerLM_Intro.ipynb

Animation: [animated illustration]

Residual: [illustration]

quoniammm commented 3 years ago

@jalammar Hi, I'm glad to see you here.

I am reading your blog and the Trax code to learn about the Transformer; both are excellent.

But I seem to have a problem. Your blog and the Trax code are somewhat inconsistent in their description of multi-head attention.

Yours: [image from the blog]

Trax code: [screenshot]

The x in the Trax code above is Q or K or V; it is just x, only rearranged, not projected like in your blog.

Can you tell me why that is?