numfocus / YouTubeVideoTimestamps

Adding timestamps to NumFOCUS and PyData YouTube videos!
https://www.youtube.com/c/PyDataTV
MIT License

Transformers from the Ground Up - Sebastian Raschka | PyData Jeddah #146

Open 9x opened 1 year ago

9x commented 1 year ago

0:00 - Introduction
0:42 - Sponsors & contact information
1:31 - Transformers from the ground up
2:37 - Examples for transformers
4:48 - Outline
6:29 - Disclaimer
7:11 - Augmenting RNNs with attention
7:12 - Why use attention?
11:17 - Attention mechanism
12:18 - Architecture of an RNN with attention mechanism, context vector
15:06 - Self-attention
15:20 - A simple form of self-attention
18:10 - Self-attention with learnable weights
20:43 - The original transformer architecture
20:50 - Attention is all you need
24:26 - Multi-head attention
26:24 - Masked multi-head attention
28:08 - Large-scale language models
28:34 - Popular models: GPT & BERT
29:58 - Training transformers: a 2-step approach
31:30 - Feature-based approach
32:30 - Fine-tuning approach
34:00 - GPT models
34:57 - GPT-1
37:02 - GPT-2: Zero-shot learning
38:14 - GPT-3: Zero- and few-shot learning
41:04 - Bidirectional Encoder Representations from Transformers: BERT
41:30 - BERT pre-training step 1/2: Masked language model
43:20 - BERT pre-training step 2/2: Next-sentence prediction
44:09 - BERT fine-tuning for different tasks
45:20 - Fine-tuning a pre-trained BERT model in PyTorch (code example)
47:19 - Showing code example
54:53 - Hugging Face Trainer class
56:22 - Literature recommendations
57:02 - Q&A
57:30 - Can GPT do classification?
58:29 - Use of transformer-based models for image classification?
1:00:39 - Is a transformer-based architecture practical outside academia? How can they be made more accessible?
1:03:24 - Can GPT-2 be fine-tuned on small datasets?
1:04:53 - Can you speed up training for specific domains like medicine?
1:06:23 - Does fine-tuning update the vocabulary?
1:11:31 - How to handle the bottleneck caused by the data loader?
1:14:00 - How can an individual student train a transformer from scratch? Is it even possible?
1:15:33 - Comment on small dataset question
1:16:25 - Why is layer norm used and not batch norm?
1:17:40 - Closing remarks
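
For anyone following along with the 45:20 and 54:53 sections, here is a minimal sketch of the kind of BERT fine-tuning workflow with the Hugging Face `Trainer` class that the talk walks through. This is not the speaker's actual code; the checkpoint, dataset, and hyperparameters below are assumptions for illustration.

```python
# Minimal sketch: fine-tuning a pre-trained BERT model for binary text
# classification with the Hugging Face Trainer class (illustrative only;
# checkpoint, dataset, and hyperparameters are assumptions, not the talk's code).
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

checkpoint = "bert-base-uncased"  # assumed pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Assumed dataset: IMDb movie reviews (binary sentiment labels).
raw = load_dataset("imdb")

def tokenize(batch):
    # Truncate/pad reviews to the model's maximum input length.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = raw.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-imdb-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)

trainer.train()
print(trainer.evaluate())
```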

kshitijdshah99 commented 7 months ago

Hey @9x, I would love to contribute to this issue. I have previous experience building a Transformer architecture from scratch, so I'm looking forward to working with you on it. Can you tell me what exactly I am expected to do?