rasbt / LLMs-from-scratch

Implement a ChatGPT-like LLM in PyTorch from scratch, step by step
https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167

ch06/03_bonus_imdb-classification #155

Closed · d-kleine closed this issue 5 months ago

d-kleine commented 5 months ago

This might still be WIP, but I have issues reproducing the output in ch06/03_bonus_imdb-classification:

  1. The scripts gpt_download.py and previous_chapters.py are missing from the folder, so python train-gpt.py cannot be run as instructed in the README.
  2. It seems like python download-prepare-dataset.py does not correctly create the test and validation sets (the train set seems to be fine, though):
    • After copying the files from ch06/02_bonus_additional-experiments to ch06/03_bonus_imdb-classification, running python train-gpt.py results in a val loss of NaN (a quick sanity check for the generated CSVs is sketched after the log below):
      root@2f7823635ae1:/workspaces/LLMs-from-scratch/ch06/03_bonus_imdb-classification# python train-gpt.py
      2024-05-14 12:55:48.662928: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
      2024-05-14 12:55:48.689449: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
      To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
      2024-05-14 12:55:49.247512: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
      File already exists and is up-to-date: gpt2/124M/checkpoint
      File already exists and is up-to-date: gpt2/124M/encoder.json
      File already exists and is up-to-date: gpt2/124M/hparams.json
      File already exists and is up-to-date: gpt2/124M/model.ckpt.data-00000-of-00001
      File already exists and is up-to-date: gpt2/124M/model.ckpt.index
      File already exists and is up-to-date: gpt2/124M/model.ckpt.meta
      File already exists and is up-to-date: gpt2/124M/vocab.bpe
      2024-05-14 12:55:57.722961: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 154389504 exceeds 10% of free system memory.
      Ep 1 (Step 000000): Train loss 6.594, Val loss nan
      Ep 1 (Step 000050): Train loss 2.141, Val loss nan
      Ep 1 (Step 000100): Train loss 0.590, Val loss nan
      Ep 1 (Step 000150): Train loss 0.107, Val loss nan
      Ep 1 (Step 000200): Train loss 0.042, Val loss nan
      Ep 1 (Step 000250): Train loss 0.030, Val loss nan
      Ep 1 (Step 000300): Train loss 0.019, Val loss nan
      Ep 1 (Step 000350): Train loss 0.011, Val loss nan
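
For reference, a quick way to sanity-check the generated split files before training (a hypothetical check, not part of the repo's scripts; it assumes the filenames train.csv, validation.csv, and test.csv as used later in this thread):

```python
# Hypothetical sanity check: confirm the generated CSV splits load cleanly
# and contain no empty/NaN rows before starting a training run.
import pandas as pd

for name in ("train.csv", "validation.csv", "test.csv"):
    df = pd.read_csv(name)
    print(name, df.shape, "rows with NaN:", df.isna().any(axis=1).sum())
```
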
rasbt commented 5 months ago

I added the missing files and updated your PR. Hm, regarding the NaN loss, that's weird. Can you try again with the updated files? It seems to work fine for me:

[Screenshot: 2024-05-14 at 8:13:26 AM]
rasbt commented 5 months ago

-> instead of val.csv, it should be validation.csv in train-sklearn-logreg.py (as defined in download-prepare-dataset.py)

Arg, I updated this last week, I must have forgotten to push the changes.

rasbt commented 5 months ago

I think it should be all addressed now.

d-kleine commented 5 months ago

I have pulled every commit up to now. I am currently testing on Windows and in a Docker container. For me, the test and validation files are corrupt after downloading via python download-prepare-dataset.py, for whatever reason.

On Windows, the problems already start with the reporthook:

(pt) C:\Users\dk\Desktop\LLMs-from-scratch\ch06\03_bonus_imdb-classification>conda.bat activate test

(test) C:\Users\dk\Desktop\LLMs-from-scratch\ch06\03_bonus_imdb-classification>python download-prepare-dataset.py
Traceback (most recent call last):
  File "C:\Users\dk\Desktop\LLMs-from-scratch\ch06\03_bonus_imdb-classification\download-prepare-dataset.py", line 77, in <module>
    download_and_extract_dataset(dataset_url, "aclImdb_v1.tar.gz", "aclImdb")
  File "C:\Users\dk\Desktop\LLMs-from-scratch\ch06\03_bonus_imdb-classification\download-prepare-dataset.py", line 34, in download_and_extract_dataset
    urllib.request.urlretrieve(dataset_url, target_file, reporthook)
  File "C:\Users\dk\anaconda3\envs\test\lib\urllib\request.py", line 277, in urlretrieve
    reporthook(blocknum, bs, size)
  File "C:\Users\dk\Desktop\LLMs-from-scratch\ch06\03_bonus_imdb-classification\download-prepare-dataset.py", line 22, in reporthook
    speed = progress_size / (1024**2 * duration)
ZeroDivisionError: float division by zero

Update for Windows: this line of code fixes the download issue:

speed = int(progress_size / (1024 * duration)) if duration else 0
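
For context, a sketch of how such a reporthook could look with the guard in place (variable names are illustrative and may differ from the actual download-prepare-dataset.py; only the guarded speed line comes from this thread):

```python
# Sketch of a urlretrieve reporthook with the zero-duration guard applied.
import sys
import time

start_time = time.time()

def reporthook(count, block_size, total_size):
    duration = time.time() - start_time
    progress_size = count * block_size
    # On the very first callback, duration can be 0.0 (seen on Windows),
    # which previously caused a ZeroDivisionError.
    speed = int(progress_size / (1024 * duration)) if duration else 0
    percent = min(int(count * block_size * 100 / total_size), 100) if total_size > 0 else 0
    sys.stdout.write(
        f"\r{percent}% | {progress_size / (1024**2):.2f} MB "
        f"| {speed} KB/s | {duration:.2f} sec elapsed"
    )
    sys.stdout.flush()
```

It would then be passed as the third argument to urllib.request.urlretrieve, as in the traceback above.
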
d-kleine commented 5 months ago

It worked on Windows now, and the same with Docker running Ubuntu.

The issue was that the script created broken test and validation sets. The split takes 5-10 minutes to run properly, even though it seems to use only a small amount of resources on my PC. This should be reflected in the README, imho. I think it takes so long because of the text data, which is not ideal for a pandas DataFrame. Maybe there is a way to speed up this splitting process in download-prepare-dataset.py, e.g. with PyTorch? (One rough idea is sketched below.)
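
One rough idea (a sketch only; not tied to PyTorch and not the repo's actual implementation; paths and column names are assumptions) would be to collect all reviews in a plain Python list first and build the DataFrame in a single step, since repeated row-wise appends to a DataFrame are a common source of slowness:

```python
# Hypothetical sketch: read the aclImdb reviews in one pass and build the
# DataFrame once at the end.
from pathlib import Path
import pandas as pd

def load_split(base_dir="aclImdb", split="train"):
    rows = []
    for label_name, label in (("pos", 1), ("neg", 0)):
        for path in Path(base_dir, split, label_name).glob("*.txt"):
            rows.append({"text": path.read_text(encoding="utf-8"), "label": label})
    return pd.DataFrame(rows)

train_df = load_split(split="train")
print(train_df.shape)
```
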

rasbt commented 5 months ago

Thanks for testing. I will add the line later and investigate more. On my laptop, the whole thing didn't take more than 40 sec (37 sec to be precise, see below), so maybe there's still something odd going on on Windows.

[Screenshot: 2024-05-14 at 9:59:51 AM]

EDIT: This is 20.16 sec for downloading and 17 sec for processing, hence the 37 sec in total.

Token indices sequence length is longer than the specified maximum sequence length for this model (717 > 512). Running this sequence through the model will result in indexing errors

We strongly recommend passing in an attention_mask since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.

As far as I understand, it's automatically truncating it to 512 (BERT doesn't support longer inputs). I can see if I can manually truncate it to suppress the message.
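
For reference, a minimal sketch of explicit truncation (assuming a Hugging Face tokenizer such as distilbert-base-uncased; the exact model/tokenizer used in the bonus scripts may differ):

```python
# Minimal sketch: truncating explicitly to the model's maximum length
# avoids the "longer than the specified maximum sequence length" warning.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoded = tokenizer(
    "an example movie review that might be longer than 512 tokens ...",
    truncation=True,       # cut sequences down to max_length
    max_length=512,        # BERT-style models cannot handle longer inputs
    padding="max_length",  # pad shorter inputs; attention_mask marks the padding
    return_tensors="pt",
)
print(encoded["input_ids"].shape, encoded["attention_mask"].shape)
```
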

d-kleine commented 5 months ago

Windows: [screenshot]

And I have a pretty solid setup (AMD Ryzen 7 5800X3D, 32 GB DDR4 RAM, Nvidia RTX 3080 Ti, etc.)...

d-kleine commented 5 months ago

Can you please also fix the README so that it reads:

Run the following code to create the train.csv, validation.csv, and test.csv datasets:

(i.e., val.csv -> validation.csv)

rasbt commented 5 months ago

And I have a pretty solid setup (AMD Ryzen 7 5800X3D, 32 GB DDR4 RAM, Nvidia RTX 3080 Ti, etc.)...

Oh wow, that is indeed a solid setup. Out of curiosity, do you have an SSD or an HDD? This could maybe explain the difference. Just a hunch.

d-kleine commented 5 months ago

And I have a pretty solid setup (AMD Ryzen 7 5800X3D, 32 GB DDR4 RAM, Nvidia RTX 3080 Ti, etc.)...

Oh wow, that is indeed a solid setup. Out of curiosity, do you have an SSD or an HDD? This could maybe explain the difference. Just a hunch.

It's even an M.2 SSD (Crucial P5 Plus 2TB) on an X470 mainboard. I also had no other programs actively running in parallel when testing, so it must be something related to Windows.

rasbt commented 5 months ago

I just gave it a try on Google Colab, which generally has a low-end CPU. It seems like even there the whole processing takes only 56 sec.

[Screenshot: 2024-05-14 at 8:17:15 PM]

I wonder if it's maybe also Docker-related? I have limited experience with Docker, so I am just guessing, but maybe it can't handle large numbers of files being created in a short time (e.g., when unzipping the dataset)?

rasbt commented 5 months ago

I updated the code via #156 as you suggested.

Regarding

Token indices sequence length is longer than the specified maximum sequence length for this model (717 > 512). Running this sequence through the model will result in indexing errors

I think this is a spurious warning. There is no sequence longer than 256 tokens; I double-checked that. I think it sees token IDs with larger values and then assumes there could potentially be longer sequences, but that's not true.

We strongly recommend passing in an attention_mask since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.

That's an interesting one, actually. I think it would be a good comparison for the GPT model as well to alter the attention mask such that it ignores the padding tokens. This is a bit more complicated, as it will require some modifications to the GPTModel code, but I made a note and a reminder to do this as a bonus code experiment, along with a write-up/tutorial for Chapter 6, once the last chapter is finished. (A rough sketch of the idea follows below.)
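
For illustration only (assumed tensor shapes and a hypothetical helper name; not the book's GPTModel code), the core idea is to combine the usual causal mask with a padding mask so that pad-token key positions are ignored:

```python
# Sketch: build a combined causal + padding attention mask.
# input_ids: (batch, seq_len); pad_token_id marks padding positions.
import torch

def build_attn_mask(input_ids, pad_token_id):
    seq_len = input_ids.shape[1]
    # Causal part: position i may only attend to positions <= i.
    causal = torch.tril(torch.ones(seq_len, seq_len)).bool()   # (seq, seq)
    # Padding part: keys that are real (non-pad) tokens.
    real_keys = (input_ids != pad_token_id)                    # (batch, seq)
    # Broadcast and combine to (batch, seq, seq).
    return causal.unsqueeze(0) & real_keys.unsqueeze(1)

# Inside the attention module, the mask would then be applied to the
# (batch, num_heads, seq_len, seq_len) score tensor, e.g.:
# scores.masked_fill_(~mask.unsqueeze(1), -torch.inf)
```
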

d-kleine commented 5 months ago

I wonder if it's maybe also Docker-related? I have limited experience with Docker, so I am just guessing, but maybe it can't handle large numbers of files being created in a short time (e.g., when unzipping the dataset)?

Windows runs outside of Docker, so Docker cannot be the reason for the slow progress on Windows itself. I have done another test with Docker running an Ubuntu image, and this takes even longer (25 min): [screenshot]

As far as I could monitor the process, it seems like the archive extraction took quite long. That this is slower in Docker is normal, as you are practically running your code in one OS (Ubuntu in this image) inside another OS (Windows on my end). But phew...

rasbt commented 5 months ago

Windows runs outside of Docker, so Docker cannot be the reason for the slow progress on Windows itself. I have done another test with Docker running an Ubuntu image, and this takes even longer (25 min):

Oh I see, I thought you ran it in Docker previously when you reported the Windows slowness.

And wow, the Ubuntu one via Docker looks super slow as well. You mentioned it took 5-10 minutes on Ubuntu without Docker previously? The increase from 5-10 to 25 min I can perhaps understand. But still 5-10 min on Ubuntu sounds slow. When I run it on Google Colab or Lightning Studios (both use Ubuntu), it's just ~2-3 min maybe.

I'm curious, how long does it take to just unzip the downloaded aclImdb_v1.tar.gz dataset file using tar -xzvf aclImdb_v1.tar.gz from the command line?
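
As a cross-platform alternative to timing the tar command, the extraction could also be timed in Python (a hypothetical check, not part of the repo's scripts):

```python
# Hypothetical timing check: extract the downloaded archive with Python's
# tarfile module and report the elapsed time.
import tarfile
import time

start = time.time()
with tarfile.open("aclImdb_v1.tar.gz", "r:gz") as tar:
    tar.extractall(".")
print(f"Extraction took {time.time() - start:.1f} sec")
```
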

d-kleine commented 5 months ago

Oh I see, I thought you ran it in Docker previously when you reported the Windows slowness.

And wow, the Ubuntu one via Docker looks super slow as well. You mentioned it took 5-10 minutes on Ubuntu without Docker previously? The increase from 5-10 to 25 min I can perhaps understand. But still 5-10 min on Ubuntu sounds slow. When I run it on Google Colab or Lightning Studios (both use Ubuntu), it's just ~2-3 min maybe.

No, I tried Windows and, as an alternative, Docker (one after the other, of course).

I'm curious, how long does it take to just unzip the downloaded aclImdb_v1.tar.gz dataset file using tar -xzvf aclImdb_v1.tar.gz from the command line?

Windows (PowerShell): [screenshot]

That makes sense. I haven't run the "unzip only" test on WSL and Docker, but it will take longer there for sure. I have also taken a look into the extracted aclImdb folder now; there are 100,024 (!) files.

rasbt commented 5 months ago

Thanks for the details. That's interesting: so basically most of the time is spent on the unzipping (3 out of the 3-5 min on Windows).

In my case it was a bit quicker:

29 seconds on my macOS laptop

[Screenshot: 2024-05-15 at 9:07:05 PM]

5 seconds on Google Colab (Ubuntu)

[Screenshot: 2024-05-15 at 9:08:35 PM]

So maybe the Windows filesystem is just not ideal for this large number of small files. Yes, it's a lot of files; I think 50k based on the description: https://ai.stanford.edu/~amaas/data/sentiment/

This was the dataset that I originally used for Chapter 6, but I already had a suspicion that it might test the readers' patience 😅, which is why I swapped it for a smaller one that is easier to work with.

So, I think there is fundamentally no remaining issue after adding your fixes, correct? I will close this issue then. (But please correct me if I'm wrong, and thanks for these additional insights on the runtimes!)