Closed: d-kleine closed this issue 5 months ago
I added the missing files and updated your PR. Hm, regarding the NaN loss, that's weird. Can you try again with the updated files? It seems to work fine for me:

> instead of `val.csv` it should be `validation.csv` in `train-sklearn-logreg.py` (as defined in `download-prepare-dataset.py`)

Arg, I updated this last week; I must have forgotten to push the changes. I think it should all be addressed now.
I have pulled every commit up to now. I am currently testing on Windows and in a Docker container. For me, the test and validation files are corrupt when downloading via `python download-prepare-dataset.py`, for whatever reason.
On Windows, the problems already start with the `reporthook`:
(pt) C:\Users\dk\Desktop\LLMs-from-scratch\ch06\03_bonus_imdb-classification>conda.bat activate test
(test) C:\Users\dk\Desktop\LLMs-from-scratch\ch06\03_bonus_imdb-classification>python download-prepare-dataset.py
Traceback (most recent call last):
File "C:\Users\dk\Desktop\LLMs-from-scratch\ch06\03_bonus_imdb-classification\download-prepare-dataset.py", line 77, in <module>
download_and_extract_dataset(dataset_url, "aclImdb_v1.tar.gz", "aclImdb")
File "C:\Users\dk\Desktop\LLMs-from-scratch\ch06\03_bonus_imdb-classification\download-prepare-dataset.py", line 34, in download_and_extract_dataset
urllib.request.urlretrieve(dataset_url, target_file, reporthook)
File "C:\Users\dk\anaconda3\envs\test\lib\urllib\request.py", line 277, in urlretrieve
reporthook(blocknum, bs, size)
File "C:\Users\dk\Desktop\LLMs-from-scratch\ch06\03_bonus_imdb-classification\download-prepare-dataset.py", line 22, in reporthook
speed = progress_size / (1024**2 * duration)
ZeroDivisionError: float division by zero
Update for Windows: this line of code fixes the download issue:

`speed = int(progress_size / (1024 * duration)) if duration else 0`
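For context, here is a minimal sketch of how the guarded `reporthook` could look (the parameter names and the progress format are assumptions; the actual code in `download-prepare-dataset.py` may differ slightly):

```python
import time

start_time = time.time()

def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        # First callback from urlretrieve: (re)start the timer, skip the speed calc
        start_time = time.time()
        return
    duration = time.time() - start_time
    progress_size = count * block_size
    # Guard against duration == 0 (coarse timer resolution on Windows)
    speed = int(progress_size / (1024 * duration)) if duration else 0
    percent = min(count * block_size * 100 / total_size, 100)
    print(f"\r{int(percent)}% | {progress_size / (1024 ** 2):.2f} MB "
          f"| {speed} KB/s | {duration:.2f} sec elapsed", end="")
```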
It works on Windows now, and the same with Docker running Ubuntu.

The issue was that the script created broken test and validation sets. The split takes 5-10 minutes to run properly, even though it seems to use only a small amount of resources on my PC. This should be reflected in the README, imho. I think it takes so long because of the text data, which is not ideal for a pandas DataFrame. Maybe there is a way to speed up this splitting process in `download-prepare-dataset.py`, e.g., with PyTorch?
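Just to illustrate the idea: even in plain pandas, a vectorized split should be quick. A minimal sketch (the combined input file name and the 70/10/20 fractions are assumptions; the output file names follow the scripts here):

```python
import pandas as pd

# Assumed: one combined CSV with all labeled reviews
df = pd.read_csv("movie_data.csv")

# Shuffle once, then slice by position instead of building
# each split row by row
df = df.sample(frac=1, random_state=123).reset_index(drop=True)

train_end = int(len(df) * 0.70)
val_end = train_end + int(len(df) * 0.10)

df.iloc[:train_end].to_csv("train.csv", index=False)
df.iloc[train_end:val_end].to_csv("validation.csv", index=False)
df.iloc[val_end:].to_csv("test.csv", index=False)
```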
Token indices sequence length is longer than the specified maximum sequence length for this model (717 > 512). Running this sequence through the model will result in indexing errors
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Token indices sequence length is longer than the specified maximum sequence length for this model (686 > 512). Running this sequence through the model will result in indexing errors
Thanks for testing. I will add the line later and investigate more. On my laptop, the whole thing didn't take more than 40 sec (37 sec to be precise, see below), so maybe there's still something odd going on on Windows.

EDIT: That's 20.16 sec for downloading and 17 sec for processing, hence the 37 sec in total.
> Token indices sequence length is longer than the specified maximum sequence length for this model (717 > 512). Running this sequence through the model will result in indexing errors. We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
As far as I understand, it's automatically truncated to 512 (BERT doesn't support longer inputs). I'll see if I can truncate it manually to suppress the message.
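For example, something like this should silence it at tokenization time (a sketch assuming a Hugging Face tokenizer as in `train-bert-hf.py`; the model name here is just a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model

text = "An example movie review that could be longer than the model limit..."

encoded = tokenizer(
    text,
    truncation=True,       # cut off at max_length instead of warning later
    max_length=512,        # BERT's maximum supported input length
    padding="max_length",  # pad shorter inputs up to max_length
    return_tensors="pt",
)
```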
Windows:
And I have a pretty solid setup (AMD Ryzen 7 5800X3D, 32 GB DDR4 RAM, Nvidia RTX 3080 Ti, etc.)...
Can you please also fix the README to:

> Run the following code to create the `train.csv`, `validation.csv`, and `test.csv` datasets:

(`val.csv` -> `validation.csv`)
> And I have a pretty solid setup (AMD Ryzen 7 5800X3D, 32 GB DDR4 RAM, Nvidia RTX 3080 Ti, etc.)...

Oh wow, that is indeed a solid setup. Out of curiosity, do you have an SSD or an HDD? That could maybe explain the difference. Just a hunch.
Even an M.2 SSD (Crucial P5 Plus 2TB) on an X470 mainboard. I also had no other programs actively running in parallel when testing, so it must be something related to Windows.
I just gave it a try on Google Colab, which generally has a low-end CPU. It seems like even there the whole processing only takes 56 sec
I wonder if it's maybe also Docker-related? I have limited experience with Docker, so I am just guessing. But maybe it can't handle a large number of files being created in a short time (e.g., when unzipping the dataset)?
I updated the code via #156 as you suggested.
Regarding

> Token indices sequence length is longer than the specified maximum sequence length for this model (717 > 512). Running this sequence through the model will result in indexing errors

I think this is a spurious warning. There is no sequence longer than 256 tokens; I double-checked that. I think it's seeing token IDs with larger values and then assumes there could potentially be longer sequences, but that's not true.
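For reference, this is roughly how one can double-check the maximum tokenized length (a sketch; the CSV column name and the model are assumptions):

```python
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model

df = pd.read_csv("train.csv")
max_len = df["text"].map(lambda t: len(tokenizer.encode(t))).max()
print(max_len)  # should stay <= 256 here
```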
> We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
That's an interesting one, actually. I think it would be a good comparison for the GPT model as well to alter the attention mask such that it ignores the padding tokens. This is a bit more complicated, as it will require some modifications to the `GPTModel` code, but I made a note and a reminder to do this as a bonus code experiment, along with a write-up/tutorial for Chapter 6, once the last chapter is finished.
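To sketch the idea (this is not the repository's `GPTModel` code, just a hypothetical illustration of combining a causal mask with a padding mask):

```python
import torch

def build_attn_mask(input_ids, pad_token_id):
    # Causal mask: position i may only attend to positions <= i
    batch_size, seq_len = input_ids.shape
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

    # Padding mask: True where the token is a real (non-padding) token
    not_pad = input_ids != pad_token_id  # (batch, seq_len)

    # Attend only where causally allowed AND not attending to a padding token
    return causal.unsqueeze(0) & not_pad.unsqueeze(1)  # (batch, seq_len, seq_len)

ids = torch.tensor([[5, 7, 9, 0, 0]])  # 0 = padding token
print(build_attn_mask(ids, pad_token_id=0))
```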
> I wonder if it's maybe also Docker-related? I have limited experience with Docker, so I am just guessing. But maybe it can't handle a large number of files being created in a short time (e.g., when unzipping the dataset)?

Windows runs outside of Docker, so Docker cannot be the reason for the slow progress on Windows itself. I have done another test with Docker running an Ubuntu image, and this takes even longer (25 min):

As far as I could monitor the process, it seems like the archive extraction took quite long. That this is slower in Docker is normal, as you practically run your code in one OS (Ubuntu in this image) inside another OS (Windows on my end). But phew...
> Windows runs outside of Docker, so Docker cannot be the reason for the slow progress on Windows itself. I have done another test with Docker running an Ubuntu image, and this takes even longer (25 min):
Oh I see, I thought you ran it in Docker previously when you reported the Windows slowness.
And wow, the Ubuntu one via Docker looks super slow as well. You mentioned it took 5-10 minutes on Ubuntu without Docker previously? The increase from 5-10 to 25 min I can perhaps understand. But still 5-10 min on Ubuntu sounds slow. When I run it on Google Colab or Lightning Studios (both use Ubuntu), it's just ~2-3 min maybe.
I'm curious, how long does it take to just unzip the downloaded `aclImdb_v1.tar.gz` dataset file using `tar -xzvf aclImdb_v1.tar.gz` from the command line?
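(If it's easier, the extraction step can also be timed directly from Python with the standard library; a minimal sketch:)

```python
import tarfile
import time

start = time.time()
with tarfile.open("aclImdb_v1.tar.gz", "r:gz") as tar:
    tar.extractall()  # extract the ~100k dataset files into the current directory
print(f"Extraction took {time.time() - start:.1f} sec")
```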
> Oh I see, I thought you ran it in Docker previously when you reported the Windows slowness.
>
> And wow, the Ubuntu one via Docker looks super slow as well. You mentioned it took 5-10 minutes on Ubuntu without Docker previously? The increase from 5-10 to 25 min I can perhaps understand. But still 5-10 min on Ubuntu sounds slow. When I run it on Google Colab or Lightning Studios (both use Ubuntu), it's just ~2-3 min maybe.
No, I have tried Windows and, as an alternative, Docker (one after the other, of course).
> I'm curious, how long does it take to just unzip the downloaded `aclImdb_v1.tar.gz` dataset file using `tar -xzvf aclImdb_v1.tar.gz` from the command line?
Windows (PowerShell):
That makes sense.

I haven't run the "unzip only" test on WSL and Docker, but it will surely take longer there.

I have also taken a look into the unzipped `aclImdb` folder now; there are 100,024 (!) files.
Thanks for the details. That's interesting: so basically most of the time is spent on the unzipping (3 out of the 3-5 min on Windows). In my case it was a bit quicker:

- 29 seconds on my macOS laptop
- 5 seconds on Google Colab (Ubuntu)

So maybe the Windows filesystem is not ideal for this large number of small files. Yes, it's a lot of files; I think 50k based on the description: https://ai.stanford.edu/~amaas/data/sentiment/

This was the dataset that I originally used for Chapter 6, but I already had a suspicion that it might test the readers' patience 😅, which is why I swapped it for a smaller one that is easier to work with.
So, I think there is fundamentally no issue anymore after adding your fixes, correct? I will therefore close this issue. (But please correct me if I'm wrong, and thanks for these additional insights on the runtimes!)
This might still be WIP, but I have issues reproducing the output in `ch06/03_bonus_imdb-classification`:

- `gpt_download.py` and `previous_chapters.py` are missing in the folder, therefore I cannot run `python train-gpt.py` as instructed in the README
- `python download-prepare-dataset.py` does not correctly create the test and validation set (the train set seems to be fine, though)
- After copying the missing files from `ch06/02_bonus_additional-experiments` to `ch06/03_bonus_imdb-classification`, running `python train-gpt.py` results in a val loss of NaNs
- For `python train-bert-hf.py` and `python train-sklearn-logreg.py`: instead of `val.csv` it should be `validation.csv` in `train-sklearn-logreg.py` (as defined in `download-prepare-dataset.py`)