Closed jamesdhope closed 5 months ago
I'm happy to take a crack at this, but I'm not sure I understand the issue.
> the HF `from datasets import load_dataset`
What does HF mean?
> with the dependency workaround:
>
> import torch
> torch.utils.data.datapipes.utils.common.DILL_AVAILABLE = torch.utils._import_utils.dill_available()
> import torchdata

Does this need to go in the PyTorch source code or in the tutorial itself?
I tried a simple sniff test:
import torch
torch.utils.data.datapipes.utils.common.DILL_AVAILABLE = torch.utils._import_utils.dill_available()
import torchdata
from torch import Tensor
from torch.utils.data import dataset
from torchtext.datasets import WikiText2
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
train_iter = WikiText2(split='train')
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])
def data_process(raw_text_iter: dataset.IterableDataset) -> Tensor:
"""Converts raw text into a flat Tensor."""
data = [torch.tensor(vocab(tokenizer(item)), dtype=torch.long) for item in raw_text_iter]
return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))
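# For intuition (hypothetical input, not from the tutorial):
# data_process(iter(["hello world", ""])) tokenizes each line, maps tokens
# to vocab ids, drops the empty line via numel() > 0, and concatenates
# everything into one flat 1-D LongTensor.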
# ``train_iter`` was "consumed" by the process of building the vocab,
# so we have to create it again
train_iter, val_iter, test_iter = WikiText2()
train_data = data_process(train_iter)
val_data = data_process(val_iter)
test_data = data_process(test_iter)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
def batchify(data: Tensor, bsz: int) -> Tensor:
"""Divides the data into ``bsz`` separate sequences, removing extra elements
that wouldn't cleanly fit.
Arguments:
data: Tensor, shape ``[N]``
bsz: int, batch size
Returns:
Tensor of shape ``[N // bsz, bsz]``
"""
seq_len = data.size(0) // bsz
data = data[:seq_len * bsz]
data = data.view(bsz, seq_len).t().contiguous()
return data.to(device)
batch_size = 20
eval_batch_size = 10
train_data = batchify(train_data, batch_size) # shape ``[seq_len, batch_size]``
val_data = batchify(val_data, eval_batch_size)
test_data = batchify(test_data, eval_batch_size)
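# Quick sanity check of batchify's reshaping on a toy tensor (illustrative
# only, not part of the tutorial): 26 elements with bsz=4 keeps the first
# 24 and yields shape [seq_len=6, bsz=4].
toy = torch.arange(26)
print(batchify(toy, 4).shape)  # torch.Size([6, 4])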
and got the following error in the Colab notebook:
HTTPError Traceback (most recent call last)
<ipython-input-12-0398103be9c1> in <cell line: 10>()
8 train_iter = WikiText2(split='train')
9 tokenizer = get_tokenizer('basic_english')
---> 10 vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])
11 vocab.set_default_index(vocab['<unk>'])
12
54 frames
/usr/local/lib/python3.10/dist-packages/requests/models.py in raise_for_status(self)
1019
1020 if http_error_msg:
-> 1021 raise HTTPError(http_error_msg, response=self)
1022
1023 def close(self):
HTTPError: 403 Client Error: Forbidden for url: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
This exception is thrown by __iter__ of HTTPReaderIterDataPipe(skip_on_error=False, source_datapipe=OnDiskCacheHolderIterDataPipe, timeout=None)
Also looks like advanced_source/ddp_pipeline.py might suffer from the same issue (https://pytorch.org/tutorials/intermediate/pipeline_tutorial.html).
Hey Logan,

Thanks for your reply. The fix needs to go in the tutorial, not the source. The issue is that the source is not accessible. I am suggesting replacing it with a Hugging Face open dataset, such as WikiText2.

James
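For illustration, a minimal sketch of what that replacement might look like, assuming the standard 'wikitext' / 'wikitext-2-v1' dataset id on the Hugging Face Hub (the variable names here are just for illustration):

from datasets import load_dataset

wikitext = load_dataset('wikitext', 'wikitext-2-v1')
# Each split exposes a single 'text' column; yield the raw lines the same
# way torchtext's WikiText2 iterators did.
train_iter = (row['text'] for row in wikitext['train'])
val_iter = (row['text'] for row in wikitext['validation'])
test_iter = (row['text'] for row in wikitext['test'])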
/assigntome
We no longer support this tutorial, as torchtext is no longer maintained. Can you please create a redirect file called beginner_source/transformer_tutorial.rst with the following content:
Language Modeling with nn.Transformer and torchtext
====================================================

The content is deprecated.

.. raw:: html

   <meta http-equiv="refresh" content="0; url=https://pytorch.org/tutorials/">
Torchtext is only used for the vocab and tokenizer. Could this be swapped out for an alternative library?
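For example, a minimal torchtext-free sketch, assuming a crude lowercase/whitespace split is an acceptable stand-in for torchtext's 'basic_english' tokenizer (the placeholder corpus and names here are hypothetical):

from collections import Counter

train_lines = ["the quick brown fox", "jumps over the lazy dog"]  # placeholder corpus

def tokenizer(line):
    # crude stand-in for torchtext's 'basic_english' tokenizer
    return line.lower().split()

counter = Counter(tok for line in train_lines for tok in tokenizer(line))
itos = ['<unk>'] + sorted(counter)             # index-to-string, '<unk>' at index 0
stoi = {tok: i for i, tok in enumerate(itos)}  # string-to-index

def vocab(tokens):
    # unknown tokens fall back to '<unk>', like vocab.set_default_index
    return [stoi.get(t, stoi['<unk>']) for t in tokens]

print(vocab(tokenizer("the quick red fox")))  # 'red' maps to <unk> (index 0)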
I do believe the purpose of this tutorial was to use torchtext with PyTorch. We actually don't have the source file of the tutorial in the repo anymore.
@jamesdhope I'll be submitting a PR shortly to deprecate this tutorial. However, it does look like other tutorials make use of the WikiText-2 dataset without torchtext:
Issue and Suggested Fix
Please can this helpful tutorial be updated with the HF `from datasets import load_dataset` and merged into main with the dependency issue workaround.

Asset
https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/9cf2d4ead514e661e20d2070c9bf7324/transformer_tutorial.ipynb#scrollTo=TY5T9Gic_qih
Describe your environment
Google Colab environment. I have replicated the issue locally with the same pip package versions.
cc @sekyondaMeta @svekars @kit1980