pytorch / tutorials

PyTorch tutorials.
https://pytorch.org/tutorials/
BSD 3-Clause "New" or "Revised" License

[BUG] - Dependency Issue in Language Modeling with nn.Transformer and torchtext Tutorial #2895

Closed: jamesdhope closed this issue 5 months ago

jamesdhope commented 5 months ago

Issue and Suggested Fix

Please could this helpful tutorial be updated to load its data with the HF from datasets import load_dataset, and merged into main with the following workaround for the dependency issue:

import torch
# Newer torch releases removed the DILL_AVAILABLE flag that torchdata still
# imports; restore it before importing torchdata so the import succeeds.
torch.utils.data.datapipes.utils.common.DILL_AVAILABLE = torch.utils._import_utils.dill_available()
import torchdata
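
For concreteness, a minimal sketch of the loading step I have in mind (assuming the datasets package is installed and that the wikitext-2-v1 configuration on the HF Hub is an acceptable mirror):

# Load WikiText-2 from the HF Hub instead of the unreachable torchtext URL.
from datasets import load_dataset

wikitext = load_dataset('wikitext', 'wikitext-2-v1')
print(wikitext['train'][10]['text'])  # each row carries one line of raw text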

Asset

https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/9cf2d4ead514e661e20d2070c9bf7324/transformer_tutorial.ipynb#scrollTo=TY5T9Gic_qih

Describe the bug

ImportError                               Traceback (most recent call last)
<ipython-input-26-b02c7921f3b1> in <cell line: 5>()
      3 from torchtext.vocab import build_vocab_from_iterator
      4 
----> 5 train_iter = WikiText2(split='train')
      6 tokenizer = get_tokenizer('basic_english')
      7 vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])

6 frames
/usr/local/lib/python3.10/dist-packages/torchdata/datapipes/iter/util/cacheholder.py in <module>
     22     portalocker = None
     23 
---> 24 from torch.utils.data.datapipes.utils.common import _check_unpickable_fn, DILL_AVAILABLE
     25 
     26 from torch.utils.data.graph import traverse_dps

ImportError: cannot import name 'DILL_AVAILABLE' from 'torch.utils.data.datapipes.utils.common' (/usr/local/lib/python3.10/dist-packages/torch/utils/data/datapipes/utils/common.py)


Describe your environment

Google Colab environment. I have replicated the issue locally with the same pip package versions.

cc @sekyondaMeta @svekars @kit1980

loganthomas commented 5 months ago

I'm happy to take a crack at this, but I'm not sure I understand the issue.

the HF from datasets import load_dataset

What does HF mean?

with the dependency workaround:

import torch
torch.utils.data.datapipes.utils.common.DILL_AVAILABLE = torch.utils._import_utils.dill_available()
import torchdata

Does this need to go in the pytorch source code or in the tutorial itself?

I tried a simple sniff test:

import torch
torch.utils.data.datapipes.utils.common.DILL_AVAILABLE = torch.utils._import_utils.dill_available()
import torchdata

from torch import Tensor
from torch.utils.data import dataset

from torchtext.datasets import WikiText2
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

train_iter = WikiText2(split='train')
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

def data_process(raw_text_iter: dataset.IterableDataset) -> Tensor:
    """Converts raw text into a flat Tensor."""
    data = [torch.tensor(vocab(tokenizer(item)), dtype=torch.long) for item in raw_text_iter]
    return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))

# ``train_iter`` was "consumed" by the process of building the vocab,
# so we have to create it again
train_iter, val_iter, test_iter = WikiText2()
train_data = data_process(train_iter)
val_data = data_process(val_iter)
test_data = data_process(test_iter)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def batchify(data: Tensor, bsz: int) -> Tensor:
    """Divides the data into ``bsz`` separate sequences, removing extra elements
    that wouldn't cleanly fit.

    Arguments:
        data: Tensor, shape ``[N]``
        bsz: int, batch size

    Returns:
        Tensor of shape ``[N // bsz, bsz]``
    """
    seq_len = data.size(0) // bsz
    data = data[:seq_len * bsz]
    data = data.view(bsz, seq_len).t().contiguous()
    return data.to(device)

batch_size = 20
eval_batch_size = 10
train_data = batchify(train_data, batch_size)  # shape ``[seq_len, batch_size]``
val_data = batchify(val_data, eval_batch_size)
test_data = batchify(test_data, eval_batch_size)

and got the following error in the Colab notebook:

HTTPError                                 Traceback (most recent call last)
<ipython-input-12-0398103be9c1> in <cell line: 10>()
      8 train_iter = WikiText2(split='train')
      9 tokenizer = get_tokenizer('basic_english')
---> 10 vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])
     11 vocab.set_default_index(vocab['<unk>'])
     12 

54 frames
/usr/local/lib/python3.10/dist-packages/requests/models.py in raise_for_status(self)
   1019 
   1020         if http_error_msg:
-> 1021             raise HTTPError(http_error_msg, response=self)
   1022 
   1023     def close(self):

HTTPError: 403 Client Error: Forbidden for url: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
This exception is thrown by __iter__ of HTTPReaderIterDataPipe(skip_on_error=False, source_datapipe=OnDiskCacheHolderIterDataPipe, timeout=None)

Also, it looks like advanced_source/ddp_pipeline.py might suffer from the same issue (https://pytorch.org/tutorials/intermediate/pipeline_tutorial.html).

jamesdhope commented 5 months ago

Hey Logan,

Thanks for your reply. The fix needs to go in the tutorial, not the source. The issue is that the source dataset is no longer accessible. I am suggesting replacing it with an open dataset hosted on Hugging Face (HF), such as WikiText-2.

James
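
Concretely, something like this sketch is what I have in mind. It assumes the datasets package is installed and reuses the tutorial's existing tokenizer, build_vocab_from_iterator, and data_process:

from datasets import load_dataset

# Fetch the same corpus from the Hugging Face Hub; splits are 'train',
# 'validation', and 'test', and each row is a dict with a 'text' field.
wikitext = load_dataset('wikitext', 'wikitext-2-v1')

vocab = build_vocab_from_iterator(
    map(tokenizer, (row['text'] for row in wikitext['train'])),
    specials=['<unk>'],
)
vocab.set_default_index(vocab['<unk>'])

# Generator expressions are single-use, so build a fresh one for each call.
train_data = data_process(row['text'] for row in wikitext['train'])
val_data = data_process(row['text'] for row in wikitext['validation'])
test_data = data_process(row['text'] for row in wikitext['test'])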


loganthomas commented 5 months ago

/assigntome

svekars commented 5 months ago

We no longer support this tutorial because torchtext is no longer maintained. Can you please create a redirect file called beginner_source/transformer_tutorial.rst with the following content:

Language Modeling with nn.Transformer and torchtext
====================================================

The content is deprecated.

.. raw:: html

   <meta http-equiv="refresh" content="0; url=https://pytorch.org/tutorials/">

jamesdhope commented 5 months ago

Torchtext is only used for the vocab and tokenizer. Could this be swapped out for an alternative library?
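
For illustration, a rough sketch of what a torchtext-free replacement for those two pieces might look like, in plain Python. The tokenizer only approximates torchtext's 'basic_english' behaviour, and the helper names are hypothetical:

import re
from collections import Counter

def basic_english_tokenize(line):
    # Approximation of torchtext's 'basic_english' tokenizer: lowercase,
    # pad punctuation with spaces, then split on whitespace.
    line = re.sub(r"([.,!?\"'()])", r" \1 ", line.lower())
    return line.split()

def build_vocab(lines, specials=('<unk>',)):
    # Counter-based stand-in for build_vocab_from_iterator: assign each
    # token an integer id; unknown tokens fall back to the '<unk>' id.
    counts = Counter(tok for line in lines for tok in basic_english_tokenize(line))
    itos = list(specials) + sorted(counts)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return lambda tokens: [stoi.get(tok, stoi['<unk>']) for tok in tokens]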


svekars commented 5 months ago

I believe the purpose of this tutorial was to demonstrate using torchtext with PyTorch. We actually no longer have the source file for this tutorial in the repo.

loganthomas commented 5 months ago

@jamesdhope I'll be submitting a PR shortly to deprecate this tutorial. However, it does look like other tutorials make use of the Wikitext-2 dataset without torchtext: