pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

An inconsistency in torchtext.experimental.datasets.LanguageModelingDataset() #1017

Open daisylab opened 4 years ago

daisylab commented 4 years ago

Hi. It's about torchtext 0.7.0 version (from pip).

According to the documentation (https://pytorch.org/text/experimental_datasets.html#wikitext-2, and the docstring), the experimental WikiText2 has a parameter called 'single_line'. The documentation says:

single_line – whether to return all tokens in a single line. (Default: True) By default, all lines in raw text file are concatenated into a single line. Use single_line = False if one wants to get data line by line.
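To illustrate what the documented flag means, here is a minimal sketch of the described behavior (not torchtext's actual implementation; joining lines with a space is my assumption):

```python
# Illustration of the documented single_line semantics (not torchtext's
# actual code; joining with a space is an assumption).
lines = ["= Title =", "First sentence .", "Second sentence ."]

# single_line=True: all lines concatenated into one long line
single_line_true = [" ".join(lines)]

# single_line=False: data kept line by line
single_line_false = list(lines)

print(len(single_line_true), len(single_line_false))
```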

However, the actual code raises an error for WikiText2 (and in fact for every dataset except WikiText103). The relevant check is:

if not single_line and dataset_name != 'WikiText103':
    raise TypeError('single_line must be True except for WikiText103')
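For reference, the guard's behavior can be reproduced in isolation (a minimal sketch; check_single_line is my own name for it, but the condition and message are quoted from torchtext 0.7.0):

```python
def check_single_line(dataset_name, single_line):
    # Same condition as in torchtext 0.7.0's experimental
    # language-modeling setup: only WikiText103 may opt out.
    if not single_line and dataset_name != 'WikiText103':
        raise TypeError('single_line must be True except for WikiText103')

check_single_line('WikiText103', False)  # passes silently
try:
    check_single_line('WikiText2', False)
except TypeError as err:
    print(err)  # single_line must be True except for WikiText103
```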

I think this is not a big deal, and maybe it has already been corrected. For now, though, it complicates my code a bit.

I hope this subtle issue will be resolved soon. :)

zhangguanheng66 commented 4 years ago

The single_line variable is applicable to WikiText103 only for now. @daisylab Could you explain what you propose to change, or the behavior you want?

daisylab commented 3 years ago

OK, let me explain with a full walkthrough.

First, the environment: I use Ubuntu 18.04, with Python installed from .deb packages and most Python packages installed from pip.

And of course I have PyTorch and torchtext.

(helloworld) sungjin@ailab:~> cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.5 LTS"

(helloworld) sungjin@ailab:~> python -V
Python 3.6.9

(helloworld) sungjin@ailab:~> pip list | grep torch
torch              1.6.0+cpu
torchtext          0.7.0
torchvision        0.7.0+cpu

Then I make a fresh directory to work in. (Please don't mind the names of the virtualenv, my account, and my lab.)

(helloworld) sungjin@ailab:~> mkdir wiki
(helloworld) sungjin@ailab:~> cd wiki   
(helloworld) sungjin@ailab:~/wiki> ls -a     
./  ../

Now let's import the WikiText2 module. Note that at this point I disconnected the network on purpose.

(helloworld) sungjin@ailab:~/wiki> python
Python 3.6.9 (default, Jul 17 2020, 12:50:27) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from torchtext.experimental.datasets import WikiText2
>>> train_dataset, test_dataset, valid_dataset = (
...     WikiText2(single_line=False))
Traceback (most recent call last):
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/urllib3/connection.py", line 160, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/urllib3/util/connection.py", line 61, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/urllib3/connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/urllib3/connectionpool.py", line 978, in _validate_conn
    conn.connect()
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/urllib3/connection.py", line 309, in connect
    conn = self._new_conn()
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/urllib3/connection.py", line 172, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f4e1ab98470>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/urllib3/connectionpool.py", line 727, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/urllib3/util/retry.py", line 439, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='s3.amazonaws.com', port=443): Max retries exceeded with url: /research.metamind.io/wikitext/wikitext-2-v1.zip (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f4e1ab98470>: Failed to establish a new connection: [Errno -2] Name or service not known',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/torchtext/experimental/datasets/language_modeling.py", line 146, in WikiText2
    return _setup_datasets(*(("WikiText2",) + args), **kwargs)
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/torchtext/experimental/datasets/language_modeling.py", line 88, in _setup_datasets
    train, test, valid = raw.DATASETS[dataset_name](root=root, data_select=('train', 'test', 'valid'))
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/torchtext/experimental/datasets/raw/language_modeling.py", line 113, in WikiText2
    return _setup_datasets(*(("WikiText2",) + args), **kwargs)
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/torchtext/experimental/datasets/raw/language_modeling.py", line 73, in _setup_datasets
    dataset_tar = download_from_url(URLS[dataset_name], root=root)
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/torchtext/utils.py", line 104, in download_from_url
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, stream=True)
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='s3.amazonaws.com', port=443): Max retries exceeded with url: /research.metamind.io/wikitext/wikitext-2-v1.zip (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f4e1ab98470>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Of course this error is intended: it shows where the module fetches the data from.

OK, so the file is downloaded from s3.amazonaws.com. Let's try again, this time with the network connection active.

(helloworld) sungjin@ailab:~/wiki> python
Python 3.6.9 (default, Jul 17 2020, 12:50:27) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from torchtext.experimental.datasets import WikiText2
>>> train_dataset, test_dataset, valid_dataset = (
...     WikiText2(single_line=False))
wikitext-2-v1.zip: 100%|███████████████████| 4.48M/4.48M [00:02<00:00, 1.56MB/s]
36718lines [00:00, 120232.44lines/s]

So we have a corpus of 36718 lines. I'll show that another way, too.

The downloaded corpus is located in the current working directory, under .data/.

(helloworld) sungjin@ailab:~/wiki> ls -a
./  ../  .data/
(helloworld) sungjin@ailab:~/wiki> ls .data
wikitext-2/  wikitext-2-v1.zip
(helloworld) sungjin@ailab:~/wiki> ls .data/wikitext-2
wiki.test.tokens  wiki.train.tokens  wiki.valid.tokens

Now let's investigate it with our old friend, wc.

(helloworld) sungjin@ailab:~/wiki> wc .data/wikitext-2/wiki.train.tokens 
   36718  2051910 10797148 .data/wikitext-2/wiki.train.tokens

Again, we can confirm that the corpus (more specifically, the training set) has 36718 lines.

Let's see what actual data looks like.

(helloworld) sungjin@ailab:~/wiki> head -8 .data/wikitext-2/wiki.train.tokens

 = Valkyria Chronicles III = 

 Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . <unk> the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " . 
 The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n . 
 It met with positive sales in Japan , and was praised by both Japanese and western critics . After release , it received downloadable content , along with an expanded edition in November of that year . It was also adapted into manga and an original video animation series . Due to low sales of Valkyria Chronicles II , Valkyria Chronicles III was not localized , but a fan translation compatible with the game 's expanded edition was released in 2014 . Media.Vision would return to the franchise with the development of Valkyria : Azure Revolution for the PlayStation 4 . 

 = = Gameplay = = 

To sum up, what I was trying to show is that the WikiText2 corpus consists of many lines, not a single line. So please reconsider. Thanks.

zhangguanheng66 commented 3 years ago

@daisylab Thanks for the thorough explanation. I have updated my previous comment and will consider this issue.