Open · daisylab opened 4 years ago

The single_line variable is applicable to WikiText103 only for now.

@daisylab Could you explain here what you propose to change, or the behavior you want?
OK, let me explain this in detail.
First, check the environment. I use Ubuntu 18.04, with Python installed from .deb packages and most of the Python packages installed via pip.
And of course I have PyTorch and torchtext.
(helloworld) sungjin@ailab:~> cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.5 LTS"
(helloworld) sungjin@ailab:~> python -V
Python 3.6.9
(helloworld) sungjin@ailab:~> pip list | grep torch
torch 1.6.0+cpu
torchtext 0.7.0
torchvision 0.7.0+cpu
Next, make a fresh directory to work in. (Please don't mind the names of the virtualenv, my user account, and my lab.)
(helloworld) sungjin@ailab:~> mkdir wiki
(helloworld) sungjin@ailab:~> cd wiki
(helloworld) sungjin@ailab:~/wiki> ls -a
./ ../
Now, import the WikiText2 module. Note that at this point I disconnected the network on purpose.
(helloworld) sungjin@ailab:~/wiki> python
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from torchtext.experimental.datasets import WikiText2
>>> train_dataset, test_dataset, valid_dataset = (
... WikiText2(single_line=False))
Traceback (most recent call last):
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/urllib3/connection.py", line 160, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/urllib3/util/connection.py", line 61, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/urllib3/connectionpool.py", line 677, in urlopen
chunked=chunked,
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/urllib3/connectionpool.py", line 381, in _make_request
self._validate_conn(conn)
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/urllib3/connectionpool.py", line 978, in _validate_conn
conn.connect()
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/urllib3/connection.py", line 309, in connect
conn = self._new_conn()
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/urllib3/connection.py", line 172, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f4e1ab98470>: Failed to establish a new connection: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/urllib3/connectionpool.py", line 727, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/urllib3/util/retry.py", line 439, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='s3.amazonaws.com', port=443): Max retries exceeded with url: /research.metamind.io/wikitext/wikitext-2-v1.zip (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f4e1ab98470>: Failed to establish a new connection: [Errno -2] Name or service not known',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/torchtext/experimental/datasets/language_modeling.py", line 146, in WikiText2
return _setup_datasets(*(("WikiText2",) + args), **kwargs)
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/torchtext/experimental/datasets/language_modeling.py", line 88, in _setup_datasets
train, test, valid = raw.DATASETS[dataset_name](root=root, data_select=('train', 'test', 'valid'))
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/torchtext/experimental/datasets/raw/language_modeling.py", line 113, in WikiText2
return _setup_datasets(*(("WikiText2",) + args), **kwargs)
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/torchtext/experimental/datasets/raw/language_modeling.py", line 73, in _setup_datasets
dataset_tar = download_from_url(URLS[dataset_name], root=root)
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/torchtext/utils.py", line 104, in download_from_url
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, stream=True)
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/requests/api.py", line 76, in get
return request('get', url, params=params, **kwargs)
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "/home/sungjin/.virtualenvs/helloworld/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='s3.amazonaws.com', port=443): Max retries exceeded with url: /research.metamind.io/wikitext/wikitext-2-v1.zip (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f4e1ab98470>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Of course this error is intentional; it shows where the module fetches the data from.
OK, so the file is downloaded from s3.amazonaws.com. Let's try again, this time with the network connection enabled.
(helloworld) sungjin@ailab:~/wiki> python
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from torchtext.experimental.datasets import WikiText2
>>> train_dataset, test_dataset, valid_dataset = (
... WikiText2(single_line=False))
wikitext-2-v1.zip: 100%|███████████████████| 4.48M/4.48M [00:02<00:00, 1.56MB/s]
36718lines [00:00, 120232.44lines/s]
So we have a corpus of 36718 lines. Let me show that in another way, too.
The downloaded corpus is located under the current working directory.
(helloworld) sungjin@ailab:~/wiki> ls -a
./ ../ .data/
(helloworld) sungjin@ailab:~/wiki> ls .data
wikitext-2/ wikitext-2-v1.zip
(helloworld) sungjin@ailab:~/wiki> ls .data/wikitext-2
wiki.test.tokens wiki.train.tokens wiki.valid.tokens
Now let's investigate it. Here comes our old friend, wc.
(helloworld) sungjin@ailab:~/wiki> wc .data/wikitext-2/wiki.train.tokens
36718 2051910 10797148 .data/wikitext-2/wiki.train.tokens
Again, we can confirm that the corpus (more specifically, the training set) has 36718 lines.
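As a side note, the same counts can be reproduced in Python. This is a small standalone sketch (not part of torchtext) that mimics wc's line/word/byte counting:

```python
def wc(path):
    """Return (lines, words, bytes) for a file, like the `wc` utility."""
    lines = words = nbytes = 0
    with open(path, "rb") as f:
        for raw in f:          # iterate over newline-terminated chunks
            lines += 1
            nbytes += len(raw)
            words += len(raw.split())  # whitespace-separated tokens
    return lines, words, nbytes
```

Running `wc('.data/wikitext-2/wiki.train.tokens')` should report the same 36718 lines as the shell command above (assuming the file ends with a newline, so line counts match wc's newline-counting convention).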
Let's see what actual data looks like.
(helloworld) sungjin@ailab:~/wiki> head -8 .data/wikitext-2/wiki.train.tokens
= Valkyria Chronicles III =
Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . <unk> the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " .
The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n .
It met with positive sales in Japan , and was praised by both Japanese and western critics . After release , it received downloadable content , along with an expanded edition in November of that year . It was also adapted into manga and an original video animation series . Due to low sales of Valkyria Chronicles II , Valkyria Chronicles III was not localized , but a fan translation compatible with the game 's expanded edition was released in 2014 . Media.Vision would return to the franchise with the development of Valkyria : Azure Revolution for the PlayStation 4 .
= = Gameplay = =
To sum up, what I was trying to show is that the WikiText2 corpus has many lines, not a single line. So please reconsider your position. Thanks.
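As I understand the intent, single_line=True would collapse the whole corpus into one token stream, while single_line=False would preserve the line structure shown above. Here is a rough pure-Python sketch of that difference (my interpretation for illustration, not actual torchtext code):

```python
def load_corpus(lines, single_line):
    """Hypothetical illustration of the single_line flag:
    True  -> the corpus becomes one long string,
    False -> each line of the corpus is kept as a separate entry."""
    if single_line:
        return [" ".join(line.strip() for line in lines)]
    return [line.strip() for line in lines]

corpus = ["= Valkyria Chronicles III =\n",
          "The game began development in 2010 .\n"]
print(len(load_corpus(corpus, single_line=False)))  # one entry per line
print(len(load_corpus(corpus, single_line=True)))   # a single entry
```

With a 36718-line corpus like wiki.train.tokens, the two settings clearly describe very different shapes of data.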
@daisylab Thanks for the thorough explanation. I have updated my previous comment and will look into this issue.
Hi. This is about torchtext 0.7.0 (installed from pip).
According to the documentation (https://pytorch.org/text/experimental_datasets.html#wikitext-2) and the docstring, the experimental WikiText2 has a parameter called 'single_line'. The usage says:
However, in the actual code this raises an error for WikiText2 (actually, for all datasets except WikiText103). That is, the code contains the following check:
if not single_line and dataset_name != 'WikiText103':
    raise TypeError('single_line must be True except for WikiText103')
I think this is not a big deal, and it may already be fixed. However, at least for me, right now, it makes my code a little more complicated.
I hope this subtle issue will be resolved soon. :)