pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

Subword-level tokenization error while building vocab #162

Closed · keon closed this issue 7 years ago

keon commented 7 years ago

I want to analyze the IMDB dataset at the subword (character) level, so I tried the following:

from torchtext import data, datasets

TEXT = data.SubwordField(fix_length=100)
LABEL = data.Field(sequential=False)
train, test = datasets.IMDB.splits(TEXT, LABEL)
TEXT.build_vocab(train)
LABEL.build_vocab(train)
train_iter, test_iter = data.BucketIterator.splits(
        (train, test), batch_size=1, repeat=False)

The code above gives me:

enumerating ngrams:   0%|          | 0/76292 [00:00<?, ?it/s]
For faster subwords, please install Julia 0.6, pyjulia, and Revtok.jl. Falling back to Python implementation...
enumerating ngrams: 100%|██████████| 76292/76292 [00:18<00:00, 4036.56it/s]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-cec2da38fe78> in <module>()
----> 1 TEXT.build_vocab(train)
      2 LABEL.build_vocab(train)
      3 train_iter, test_iter = data.BucketIterator.splits(
      4         (train, test), batch_size=1, repeat=False)

/usr/local/lib/python3.5/dist-packages/torchtext-0.2.0b0-py3.5.egg/torchtext/data/field.py in build_vocab(self, *args, **kwargs)
    244                             self.eos_token]
    245             if tok is not None))
--> 246         self.vocab = self.vocab_cls(counter, specials=specials, **kwargs)
    247 
    248     def numericalize(self, arr, device=None, train=True):

/usr/local/lib/python3.5/dist-packages/torchtext-0.2.0b0-py3.5.egg/torchtext/vocab.py in __init__(self, counter, max_size, specials, vectors, unk_init, expand_vocab)
    189         self.itos = specials
    190 
--> 191         self.segment = revtok.SubwordSegmenter(counter, max_size)
    192 
    193         max_size = None if max_size is None else max_size + len(self.itos)

/usr/local/lib/python3.5/dist-packages/revtok-0.0.1-py3.5.egg/revtok/subwords.py in __init__(self, counter, max_size, force_python)
     93         ngrams.sort(key=attrgetter('text'))
     94         key = attrgetter('entropy')
---> 95         for i in tqdm(range(max_size - len(self.vocab)), 'building subword vocab'):
     96             ngrams.sort(key=key, reverse=True)
     97             best = ngrams[0]

TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'

I am using freshly fetched versions of pytorch/text and revtok.
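
(For context, the traceback shows that max_size defaults to None, and revtok's SubwordSegmenter then computes max_size - len(self.vocab), which fails on None. A possible workaround, sketched here only under the assumption that build_vocab forwards keyword arguments to the vocab class as the traceback suggests, is to pass an explicit max_size; this has not been verified against this torchtext version.)

# Hypothetical workaround (untested): supply an explicit subword vocabulary
# size so revtok's SubwordSegmenter receives an int rather than None.
TEXT = data.SubwordField(fix_length=100)
TEXT.build_vocab(train, max_size=25000)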

nelson-liu commented 7 years ago

If you just want a pure character level, pass the built-in list function to the Field constructor's tokenize parameter (SubwordField isn't necessary if you just want characters).

keon commented 7 years ago

Like you said, passing just the list function solved the problem :)

from torchtext import data, datasets

TEXT = data.Field(fix_length=100, tokenize=list)
LABEL = data.Field(sequential=False)
train, test = datasets.IMDB.splits(TEXT, LABEL)
TEXT.build_vocab(train)
LABEL.build_vocab(train)
train_iter, test_iter = data.BucketIterator.splits(
        (train, test), batch_size=1, repeat=False)
for b, batch in enumerate(train_iter):
    x, y = batch.text, batch.label
    # decode the first example in the batch back into its characters
    sample = "".join(TEXT.vocab.itos[s[0]] for s in x.data)
    print(sample, len(sample))
    break

which gives:

Since musicals have both gone out of fashion and are incredibly expensive to make without all the ta 100

which is perfect. Thanks!