Closed keon closed 7 years ago
If you just want pure character-level tokenization, pass the built-in list function as the Field constructor’s tokenize parameter; SubwordField isn’t necessary for that.
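For anyone wondering why this works: as far as I understand it, tokenize only needs to be a callable that takes a string and returns a list of tokens, and the built-in list does exactly that for characters. A minimal sketch in plain Python, with no torchtext involved (char_tokenize is just a name made up for illustration):

```python
# Hypothetical name, for illustration only; it is simply the built-in list.
char_tokenize = list

# Calling list on a string splits it into single-character tokens,
# spaces included.
tokens = char_tokenize("Since musicals")
print(tokens)

# Joining the tokens restores the original string, so nothing is lost.
restored = "".join(tokens)
print(restored == "Since musicals")
```

Spaces and punctuation become tokens of their own, which is usually what you want for a character-level model.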
Like you said, passing just the list function solved the problem :)
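from torchtext import data, datasets

# tokenize=list splits each review into single characters;
# fix_length=100 pads or truncates every example to 100 tokens
TEXT = data.Field(fix_length=100, tokenize=list)
LABEL = data.Field(sequential=False)

train, test = datasets.IMDB.splits(TEXT, LABEL)
TEXT.build_vocab(train)
LABEL.build_vocab(train)

train_iter, test_iter = data.BucketIterator.splits(
    (train, test), batch_size=1, repeat=False)

for b, batch in enumerate(train_iter):
    x, y = batch.text, batch.label
    # batch_size is 1, so s[0] is the only token index at each position
    sample = "".join(TEXT.vocab.itos[s[0]] for s in x.data)
    print(sample, len(sample))
    break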
gives
Since musicals have both gone out of fashion and are incredibly expensive to make without all the ta 100
which is perfect. Thanks!
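In case it helps anyone reading later, the printed length of 100 comes from fix_length=100. My understanding is that the Field truncates longer examples and pads shorter ones at the end, but I have not checked the torchtext source for this, so treat the following as a rough sketch of that behaviour rather than the library's actual code. pad_or_truncate is a hypothetical helper, and the two inputs are made-up data:

```python
def pad_or_truncate(tokens, fix_length, pad_token="<pad>"):
    """Hypothetical helper: cut to fix_length tokens, or pad at the end."""
    clipped = tokens[:fix_length]
    return clipped + [pad_token] * (fix_length - len(clipped))

long_review = list("ab" * 80)   # 160 characters, made-up data
short_review = list("hello")    # 5 characters, made-up data

long_out = pad_or_truncate(long_review, 100)
short_out = pad_or_truncate(short_review, 100)

# Both come out with exactly 100 tokens
print(len(long_out), len(short_out))
```

That would also explain why the sample above stops mid-word at "ta": the review is simply cut off after its first 100 characters.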
I want to analyze the IMDB dataset at the subword (character) level, so I tried the following:
the code above gives me
I used freshly fetched versions of pytorch/text and revtok.