Open mocialov opened 6 years ago
I would like to perform a sanity check by passing some input to the model and reading the output text.
Following the PyTorch tutorial on language modelling (https://github.com/pytorch/examples/blob/master/word_language_model/generate.py), I have edited the evaluate function:
```python
def evaluate(data_source, batch_size=10):
    # Turn on evaluation mode which disables dropout.
    if args.model == 'QRNN': model.reset()
    model.eval()
    total_loss = 0
    ntokens = len(corpus.dictionary)
    hidden = model.init_hidden(batch_size)
    for i in range(0, data_source.size(0) - 1, args.bptt):
        data, targets = get_batch(data_source, i, args, evaluation=True)
        print("inputs")
        inp = data.cpu().data.numpy()
        for input_ in inp:
            print([created_inverse_tokenizer_during_training[i] for i in input_])
        output, hidden = model(data, hidden)
        word_weights = output.squeeze().data.div(args.temperature).exp().cpu()
        word_idx = torch.multinomial(word_weights, 10)
        print("outputs")
        for word_ in word_idx:
            for item_ in word_:
                print("next word", created_inverse_tokenizer_during_training[item_])
            print("")
        output_flat = output.view(-1, ntokens)
        total_loss += len(data) * criterion(output_flat, targets).data
        hidden = repackage_hidden(hidden)
    return total_loss[0] / len(data_source)
```

where `created_inverse_tokenizer_during_training` is `idx2word` from the `Dictionary` class.
I am testing on the PTB dataset and, at a perplexity of approximately 60, I get the following:
inputs:
[made, value, $, their, intends, N, also, south, , or]
[much, criteria, N, office, to, return, closed, as, one, $]
[difference, devised, billion, visits, restrict, on, sharply, it, analyst, N]
[in, by, , as, the, assets, lower, became, peter, a]
[liquidity, benjamin, a, , rtc, for, across, more, , share]
[in, graham, , breaks, to, security, europe, clear, of, in]
[the, an, , , treasury, pacific, particularly, that, , the]
[pit, analyst, by, but, borrowings, and, in, a, &, fiscal]
[, and, an, massage, only, an, frankfurt, repeat, co., year]
[it, author, , no, unless, N, although, of, new, just]
["s", in, not, matter, the, N, london, the, york, ended]
[too, the, , how, agency, return, and, october, said, up]
[soon, 1930s, though, , receives, on, a, N, the, from]
[to, and, , is, specific, equity, few, crash, gold, $]
[tell, , , still, congressional, , other, was, market, N]
[but, who, english, associated, authorization, the, markets, "nt", already, million]
[people, is, butler, in, , loan, recovered, at, had, in]
[do, widely, in, many, such, growth, some, hand, some, fiscal]
["nt", considered, his, minds, agency, offset, ground, , good, N]
[seem, to, , with, , continuing, after, professionals, , and]
[to, be, proceeds, , borrowing, real-estate, stocks, dominated, technical, $]
[be, the, as, fronts, is, loan, began, municipal, factors, N]
[unhappy, father, if, for, unauthorized, losses, to, trading, that, million]
[with, of, the, , and, in, rebound, throughout, would, in]
[it, modern, realistic, and, expensive, the, in, the, have, N]
outputs:
[berlitz, hydro-quebec, banknote, centrust, gitano, cluett, guterman, aer, fromstein, calloway]
[berlitz, centrust, cluett, fromstein, aer, gitano, hydro-quebec, guterman, calloway, banknote]
[banknote, hydro-quebec, calloway, fromstein, berlitz, gitano, cluett, aer, guterman, centrust]
[calloway, berlitz, cluett, centrust, aer, gitano, hydro-quebec, banknote, guterman, fromstein]
[fromstein, hydro-quebec, aer, banknote, gitano, berlitz, calloway, cluett, centrust, guterman]
[calloway, hydro-quebec, guterman, fromstein, berlitz, banknote, cluett, centrust, gitano, aer]
[gitano, fromstein, hydro-quebec, cluett, calloway, centrust, berlitz, guterman, aer, banknote]
[berlitz, gitano, banknote, cluett, calloway, aer, centrust, fromstein, hydro-quebec, guterman]
[calloway, gitano, guterman, berlitz, centrust, hydro-quebec, cluett, aer, fromstein, banknote]
[hydro-quebec, berlitz, fromstein, gitano, cluett, calloway, aer, centrust, guterman, banknote]
[aer, cluett, fromstein, berlitz, guterman, calloway, hydro-quebec, centrust, banknote, gitano]
[cluett, calloway, centrust, fromstein, banknote, gitano, guterman, hydro-quebec, aer, berlitz]
[hydro-quebec, fromstein, calloway, aer, banknote, berlitz, cluett, gitano, centrust, guterman]
[banknote, gitano, aer, centrust, cluett, fromstein, calloway, guterman, hydro-quebec, berlitz]
[calloway, aer, gitano, berlitz, fromstein, cluett, guterman, banknote, hydro-quebec, centrust]
[banknote, cluett, fromstein, berlitz, gitano, aer, centrust, calloway, hydro-quebec, guterman]
[cluett, fromstein, aer, calloway, guterman, banknote, berlitz, gitano, centrust, hydro-quebec]
[aer, guterman, berlitz, gitano, centrust, cluett, calloway, hydro-quebec, fromstein, banknote]
[centrust, fromstein, cluett, berlitz, aer, banknote, guterman, gitano, calloway, hydro-quebec]
[guterman, banknote, fromstein, cluett, gitano, calloway, aer, centrust, berlitz, hydro-quebec]
[calloway, berlitz, aer, banknote, hydro-quebec, fromstein, cluett, guterman, gitano, centrust]
[banknote, hydro-quebec, berlitz, fromstein, guterman, calloway, cluett, centrust, gitano, aer]
[centrust, aer, fromstein, cluett, hydro-quebec, calloway, gitano, berlitz, guterman, banknote]
[fromstein, centrust, aer, banknote, berlitz, guterman, gitano, hydro-quebec, calloway, cluett]
[cluett, banknote, hydro-quebec, gitano, berlitz, fromstein, calloway, guterman, centrust, aer]
As you can see, the number of unique words in the output is very small: every output row contains the same ten words, only in a different order. Why is that? Am I doing something wrong?
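One thing that might be worth checking, offered as a guess rather than a diagnosis: `torch.multinomial` samples without replacement by default, so if each row of `word_weights` had only ten entries (instead of one entry per vocabulary word), drawing ten samples per row could only ever return a permutation of the indices 0 to 9. The sketch below imitates that behaviour with the standard library only, so it runs without PyTorch. The helper `sample_without_replacement` and the sizes used are hypothetical stand-ins, not taken from the model in this issue.

```python
import random


def sample_without_replacement(weights, k, rng):
    """Draw k distinct indices; each draw is proportional to the remaining weights.

    Hypothetical stand-in for sampling without replacement from one weight row.
    """
    remaining = list(range(len(weights)))
    picked = []
    for _ in range(k):
        idx = rng.choices(remaining, weights=[weights[i] for i in remaining], k=1)[0]
        picked.append(idx)
        remaining.remove(idx)
    return picked


rng = random.Random(0)

# Case 1 (assumed failure mode): every weight row has only 10 entries.
narrow_rows = [[rng.random() + 1e-3 for _ in range(10)] for _ in range(5)]
narrow_draws = [sample_without_replacement(row, 10, rng) for row in narrow_rows]

# Case 2: one weight row with an entry per vocabulary word (size is illustrative).
vocab_size = 10000
wide_row = [rng.random() + 1e-3 for _ in range(vocab_size)]
wide_draw = sample_without_replacement(wide_row, 10, rng)

for draw in narrow_draws:
    # Always the same ten indices, just reordered.
    print(sorted(draw) == list(range(10)), draw)
print(wide_draw)
```

If this is what is happening, printing `output.size()` and `word_weights.size()` inside the loop and comparing the last dimension with `ntokens` should show it.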
Probably related to this