salesforce / awd-lstm-lm

LSTM and QRNN Language Model Toolkit for PyTorch
BSD 3-Clause "New" or "Revised" License

Low number of unique words predicted #21

Open mocialov opened 6 years ago

mocialov commented 6 years ago

I would like to perform a sanity check by passing some input to the model and reading the output text.

Following the PyTorch tutorial on language modelling (https://github.com/pytorch/examples/blob/master/word_language_model/generate.py), I have edited the evaluate function:

def evaluate(data_source, batch_size=10):
    # Turn on evaluation mode which disables dropout.
    if args.model == 'QRNN': model.reset()
    model.eval()
    total_loss = 0
    ntokens = len(corpus.dictionary)
    hidden = model.init_hidden(batch_size)
    for i in range(0, data_source.size(0) - 1, args.bptt):
        data, targets = get_batch(data_source, i, args, evaluation=True)

        # Print the input batch as words (debugging aid).
        print("inputs")
        inp = data.cpu().data.numpy()
        for input_ in inp:
            # Use idx here so the outer loop variable i is not shadowed.
            print([created_inverse_tokenizer_during_training[idx] for idx in input_])

        output, hidden = model(data, hidden)

        # Temperature-scaled, unnormalized weights; draw 10 indices per row.
        word_weights = output.squeeze().data.div(args.temperature).exp().cpu()
        word_idx = torch.multinomial(word_weights, 10)

        print("outputs")
        for word_ in word_idx:
            for item_ in word_:
                print("next word", created_inverse_tokenizer_during_training[item_])
            print("")

        output_flat = output.view(-1, ntokens)
        total_loss += len(data) * criterion(output_flat, targets).data
        hidden = repackage_hidden(hidden)
    return total_loss[0] / len(data_source)

Here, created_inverse_tokenizer_during_training is idx2word from the Dictionary class.
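For reference, the div(args.temperature).exp() step in the code above turns scores into unnormalized sampling weights. The following is a minimal sketch in plain Python of what that transformation does, assuming the scores are per-word logits. The function names and the four example scores are hypothetical and are not taken from the model or the repository.

```python
import math


def temperature_weights(scores, temperature):
    """Return unnormalized sampling weights exp(score / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return [math.exp(s / temperature) for s in scores]


def normalize(weights):
    """Scale weights so they sum to 1 (a sampler does this implicitly)."""
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical scores for a 4-word vocabulary (not real model output).
scores = [2.0, 1.0, 0.5, 0.0]

for t in (0.5, 1.0, 2.0):
    probs = normalize(temperature_weights(scores, t))
    # Lower temperatures put more mass on the highest-scoring word.
    print(t, [round(p, 3) for p in probs])
```

The weights only describe a distribution over words if the axis being exponentiated is the vocabulary axis, so the result depends on the shape of the tensor the model returns.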

I am testing on the PTB dataset. With a perplexity of approximately 60, I get the following:

inputs:
[made, value, $, their, intends, N, also, south, , or]
[much, criteria, N, office, to, return, closed, as, one, $]
[difference, devised, billion, visits, restrict, on, sharply, it, analyst, N]
[in, by, , as, the, assets, lower, became, peter, a]
[liquidity, benjamin, a, , rtc, for, across, more, , share]
[in, graham, , breaks, to, security, europe, clear, of, in]
[the, an, , , treasury, pacific, particularly, that, , the]
[pit, analyst, by, but, borrowings, and, in, a, &, fiscal]
[, and, an, massage, only, an, frankfurt, repeat, co., year]
[it, author, , no, unless, N, although, of, new, just]
["s", in, not, matter, the, N, london, the, york, ended]
[too, the, , how, agency, return, and, october, said, up]
[soon, 1930s, though, , receives, on, a, N, the, from]
[to, and, , is, specific, equity, few, crash, gold, $]
[tell, , , still, congressional, , other, was, market, N]
[but, who, english, associated, authorization, the, markets, "nt", already, million]
[people, is, butler, in, , loan, recovered, at, had, in]
[do, widely, in, many, such, growth, some, hand, some, fiscal]
["nt", considered, his, minds, agency, offset, ground, , good, N]
[seem, to, , with, , continuing, after, professionals, , and]
[to, be, proceeds, , borrowing, real-estate, stocks, dominated, technical, $]
[be, the, as, fronts, is, loan, began, municipal, factors, N]
[unhappy, father, if, for, unauthorized, losses, to, trading, that, million]
[with, of, the, , and, in, rebound, throughout, would, in]
[it, modern, realistic, and, expensive, the, in, the, have, N]

outputs:
[berlitz, hydro-quebec, banknote, centrust, gitano, cluett, guterman, aer, fromstein, calloway]
[berlitz, centrust, cluett, fromstein, aer, gitano, hydro-quebec, guterman, calloway, banknote]
[banknote, hydro-quebec, calloway, fromstein, berlitz, gitano, cluett, aer, guterman, centrust]
[calloway, berlitz, cluett, centrust, aer, gitano, hydro-quebec, banknote, guterman, fromstein]
[fromstein, hydro-quebec, aer, banknote, gitano, berlitz, calloway, cluett, centrust, guterman]
[calloway, hydro-quebec, guterman, fromstein, berlitz, banknote, cluett, centrust, gitano, aer]
[gitano, fromstein, hydro-quebec, cluett, calloway, centrust, berlitz, guterman, aer, banknote]
[berlitz, gitano, banknote, cluett, calloway, aer, centrust, fromstein, hydro-quebec, guterman]
[calloway, gitano, guterman, berlitz, centrust, hydro-quebec, cluett, aer, fromstein, banknote]
[hydro-quebec, berlitz, fromstein, gitano, cluett, calloway, aer, centrust, guterman, banknote]
[aer, cluett, fromstein, berlitz, guterman, calloway, hydro-quebec, centrust, banknote, gitano]
[cluett, calloway, centrust, fromstein, banknote, gitano, guterman, hydro-quebec, aer, berlitz]
[hydro-quebec, fromstein, calloway, aer, banknote, berlitz, cluett, gitano, centrust, guterman]
[banknote, gitano, aer, centrust, cluett, fromstein, calloway, guterman, hydro-quebec, berlitz]
[calloway, aer, gitano, berlitz, fromstein, cluett, guterman, banknote, hydro-quebec, centrust]
[banknote, cluett, fromstein, berlitz, gitano, aer, centrust, calloway, hydro-quebec, guterman]
[cluett, fromstein, aer, calloway, guterman, banknote, berlitz, gitano, centrust, hydro-quebec]
[aer, guterman, berlitz, gitano, centrust, cluett, calloway, hydro-quebec, fromstein, banknote]
[centrust, fromstein, cluett, berlitz, aer, banknote, guterman, gitano, calloway, hydro-quebec]
[guterman, banknote, fromstein, cluett, gitano, calloway, aer, centrust, berlitz, hydro-quebec]
[calloway, berlitz, aer, banknote, hydro-quebec, fromstein, cluett, guterman, gitano, centrust]
[banknote, hydro-quebec, berlitz, fromstein, guterman, calloway, cluett, centrust, gitano, aer]
[centrust, aer, fromstein, cluett, hydro-quebec, calloway, gitano, berlitz, guterman, banknote]
[fromstein, centrust, aer, banknote, berlitz, guterman, gitano, hydro-quebec, calloway, cluett]
[cluett, banknote, hydro-quebec, gitano, berlitz, fromstein, calloway, guterman, centrust, aer]

As you can see, the output contains very few unique words: every row is a permutation of the same ten. Why is that? Am I doing something wrong?
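One pattern in the output may be worth checking, offered as a guess rather than a diagnosis. torch.multinomial samples without replacement by default, so drawing 10 indices from a weight vector with only 10 entries always returns a permutation of 0 through 9. That is what would happen if the last dimension of word_weights were not the vocabulary axis (for example, an axis of size batch_size = 10). The sketch below reproduces the effect in plain Python. sample_without_replacement is a hypothetical stand-in, not PyTorch's implementation, and the weights are random placeholders.

```python
import random


def sample_without_replacement(weights, k, rng):
    """Draw k distinct indices; each draw is proportional to the remaining weights."""
    if k > sum(1 for w in weights if w > 0):
        raise ValueError("not enough categories with nonzero weight")
    remaining = list(range(len(weights)))
    picked = []
    for _ in range(k):
        current = [weights[i] for i in remaining]
        idx = rng.choices(remaining, weights=current, k=1)[0]
        picked.append(idx)
        remaining.remove(idx)
    return picked


rng = random.Random(0)

# Case 1: only 10 categories. Any 10 draws form a permutation of 0..9,
# whatever the weights are.
narrow = [rng.random() + 0.01 for _ in range(10)]
rows = [sample_without_replacement(narrow, 10, rng) for _ in range(25)]
print(all(sorted(r) == list(range(10)) for r in rows))  # prints True

# Case 2: a hypothetical 10,000-word vocabulary. The same 25 x 10 draws
# now cover many more than ten distinct indices.
wide = [rng.random() + 0.01 for _ in range(10000)]
rows_wide = [sample_without_replacement(wide, 10, rng) for _ in range(25)]
unique = {i for r in rows_wide for i in r}
print(len(unique))
```

A quick way to test this guess would be to print word_weights.size() just before sampling and confirm that its last dimension equals ntokens.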

andrewPoulton commented 5 years ago

Probably related to this