senarvi / theanolm

TheanoLM is a recurrent neural network language modeling tool implemented using Theano
Apache License 2.0

Score individual sentences in a python script #27

Closed: chrisjbryant closed this issue 7 years ago

chrisjbryant commented 7 years ago

Hi,

I've just started using TheanoLM and was wondering whether it's possible to preload a trained model and query it at the sentence level from within a Python script (i.e. not from the command line). A simple example:

import theanolm
model = theanolm.load("/path/model.h5")
sent = "This is a sentence ."
prob = model.score(sent)

Whether it returns scores in terms of log probability or perplexity doesn't matter to me as long as there's some way to determine whether one sentence is "better" than another.

Hope you can help.
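For readers comparing the two measures mentioned above, here is a minimal sketch of how a sentence log probability relates to perplexity and how either can be used to pick the "better" sentence. It assumes natural-log probabilities and a token count that matches whatever the model actually predicts; the helper names and the example scores are made up for illustration and are not TheanoLM output.

```python
import math


def perplexity(logprob, num_tokens):
    """Convert a natural-log sentence probability to per-token perplexity."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return math.exp(-logprob / num_tokens)


def better_sentence(scored):
    """Return the sentence with the lowest perplexity.

    `scored` maps each sentence to a (logprob, num_tokens) pair.
    """
    return min(scored, key=lambda s: perplexity(*scored[s]))


# Made-up scores for illustration only; not produced by any model.
example = {
    "This is a sentence .": (-12.0, 5),
    "Sentence a is this .": (-20.0, 5),
}
print(better_sentence(example))
```

Perplexity normalizes by length, so it is the safer choice when the sentences being compared differ in token count; raw log probability tends to favor shorter sentences.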

senarvi commented 7 years ago

I don't see any problem with that. But there's no such simple interface for loading a model, because I didn't design it to be called from Python. Basically you would need to do the same thing that is done in the score() function in theanolm/commands/score.py, i.e. create the objects that it sets up.

After creating those objects, you can compute the log probability of a sentence. As an example, see the _score_utterances() function in the same file.

If that's something that would be useful in general, it could be encapsulated in a class (e.g. theanolm.Model) as you suggested.

chrisjbryant commented 7 years ago

Great, thanks! I figured it'd be possible, but haven't spent too much time looking at the code yet, so your "recipe" helps a lot. I do think this would be a useful feature in general. I'm currently using KenLM, which has a Python wrapper for exactly this purpose, so it'd be great if there was something similar for your NN models.

senarvi commented 7 years ago

I agree, people could find that useful, and it would be very easy to do. If you decide to implement such a class in TheanoLM, please send a pull request. Most of the code can be copied from theanolm/commands/score.py. Afterwards, both the score and decode commands can be changed to use that class, in order to avoid duplicate code. Otherwise, I can try to do that when I have some extra time, and let you test it.

chrisjbryant commented 7 years ago

As you're most familiar with the code, I'd guess it'd probably be faster and less painful if you could do it. That said, I'll also take a look and see if I can figure something out sooner. If I didn't, I'd feel bad for giving you extra work!

senarvi commented 7 years ago

After looking at the code, the solution turned out to be even easier. Could you check out the develop branch and see whether it works? I wrote short instructions here: https://github.com/senarvi/theanolm/blob/develop/docs/calling.rst

Please let me know if something isn't working or if the instructions are incorrect.

chrisjbryant commented 7 years ago

Thanks for looking at this so quickly. There was a small bug in textscorer.py, but after I fixed it, it works! The bug is on line 255, where the method signature is missing self: def score_line(line, vocabulary): should be def score_line(self, line, vocabulary):
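To show why the missing self matters, the sketch below reproduces the same kind of mistake in isolation. The two classes are hypothetical stand-ins, not TheanoLM code: Python passes the instance as the first positional argument, so a method declared without self receives one argument too many and raises TypeError.

```python
class BrokenScorer:
    # Missing `self`: the instance is still passed as the first argument,
    # so a call with two arguments actually supplies three.
    def score_line(line, vocabulary):
        return 0.0


class FixedScorer:
    def score_line(self, line, vocabulary):
        return 0.0


def call_raises_type_error(scorer):
    """Return True if calling score_line on the instance raises TypeError."""
    try:
        scorer.score_line("This is a sentence .", None)
    except TypeError:
        return True
    return False


print(call_raises_type_error(BrokenScorer()))
print(call_raises_type_error(FixedScorer()))
```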

Here's what I ran:

import theanolm
import theanolm.scoring

path = "/path/model.h5"  # path to the trained model file
model = theanolm.Network.from_file(path)
scorer = theanolm.scoring.TextScorer(model, ignore_unk=True) # False also worked as intended
score = scorer.score_line("This is a sentence .", model.vocabulary)

I also got the same result from theanolm score <model> <text> --output utterance-scores, so it seems to be working perfectly.
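For anyone who wants the one-line interface asked about at the top of the thread, the calls above can be wrapped in a small helper. This is only a sketch: SentenceScorer and load_sentence_scorer are hypothetical names that do not exist in TheanoLM, and the ranking assumes that score_line returns a log probability, so a higher score is better. The only TheanoLM calls used are the ones shown in this thread.

```python
class SentenceScorer:
    """Hypothetical convenience wrapper; not part of TheanoLM."""

    def __init__(self, scorer, vocabulary):
        # `scorer` is any object with a score_line(line, vocabulary) method.
        self._scorer = scorer
        self._vocabulary = vocabulary

    def score(self, sentence):
        """Return the score of a single tokenized sentence."""
        return self._scorer.score_line(sentence, self._vocabulary)

    def rank(self, sentences):
        """Return sentences sorted from highest to lowest score."""
        return sorted(sentences, key=self.score, reverse=True)


def load_sentence_scorer(path, ignore_unk=True):
    """Build a SentenceScorer from a trained model file."""
    # Imported lazily so the wrapper itself does not require TheanoLM.
    import theanolm
    import theanolm.scoring

    model = theanolm.Network.from_file(path)
    scorer = theanolm.scoring.TextScorer(model, ignore_unk=ignore_unk)
    return SentenceScorer(scorer, model.vocabulary)
```

With a trained model, usage would look like load_sentence_scorer("/path/model.h5").rank([...]). Because the scorer is passed in rather than created inside the class, the wrapper can be exercised with a stub object when TheanoLM is not installed.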

senarvi commented 7 years ago

Great! I'll fix that and create a new release.

chrisjbryant commented 7 years ago

Nice! I see in the docs you're still using the quick commands you gave me, though. One error in those is that you have "network" in the 2nd line, but you actually want "model" from the first line. It's also not immediately obvious that you have to import theanolm.scoring separately in order to make the TextScorer, so I'd recommend including the import statements in the docs. It's probably easiest to just add the other comments around the minimum working example I gave in my previous post.

senarvi commented 7 years ago

Thanks! I'll update the docs.

chrisjbryant commented 7 years ago

Another small bug: You need to update pip with the latest version again. The pip download currently can't find TextScorer because it doesn't have the __init__.py change.

senarvi commented 7 years ago

I did that. Let me know if there are more problems.

chrisjbryant commented 7 years ago

Will do, but think it's all good now. Thanks again for your help!